BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs

Yuzong Chen and Mohamed S. Abdelfattah
Department of Electrical and Computer Engineering, Cornell University
{yc2367, mohamed}@cornell.edu
arXiv:2304.03974v1 [cs.AR] 8 Apr 2023

Abstract—Deep neural network (DNN) inference using reduced integer precision has been shown to achieve significant improvements in memory utilization and compute throughput with little or no accuracy loss compared to full-precision floating-point. Nevertheless, most FPGA vendors' DSP blocks do not natively support low-precision multiplication.

I. INTRODUCTION

FPGA DNN accelerators rely on block random access memory (BRAM) for model storage and digital signal processing (DSP) units for implementing multiply-accumulate (MAC)—the fundamental primitive in DNNs.

…enabling efficient tiling-based DNN acceleration.

4) We quantify the benefits of employing BRAMAC in a tiled FPGA DNN accelerator, which achieves up to 2.04× and 1.52× performance improvements for AlexNet and ResNet-34, respectively, over the baseline accelerator without BRAMAC.

Fig. 1: Top-level block diagram of BRAMAC modified from Intel's M20K BRAM. New circuit blocks are orange-shaded. (The figure shows the 128×160 main BRAM array with its row/column decoders, 4:1 column muxes, sense amplifiers, and write drivers, plus the added 7×160 dual-port dummy BRAM array, configurable sign-extension mux, precision-configurable adder, and embedded FSM (eFSM).)

II. RELATED WORK

In this section, we discuss previous work that targeted efficient MAC implementation on FPGAs, including logic block, DSP, and BRAM enhancements.

A. Logic Block with Fast Arithmetic

To efficiently implement arithmetic operations in soft logic, modern FPGAs contain hardened adder circuitry in their logic blocks (LBs) [19]. These adders range from simple ripple-carry adders to more complex variants such as carry-bypass adders and carry-lookahead adders. In order to reduce the carry propagation delay, dedicated routing is used to propagate carry signals between different LBs. Inspired by the superior efficiency of adopting low precision in DNNs, recent research started to investigate adding more hardened arithmetic to LBs. For example, Eldafrawy et al. [20] proposed three LB architectural enhancements to improve the performance of MAC implemented in soft logic. Their most promising proposal increases the MAC density by 1.7× while simultaneously improving the MAC speed.
B. Low-Precision DSP Architectures

Modern commercial FPGAs include DSP blocks that implement efficient multiplication with additional features such as pre-addition and accumulation commonly used in signal processing applications [19]. Nevertheless, most FPGA vendors' DSP multipliers have a minimum precision of 18-bit, making them less competitive in accelerating low-precision DNNs. To address this limitation, researchers have proposed new DSP architectures to support low-precision MAC. Boutros et al. [15] introduced an enhanced Intel DSP (eDSP) that supports four 9-bit or eight 4-bit multiplications without using additional routing ports. Rasoulinezhad et al. [16] presented a modified Xilinx DSP, called PIR-DSP, that can carry out six 9-bit, twelve 4-bit, or twenty-four 2-bit multiplications. Regarding industry DSP trends, the recent Xilinx Versal and Intel Agilex devices added support for 8-bit multiplication in their DSP blocks [21], [22]. In addition, Intel's latest Stratix-10 NX device added a new DSP block (called AI tensor block) that contains 30 INT8 multipliers and can also be configured as 60 INT4 multipliers [23].

C. Computing In-BRAM

With the emergence of CIM to overcome the von-Neumann bottleneck [24], some FPGA researchers suggest augmenting existing BRAM architectures with compute capability. Wang et al. [17] proposed a compute-capable BRAM (CCB) that uses bit-serial arithmetic to enable a high degree of computation parallelism. However, the circuit implementation of CCB requires an additional voltage supply to mitigate the read-disturb issue associated with activating two word-lines from one BRAM port, which is challenging to implement in practice. Arora et al. [18] later designed a new compute-in-BRAM architecture called CoMeFa to overcome some limitations of CCB. CoMeFa also relies on bit-serial arithmetic but exploits the dual-port nature of BRAM to read out two operands from two ports, respectively, instead of activating two word-lines from one port, thus eliminating the read-disturb issue.

Both CCB and CoMeFa require a transposed data layout for bit-serial computation, i.e., each word occupies one column and multiple rows instead of one row and multiple columns as in a conventional data layout. However, transposing data is expensive in both latency and additional hardware cost (e.g., a swizzle module in CoMeFa) for online execution. Furthermore, these two BRAM architectures compute directly on the main BRAM array and receive the CIM instruction through a BRAM write port—this prevents tiling. As a result, these two works are limited to accelerating only persistent-style DNN inference where the model weights are transposed offline and remain persistent in the on-chip memory. Different from CCB and CoMeFa, BRAMAC adopts a hybrid bit-serial & bit-parallel MAC dataflow that eliminates the requirement of a transposed data layout. In addition, BRAMAC doesn't compute on the main BRAM array, which is typically large and therefore slow and power-hungry. Rather, it copies the main BRAM's data to a special, separate dummy BRAM array for computation. This dummy array has only 7 rows and therefore can be accessed much faster than the main BRAM array. It also frees up the read and write ports of the main BRAM during CIM to allow tiling-based DNN acceleration.

III. BRAMAC ARCHITECTURE AND DATAFLOW

A. Overall Architecture

Fig. 1 shows the top-level block diagram of BRAMAC modified from Intel's M20K BRAM [25] with the added circuit blocks orange-shaded. The routing interface (i.e., input and output crossbar) of BRAMAC is the same as that of M20K.
The main BRAM array's dimension is 128-row × 160-column, i.e., 20 kb memory capacity. The 4:1 column multiplexing feature of M20K is preserved. One additional SRAM cell is added to select one of the two operation modes of BRAMAC:

1) MEM: In this memory mode, the behavior of BRAMAC is identical to that of a conventional M20K. The input crossbar sends the address and data to portA and portB. For memory reads, the two addresses are decoded by the row and column decoders. The 40-bit BRAM output data from the sense amplifiers is sent to the output crossbar. For memory writes, the data is sent to the write drivers for updating the main BRAM.

2) CIM: This is the compute mode where BRAMAC can compute MAC2, P = (W1·I1 + W2·I2), using 2-bit, 4-bit, or 8-bit operand precision. The two groups of operands, (W1, W2) and (I1, I2), can be thought of as weights and inputs of a DNN in the remainder of this paper, respectively. At a high level, BRAMAC computes MAC2 by keeping weights inside BRAMAC while streaming inputs from outside.

The main BRAM is automatically configured as a simple dual-port memory with a maximum data width of 40-bit and a depth of 512 to maximize the read/write throughput. A special address (0xfff) is reserved and compared with the portA address; if equal, the 40-bit portA data is treated as a CIM instruction. The CIM instruction contains two addresses for reading two 40-bit data from the main BRAM, respectively. Each 40-bit data is a vector that contains multiple low-precision W1/W2 elements. The configurable sign-extension mux sign-extends the 40-bit vectors to 160-bit before copying them to a dummy BRAM array, which is a 7-row × 160-column true dual-port BRAM without the column multiplexing feature. The CIM instruction also contains the two inputs, I1 and I2, and several control signals that are sent to an eFSM to trigger and control the MAC2 operation. The precision-configurable adder can read two 160-bit vectors from the dummy array, perform a single-instruction-multiple-data (SIMD) add, and write the sum back to the dummy array. Since the dummy array has the same number of columns as the main BRAM array, it can read out 40-bit data similar to the main BRAM. A 2-to-1 mux is added to select the data between the main BRAM and the dummy array.
B. Hybrid Bit-Serial & Bit-Parallel MAC Dataflow

Algorithm 1: Hybrid Bit-Serial & Bit-Parallel MAC2
    Require: All numbers are integers in 2's complement
    Input: W ∈ Z², I ∈ Z², precision n ≥ 2
    Output: P ∈ Z
     1:  Initialize P = 0
     2:  for i = (n − 1) downto 0 do
     3:      psum = W1 · I1[i] + W2 · I2[i]
     4:      if i == (n − 1) then
     5:          P = P + inv(psum) + 1
     6:          P = P << 1
     7:      else if i ≠ 0 then
     8:          P = P + psum
     9:          P = P << 1
    10:      else
    11:          P = P + psum
    12: return P

BRAMAC computes 2's complement MAC2 by adopting a hybrid bit-serial & bit-parallel dataflow [26] as described in Algorithm 1. The for-loop in lines 2–11 iterates through the two inputs bit-by-bit. Each iteration involves multiplying the entire W1 and W2 by a single bit from I1 and I2, respectively, followed by a bit-parallel addition to obtain the partial sum (psum) as shown in line 3. If the current input bit is the most-significant bit (MSB), then psum is subtracted from P (line 5) since the MSB is negative in 2's complement representation. If the current input bit is not the least-significant bit (LSB), then P also needs to be shifted left by 1-bit after adding psum (lines 6, 9).
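To make the dataflow concrete, the following minimal Python model of Algorithm 1 can be checked against a direct multiply-accumulate (the function name mac2 and the bit-extraction style are ours; line 5's inv(psum) + 1 is modeled as arithmetic negation, which is equivalent in 2's complement given sufficient bit-width). The asserted lane values anticipate the Fig. 4 walk-through in Section IV-A.

```python
def mac2(w1: int, w2: int, i1: int, i2: int, n: int) -> int:
    """Model of Algorithm 1: P = w1*i1 + w2*i2 for n-bit 2's-complement i1, i2."""
    assert n >= 2
    # View the n-bit inputs as unsigned bit patterns so single bits can be indexed.
    u1, u2 = i1 & ((1 << n) - 1), i2 & ((1 << n) - 1)
    p = 0                                                   # line 1
    for i in range(n - 1, -1, -1):                          # line 2: MSB -> LSB
        psum = w1 * ((u1 >> i) & 1) + w2 * ((u2 >> i) & 1)  # line 3
        if i == n - 1:
            p = p - psum        # lines 4-5: MSB has negative weight, so subtract
            p <<= 1             # line 6
        elif i != 0:
            p = p + psum        # line 8
            p <<= 1             # line 9
        else:
            p = p + psum        # line 11: LSB, no shift
    return p                    # line 12

# Lane values from the Fig. 4 example: W1 = -5, W2 = +7, I1 = 1001b = -7, I2 = 0011b = +3.
assert mac2(-5, 7, -7, 3, n=4) == (-5) * (-7) + 7 * 3 == 56
assert mac2(4, -3, -7, 3, n=4) == 4 * (-7) + (-3) * 3 == -37
```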
The hybrid bit-serial & bit-parallel MAC2 algorithm is efficient for computing matrix-vector multiplication (MVM), where the nth vector element is multiplied by all elements of the nth matrix column. To exploit this input-sharing in BRAMAC, two inputs are packed into the CIM instruction that is sent to BRAMAC, then multiplied by all elements of the corresponding two matrix columns copied to the dummy array, respectively. Copying a matrix column requires the weight matrix to be transposed so that matrix columns correspond to a BRAM row. This can be easily done offline for DNNs. Fig. 2 illustrates an example of using MAC2 to compute MVM where the matrix dimension is 8×6. For the first MAC2, the first and second matrix columns are copied from the main BRAM to the dummy array. Two vector elements I1 and I2 are streamed to BRAMAC through the CIM instruction and multiplied by all 8 elements of the first and second matrix columns to obtain 8 partial sums. For large matrices, the number of matrix elements that can be loaded to the dummy array depends on the MAC precision. Since the two read ports of the main BRAM have a total data width of 80-bit, they can copy ten 8-bit, twenty 4-bit, or forty 2-bit weights to a dummy array for one MAC2, providing a parallelism of 10, 20, or 40 MACs, respectively.

Fig. 2: Example of MAC2 to compute matrix-vector multiplication. (The transposed 8×6 weight matrix is stored across BRAM rows; each MAC2 copies two weight columns to the dummy array and streams two input elements, producing the eight partial sums W_{r,1}·I1 + W_{r,2}·I2 in parallel.)
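To show how MAC2 maps onto MVM, the sketch below computes the 8×6 example of Fig. 2 as three MAC2 passes, one per pair of matrix columns, reusing the mac2 model above (NumPy and the helper name mvm_mac2 are our additions; in hardware all 8 lanes of a pass run in parallel).

```python
import numpy as np

def mvm_mac2(W: np.ndarray, x: np.ndarray, n: int) -> np.ndarray:
    """Compute y = W @ x by streaming input pairs against column pairs of W."""
    rows, cols = W.shape
    assert cols % 2 == 0, "inputs are streamed two at a time"
    y = np.zeros(rows, dtype=np.int64)     # models the per-lane Accumulator row
    for j in range(0, cols, 2):            # one MAC2 per pair of matrix columns
        for r in range(rows):              # all lanes compute in parallel in hardware
            y[r] += mac2(int(W[r, j]), int(W[r, j + 1]), int(x[j]), int(x[j + 1]), n)
    return y

rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(8, 6))       # 4-bit weights, as in Fig. 2
x = rng.integers(-8, 8, size=6)            # 4-bit input vector
assert np.array_equal(mvm_mac2(W, x, n=4), W @ x)
```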
C. Circuit Design to Support MAC2

We now describe the new circuit blocks in BRAMAC to support MAC2. These circuit blocks are shown in Fig. 3, including a dual-port "dummy" BRAM array (Fig. 3(a)), a configurable sign-extension mux (Fig. 3(b)), a 160-bit SIMD adder implemented using 1-bit full adders, and read/write circuits (Fig. 3(c)).

Fig. 3: BRAMAC circuit blocks for computing MAC2: (a) dual-port dummy BRAM array, (b) configurable sign-extension mux (here we are displaying one out of five identical blocks), (c) 1-bit full adder with read/write circuits.

1) Dual-Port Dummy BRAM Array: The dual-port dummy BRAM array is 7-row × 160-column without column multiplexing, as shown in Fig. 3(a). Its SRAM cell is identical to that used in the main BRAM. Each column contains two sense amplifiers and two write drivers to allow true dual-port access. Its 1st row is hard-coded to always store 0. The 2nd and 3rd rows store the W1 and W2 vectors, respectively, that are copied from the main BRAM array. The 4th row stores a (W1 + W2) vector. The 5th Inverter row is used to store the temporary inverted psum required by the binary subtraction (line 5 of Algorithm 1). The 6th row stores the MAC2 result P. The 7th row is a wide Accumulator to accumulate multiple MAC2 results that form a large dot product.

The read and write operations of the dummy array are controlled by address and enable signals (blue signals in Fig. 3(a)) sent from the eFSM as described in Section III-A2. The access to the 1st – 4th rows during MAC2 is managed by both the decoder logic and a 2-to-4 demux. The 2-bit selection signal of the demux comes from the current two processing bits of the two inputs I1 and I2, respectively. This allows calculating psum (line 3 of Algorithm 1) using a look-up table [27]. If {I2[i], I1[i]} is 2'b00, then the 1st zero row will be read out and added to the 6th row P. If {I2[i], I1[i]} is 2'b11, then the 4th row (W1 + W2) will be read out and added to P. If {I2[i], I1[i]} is 2'b01 or 2'b10, then the 2nd row W1 or the 3rd row W2 will be read out and added to P, respectively.
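In other words, because psum in line 3 can only take the values 0, W1, W2, or W1 + W2, the demux turns multiplication into a row read. A small sketch of this look-up behavior (select_psum_row is a hypothetical helper name):

```python
def select_psum_row(w1: int, w2: int, i1_bit: int, i2_bit: int) -> int:
    """Return the dummy-array row selected by the bit pair {I2[i], I1[i]}."""
    rows = {
        (0, 0): 0,        # 1st row: hard-wired zero
        (0, 1): w1,       # 2nd row: W1
        (1, 0): w2,       # 3rd row: W2
        (1, 1): w1 + w2,  # 4th row: precomputed (W1 + W2)
    }
    return rows[(i2_bit, i1_bit)]

# Matches line 3 of Algorithm 1 for every possible bit pair:
for b2 in (0, 1):
    for b1 in (0, 1):
        assert select_psum_row(-5, 7, b1, b2) == -5 * b1 + 7 * b2
```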
Since the dummy array copies data from the main BRAM array for computation, a coherency issue may arise where the main BRAM is being updated while the dummy array is still computing using the stale data. We leave it to the programmer/compiler to explicitly ensure memory coherency, similar to the explicit handling of the read-during-write behavior of Intel's BRAM [28].

2) Configurable Sign-Extension Mux: Although not reflected in Algorithm 1, the W1 and W2 vectors from the main BRAM need to be sign-extended before being copied to the dummy array in order to prevent overflow during MAC2. To support this, two configurable sign-extension muxes are added between the main BRAM and the dummy array. Each mux has five identical blocks, one of which is shown in Fig. 3(b). Since the main BRAM has a data width of 40 bits, it can copy five 8-bit, ten 4-bit, or twenty 2-bit elements to the dummy array simultaneously. Each of the five identical mux blocks can sign-extend one 8-bit element to one 32-bit element (blue crosses in Fig. 3(b)), two 4-bit elements to two 16-bit elements (green crosses in Fig. 3(b)), or four 2-bit elements to four 8-bit elements (red crosses in Fig. 3(b)). Moreover, since a 2/4/8-bit MAC2 only requires a maximum bit-width of 5/9/17 bits to store its result, the proposed sign-extension mux provides a higher bit-width than the minimum required by MAC2. This allows multiple sequential MAC2 results to be accumulated by adding the 6th row (that stores the MAC2 result P) and the 7th row (that stores the Accumulator) of the dummy array.
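A behavioral Python model of this lane-wise sign extension (the helper name and LSB-first lane ordering are our assumptions; the hardware implements it with wiring and muxes rather than arithmetic):

```python
def sign_extend_word(word40: int, prec: int) -> list[int]:
    """Split a 40-bit word into 2/4/8-bit lanes; sign-extend each lane 4x wider."""
    assert prec in (2, 4, 8)
    lanes = []
    for k in range(40 // prec):                       # 20/10/5 lanes for prec 2/4/8
        lane = (word40 >> (k * prec)) & ((1 << prec) - 1)
        if lane >= 1 << (prec - 1):                   # negative in 2's complement
            lane -= 1 << prec
        lanes.append(lane & ((1 << (4 * prec)) - 1))  # 4x-wide 2's-complement pattern
    return lanes

# The 8-bit pattern 0xFB (-5) becomes the 32-bit pattern 0xFFFFFFFB.
assert sign_extend_word(0xFB, prec=8)[0] == 0xFFFFFFFB
```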
3) Bit-Parallel SIMD Adder with Read/Write Circuits: The 160-bit SIMD adder in BRAMAC is designed using the conventional 1-bit full adder, as shown in Fig. 3(c). It supports bit-parallel SIMD addition by configuring itself as twenty 8-bit adders, ten 16-bit adders, or five 32-bit adders for 2-bit, 4-bit, and 8-bit MAC2, respectively, giving a worst-case delay equal to a 32-bit addition. The two operands A and B of the SIMD adder come from two sense amplifiers, SA1 and SA2, which compare the voltage differential of the two bit-line pairs, (BLA, BLbA) and (BLB, BLbB). To support the addition followed by a 1-bit shift-left operation (required in lines 6 and 9 of Algorithm 1), a write-back mux M1 before the write driver WD1 is used to select either the sum S from the current full adder or the sum SRight from the full adder to the right. M1 can also select ramA to copy the first data W1 from the main BRAM. Similarly, a write-back mux M2 before the write driver WD2 is used to select between three signals: the inverted B to perform inversion, ramB to copy the second data W2 from the main BRAM, and 1'b0 to initialize either P (line 1) or the Accumulator. Both M1 and M2 are controlled by the eFSM.
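Functionally, the precision-configurable adder is one wide adder whose carry chain is cut at every lane boundary. The sketch below models that lane behavior (a behavioral model of the configuration, not of the Fig. 3(c) circuit itself):

```python
def simd_add(a: int, b: int, total_bits: int, lane_bits: int) -> int:
    """Add two packed words as independent lanes, dropping each lane's carry-out."""
    mask = (1 << lane_bits) - 1
    out = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        out |= (lane_sum & mask) << shift   # carry does not cross the lane boundary
    return out

# Two 8-bit lanes: (0x01 + 0xFF) wraps to 0x00 in its lane; (0x02 + 0x03) = 0x05.
assert simd_add(0x0201, 0x03FF, total_bits=16, lane_bits=8) == 0x0500
```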
IV. BRAMAC VARIANTS

A. BRAMAC with Two Synchronous Dummy Arrays (2SA)

This variant, called BRAMAC-2SA, has two synchronous dummy arrays that share the same clock domain as the main BRAM. In this architecture, each dummy array is fed by one port of the main BRAM during weight copy. Since BRAMAC intrinsically supports multiplying the same input with many weights as discussed in Section III-B, this variant adopts a weight-sharing approach to balance the data reuse. Specifically, in each MAC2 iteration, the two dummy arrays copy the same weights but process different inputs. The first dummy array receives two inputs I1, I2 and calculates W1·I1 + W2·I2, while the second dummy array receives another two inputs I3, I4 and calculates W1·I3 + W2·I4.

An example 4-bit MAC2 operation for one dummy array of BRAMAC-2SA is illustrated in Fig. 4. Note that we are displaying 2 out of 10 lanes with 10-bit sign-extension due to space limitation (instead of 16-bit sign-extension as described in Section III-C2). In Cycle 1 and Cycle 2, W1 and W2 are sign-extended and copied to the dummy array. During these two cycles, the two inputs for each dummy array are also sent to BRAMAC-2SA through the CIM instruction and latched for further processing. In Cycle 3, W1 and W2 are read out and added. The sum is written back to the 4th row to store (W1 + W2). Simultaneously, the 6th row P can also be initialized to zero. In Cycle 4, the MSB of the two inputs is streamed to the dummy array. The selected row W1 is inverted to prepare for the binary subtraction.
In Cycle 5, Inverter is added to P. The sum is shifted left by 1-bit and written back to P. The input streaming continues until Cycle 8, where the LSB of the two inputs is processed and the correct MAC2 result P is obtained. In Cycle 9, P is added to the 7th Accumulator row. Then it can be initialized for the subsequent MAC2.

Fig. 4: Example operation of one dummy array in BRAMAC-2SA for 4-bit MAC2. We are displaying 2 out of 10 lanes with 10-bit sign extension instead of 16 bits (due to space limitation). In the two displayed lanes, (W1, W2) = (−5, +7) and (+4, −3); streaming the input bit-pairs {I2[i], I1[i]} = 01, 00, 10, 11 over Cycles 4–8 (i.e., I1 = 1001b = −7, I2 = 0011b = +3) and accumulating in Cycle 9 yields the dot products +56 and −37, respectively.

The above example indicates that BRAMAC-2SA can complete a 4-bit MAC2 using 9 cycles. However, during the write-back phase of the last two cycles, i.e., Cycle 8 and Cycle 9, the current two weights W1 and W2 are no longer needed in the dummy array since the current MAC2 result P is already obtained at the bit-parallel adder's output. As a result, these two cycles can also be used to copy the next two weights W3 and W4, respectively, as illustrated in Fig. 5(a). Therefore, the 4-bit MAC2 in BRAMAC-2SA only requires 7 cycles to complete. This pipelining can also be applied to 2-bit and 8-bit MAC2. The only difference between 2-bit, 4-bit, and 8-bit MAC2 is the number of cycles spent processing every input bit as described in lines 2–11 of Algorithm 1. Thus, 2-bit and 8-bit MAC2 take 5 and 11 cycles to complete, respectively.
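The BRAMAC-2SA latencies quoted above fit a simple model: n serially processed input bits plus a fixed overhead, with the 2-cycle weight copy hidden when pipelined. The closed form below is our observation, not a formula given in the paper:

```python
def cycles_2sa(n: int, pipelined: bool = True) -> int:
    """Cycle count for one n-bit MAC2 on BRAMAC-2SA (empirical fit to the text)."""
    overhead = 3 if pipelined else 5   # the 2-cycle weight copy overlaps when pipelined
    return n + overhead

assert [cycles_2sa(n) for n in (2, 4, 8)] == [5, 7, 11]  # counts stated in the text
assert cycles_2sa(4, pipelined=False) == 9               # the unpipelined Fig. 4 flow
```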
B. BRAMAC with One Double-Pumped Dummy Array (1DA)

This variant, called BRAMAC-1DA, has only one dummy array to reduce the area overhead. Using one dummy array degrades the MAC throughput by 2× compared to BRAMAC-2SA; however, we propose to double-pump the dummy array with a 2× main BRAM clock frequency. Memory multi-pumping is a commonly used technique in FPGA design to improve system throughput [29], [30]. The double-pumped dummy array doesn't add any additional area overhead compared to a synchronous dummy array. Rather, it only requires a separate clock routing during compilation.

Because the main BRAM and the dummy array only interact during weight copy, synchronization between them can be easily handled. Fig. 5(b) shows the pipeline diagram of 4-bit MAC2 for BRAMAC-1DA. In Cycle 1, the main BRAM reads out two weights W1 and W2. In the first half of Cycle 2, the dummy array copies W1 and W2 using its two write ports. Then the dummy array can compute the MAC2 using the same operation flow as BRAMAC-2SA, except that every cycle in BRAMAC-2SA is now half a cycle in BRAMAC-1DA. Similar to the pipelining optimization for BRAMAC-2SA, the main BRAM can start to read the next two weights W3 and W4 in Cycle 5 while the dummy array is computing. As a result, the 4-bit MAC2 can be completed using 4 cycles. This pipelining can also be applied to 2-bit and 8-bit MAC2. Hence, 2-bit and 8-bit MAC2 take 3 and 6 cycles to complete, respectively.

Fig. 5: Pipeline diagram of 4-bit MAC2 in (a) BRAMAC-2SA and (b) BRAMAC-1DA. (Red boxes mark the cycles where the main BRAM is busy feeding the dummy array; in all other cycles the main BRAM is idle and its ports remain free.)

C. Embedded FSM to Free Up BRAM Ports

Since the dummy array's behavior is deterministic for computing MAC2, we propose to control it using an eFSM. This eFSM receives a CIM instruction to trigger the MAC2 computation and control the dummy array's read/write access. The CIM instruction is only required when the main BRAM needs to send data to the dummy array (indicated by the red boxes in Fig. 5). As a result, the main BRAM is busy for 2 cycles in BRAMAC-2SA and 1 cycle in BRAMAC-1DA. When the main BRAM is idle, it can perform normal read operations to feed LBs/DSPs or write operations to load the next tile of weights from off-chip DRAM, allowing tiling-based DNN acceleration. This is different from CCB and CoMeFa, whose BRAM ports are always busy during CIM.

Fig. 6(a) and (b) show the proposed CIM instruction format for BRAMAC-2SA and BRAMAC-1DA, respectively. For BRAMAC-2SA, bramRow and bramCol are combined to form one BRAM address during each copy operation. On the other hand, BRAMAC-1DA needs to receive two BRAM row addresses (bramRow_1 and bramRow_2 in Fig. 6(b)) to read the two weight vectors.

Fig. 6: CIM instruction format for (a) BRAMAC-2SA and (b) BRAMAC-1DA.
(a) BRAMAC-2SA: bits 0–1 prec, 2 inType, 3 reset, 4 start, 5 copy, 6 w1_w2, 7 done, 8–14 bramRow, 15–16 bramCol, 17–24 input_1, 25–32 input_2.
(b) BRAMAC-1DA: bits 0–1 prec, 2 inType, 3 reset, 4 start, 5 copy, 6 done, 7–13 bramRow_1, 14–20 bramRow_2, 21–22 bramCol, 23–30 input_1, 31–38 input_2.
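As an illustration, a BRAMAC-2SA CIM instruction could be assembled from the Fig. 6(a) fields as below (field order and widths follow the figure; the packing helper and the 0b01 encoding for 4-bit precision are our assumptions):

```python
FIELDS_2SA = [  # (name, width) in LSB-to-MSB order per Fig. 6(a); 33 bits total
    ("prec", 2), ("inType", 1), ("reset", 1), ("start", 1),
    ("copy", 1), ("w1_w2", 1), ("done", 1),
    ("bramRow", 7), ("bramCol", 2), ("input_1", 8), ("input_2", 8),
]

def pack_cim_2sa(**values: int) -> int:
    """Pack named fields into the 40-bit portA word written to address 0xfff."""
    word, shift = 0, 0
    for name, width in FIELDS_2SA:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << shift
        shift += width
    return word

# Trigger a 4-bit MAC2 that copies a weight row and streams the input pair (-7, +3):
insn = pack_cim_2sa(prec=0b01, start=1, copy=1, bramRow=5,
                    input_1=0b1001, input_2=0b0011)
assert insn < (1 << 40)   # fits in the 40-bit CIM instruction
```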
TABLE I: Resource Counts and Area Ratio of the Baseline Arria 10 GX900 FPGA.

Resource             Count   Area Ratio
Logic Blocks (LBs)   33920   70.4%
DSP Units            1518    9.5%
BRAMs (M20K)         2423    20.1%

Fig. 13: Comparison between DLA and DLA-BRAMAC for accelerating AlexNet and ResNet at different precisions: (a) performance, (b) utilized DSP-plus-BRAM area, (c) performance per area.
…where an additional 2 cycles are required to start the initial weight copy. However, this overhead is negligible given that each CNN layer takes thousands of cycles to complete.

Table III summarizes the optimal configuration for each (accelerator, model, precision) case. The performance and utilized DSP-plus-BRAM area of DLA-BRAMAC, normalized to those of DLA, are shown in Fig. 13. The utilized DSP-plus-BRAM area is calculated based on the area overhead of BRAMAC and the area model from [34]. On average, compared to the baseline DLA for AlexNet, employing BRAMAC-2SA/BRAMAC-1DA achieves 2.05×/1.7× speedup at the cost of 2.01×/1.52× DSP-plus-BRAM area, giving 1.01×/1.12× performance gains per utilized area. For ResNet-34, employing BRAMAC-2SA/BRAMAC-1DA achieves a lower speedup of 1.33×/1.52× on average at the cost of 1.2×/1.22× DSP-plus-BRAM area, which corresponds to 1.11×/1.25× performance gains per utilized area. The larger DSP-plus-BRAM area is mainly attributed to more BRAM usage for computation and BRAMAC's area overhead.
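The performance-per-area figures are simply the speedup divided by the normalized area; a quick check reproduces the quoted values up to rounding of the intermediate averages:

```python
# (case, speedup, normalized area, reported performance per area)
for case, speedup, area, reported in [
    ("AlexNet   2SA", 2.05, 2.01, 1.01),
    ("AlexNet   1DA", 1.70, 1.52, 1.12),
    ("ResNet-34 2SA", 1.33, 1.20, 1.11),
    ("ResNet-34 1DA", 1.52, 1.22, 1.25),
]:
    assert abs(speedup / area - reported) < 0.02, case
```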
In general, BRAMAC-2SA and BRAMAC-1DA achieve higher speedup for AlexNet compared to ResNet-34, as shown in Fig. 13(a). This is because BRAMAC is better at supporting a higher Kvec that allows the same input feature to be multiplied by many kernels. The early and most compute-intensive residual blocks of ResNet-34 only have an output channel depth of 64, while the first convolution layer of AlexNet has an output channel depth of 96. The latter gives more freedom for DLA-BRAMAC to optimize its configuration with high vectorization efficiency. However, a higher speedup for AlexNet comes with a larger utilized area, as illustrated in Fig. 13(b). Comparing the two BRAMAC variants, BRAMAC-2SA has a lower performance gain per utilized area for all model-precision combinations, as observed from Fig. 13(c). Although the MAC throughput of BRAMAC-2SA is slightly improved over BRAMAC-1DA, it has 2× BRAM area overhead compared to BRAMAC-1DA. While our results more than justify the area overhead of BRAMAC, we expect higher gains for a DNN accelerator that is: (1) purpose-built around the capabilities of BRAMAC, and (2) used to accelerate DNNs with more matrix multiplications such as transformers [37]—we will work on both aspects in the future.

VII. CONCLUSION

This paper proposes BRAMAC, a compute-in-BRAM architecture for MAC on FPGAs. To the best of our knowledge, BRAMAC is the first compute-in-BRAM architecture that: (1) adopts a hybrid bit-serial & bit-parallel dataflow to support variable-precision MAC using 2's complement representation, (2) computes in a separate dummy array, which improves the main BRAM array's utilization efficiency, and (3) employs an embedded finite-state machine to free up the main BRAM ports during in-memory computation. The two proposed variants, BRAMAC-2SA/BRAMAC-1DA, boost the peak MAC throughput of a large Arria 10 FPGA by 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and 8-bit precisions, respectively, at the cost of a 6.8%/3.4% increase in FPGA core area. BRAMAC also improves the BRAM utilization efficiency by 1.3× and 1.1× compared to two recent compute-in-BRAM architectures, CCB and CoMeFa, respectively, while significantly outperforming both architectures on matrix-vector multiplication. Combining BRAMAC-2SA/BRAMAC-1DA with Intel's DLA, a tiling-based DNN accelerator, achieves an average speedup of 2.05×/1.7× and 1.33×/1.52× for AlexNet and ResNet-34, respectively. With its ability to support both persistent and tiling-based DNN acceleration, BRAMAC has the potential to be a highly practical and valuable addition to future AI-optimized FPGAs.
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language Models are Few-Shot Learners," in Advances in Neural Information Processing Systems, 2020, pp. 1877–1901.
[4] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, and T. Blankevoort, "A White Paper on Neural Network Quantization," arXiv:2106.08295, 2021.
[5] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," in Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[6] H. Wu, "Low Precision Inference on GPU," 2019. [Online]. Available: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9659-inference-at-reduced-precision-on-gpus.pdf
[7] Nvidia, "INT4 Precision for AI Inference," 2019. [Online]. Available: https://developer.nvidia.com/blog/int4-for-ai-inference/
[8] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger, "A Configurable Cloud-Scale DNN Processor for Real-Time AI," ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14, 2018.
[9] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL™ Deep Learning Accelerator on Arria 10," ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017.
[10] M. S. Abdelfattah, D. Han, A. Bitar, R. Dicecco, S. O'Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, and G. R. Chiu, "DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration," 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 411–417, 2018.
[11] Intel, "Intel Stratix 10 NX FPGA Overview," 2020. [Online]. Available: https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html
[12] Xilinx, "UltraScale Architecture DSP Slice User Guide (UG579 v1.11)," 2021. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ug579-ultrascale-dsp
[13] Intel, "Intel Stratix 10 Variable Precision DSP Blocks User Guide (UG-S10-DSP)," 2021. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683832.pdf
[14] Achronix, "Speedcore eFPGAs." [Online]. Available: https://www.achronix.com/sites/default/files/docs/Speedcore_eFPGA_Product_Brief_PB028.pdf
[15] A. Boutros, S. Yazdanshenas, and V. Betz, "Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs," 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 35–42, 2018.
[16] S. Rasoulinezhad, H. Zhou, L. Wang, and P. H. W. Leong, "PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks," IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 35–44, 2019.
[17] X. Wang, V. Goyal, J. Yu, V. Bertacco, A. Boutros, E. Nurvitadhi, C. Augustine, R. R. Iyer, and R. Das, "Compute-Capable Block RAMs for Efficient Deep Learning Acceleration on FPGAs," IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 88–96, 2021.
[18] A. Arora, T. Anand, A. Borda, R. Sehgal, B. Hanindhito, J. Kulkarni, and L. K. John, "CoMeFa: Compute-in-Memory Blocks for FPGAs," IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 1–9, 2022.
[19] A. Boutros and V. Betz, "FPGA Architecture: Principles and Progression," IEEE Circuits and Systems Magazine, vol. 21, pp. 4–29, 2021.
[20] M. Eldafrawy, A. Boutros, S. Yazdanshenas, and V. Betz, "FPGA Logic Block Architectures for Efficient Deep Learning Inference," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 13, pp. 1–34, 2020.
[21] Xilinx, "DSP58 Architecture," 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/am004-versal-dsp-engine/DSP58-Architecture
[22] Intel, "Intel Agilex Variable Precision DSP Blocks User Guide," 2021. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683037.pdf
[23] M. Langhammer, E. Nurvitadhi, S. Gribok, and B. M. Pasca, "Stratix 10 NX Architecture," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 15, pp. 1–32, 2022.
[24] M. Horowitz, "Computing's Energy Problem (and what we can do about it)," IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, 2014.
[25] D. M. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek, and H. Yu, "Architectural Enhancements in Stratix V," in ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2013.
[26] P. Judd, J. Albericio, and A. Moshovos, "Stripes: Bit-Serial Deep Neural Network Computing," 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, 2016.
[27] C.-F. Lee, C. Lu, C.-E. Lee, H. Mori, H. Fujiwara, Y.-C. Shih, T.-L. Chou, Y. D. Chih, and T.-Y. J. Chang, "A 12nm 121-TOPS/W 41.6-TOPS/mm2 All Digital Full Precision SRAM-based Compute-in-Memory with Configurable Bit-width For AI Edge Applications," IEEE Symposium on VLSI Technology and Circuits, pp. 24–25, 2022.
[28] Intel, "Intel Arria 10 Core Fabric and General Purpose I/Os Handbook," 2022. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683461.pdf
[29] J. Choi, K. Nam, A. Canis, J. H. Anderson, S. D. Brown, and T. S. Czajkowski, "Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems," IEEE 20th International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 17–24, 2012.
[30] R. Shi, Y. Ding, X. Wei, H. Li, H. Liu, H. K.-H. So, and C. Ding, "FTDL: A Tailored FPGA-Overlay for Deep Learning with High Scalability," 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, 2020.
[31] S. Yazdanshenas, K. Tatsumura, and V. Betz, "Don't Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration," ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 115–124, 2017.
[32] Arizona State University, "Predictive Technology Model," 2012. [Online]. Available: http://ptm.asu.edu/
[33] Intel, "Arria 10 Device Overview," 2022. [Online]. Available: https://www.intel.com/programmable/technical-pdfs/683332.pdf
[34] R. Rashid, J. G. Steffan, and V. Betz, "Comparing Performance, Productivity and Scalability of the TILT Overlay Processor to OpenCL HLS," International Conference on Field-Programmable Technology (FPT), pp. 20–27, 2014.
[35] University of California, Berkeley, "EE241, Lecture 18: Adders," 2003. [Online]. Available: http://bwrcs.eecs.berkeley.edu/Classes/icdesign/ee241_s03/Lectures/lecture18-adders.pdf
[36] J. Sommer, M. A. Özkan, O. Keszocze, and J. Teich, "DSP-Packing: Squeezing Low-Precision Arithmetic into FPGA DSP Blocks," arXiv:2203.11028, 2022.
[37] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All You Need," arXiv:1706.03762, 2017.