
Binary Convolutional Neural Network on RRAM


Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang
Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList),
Tsinghua University, Beijing, China
e-mail: [email protected]

Abstract—Recent progress in the machine learning field makes low bit-level Convolutional Neural Networks (CNNs), even CNNs with binary weights and binary neurons, achieve satisfying recognition accuracy on the ImageNet dataset. Binary CNNs (BCNNs) make it possible to introduce low bit-level RRAM devices and low bit-level ADC/DAC interfaces into RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before. However, some design challenges still exist: (1) how to perform matrix splitting when one crossbar is not large enough to hold all parameters of one layer; (2) how to design the pipeline to accelerate the whole CNN forward process.

In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process. Moreover, the special design for BCNN is discussed in detail, especially the matrix splitting problem and the pipeline implementation. In our experiment, BCNNs on RRAM show much smaller accuracy loss than multi-bit CNNs for LeNet on MNIST when considering device variation. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% energy consumption and 56.8% area compared with the multi-bit CNN structure.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great performance in various recognition tasks, including image classification [1], video tracking [2] and natural language processing [3]. At the same time, they require larger computational intensity and higher bandwidth than traditional non-CNN models [4]. The emerging RRAM-based Computing System (RCS) has been considered a promising solution for future CNN accelerators [5]–[8], where the RRAM-based crossbar can not only store the weight parameters of CNN models but also be used as the matrix-vector multiplier. In this way, the energy cost for data transfer is reduced and less bandwidth is required. Moreover, thanks to the crossbar-level parallelism, it also reduces the running time complexity from O(n²) to O(1).

However, the current resistance precision of RRAM devices is limited [9], and the impact of write variation and reliability problems increases with the bit level of the RRAM device [10]. On the other hand, the interfaces between the analog RRAM-based crossbars and the digital peripheral units take up most of the area and power consumption, which makes the whole RRAM-based computing system less efficient than expected [5]. Therefore, the high precision of data and weights in state-of-the-art CNNs becomes the main challenge for RRAM-based implementation.

Recently, researchers in the field of machine learning have demonstrated that Binary CNNs (BCNNs) achieve satisfying recognition accuracy on the ImageNet dataset [11], [12]. BCNNs use binary weights and data when processing the forward propagation. This provides a promising solution to break the high-precision limits in current RRAM-based CNN accelerator design. Faster read-and-write operations and better energy efficiency can be achieved by exploiting the binary characteristics of BCNNs.

Some challenges still exist in the RRAM-based BCNN accelerator design when the network scale increases. First, the length of a crossbar column is not large enough to hold all the weight parameters of one Convolution (Conv) kernel in large BCNNs like VGG [1]. Therefore, the operation of matrix splitting is inevitable, and the high-cost interfaces are still required for the intermediate data in splitting. Second, the size of intermediate data between layers increases rapidly with the network scale and introduces large overhead.

In this paper, an RRAM crossbar-based BCNN accelerator is proposed. The contributions of this paper include:

• In our BCNN accelerator design, the matrix splitting problem when mapping weight parameters to RRAM is discussed in detail. Thanks to the line buffer introduced for intermediate data buffering, a pipeline strategy is proposed for system efficiency.
• The robustness of BCNN on RRAM under device variation is demonstrated. For LeNet on MNIST, the binary CNN achieves a 0.75% error rate on 3-bit RRAM devices in the case of device variation.
• Experimental results show that 58.2% of energy and 56.8% of area consumption are saved when using BCNN for AlexNet on ImageNet.

The rest of this paper is organized as follows: Section II introduces the related background and the motivation of our work; Section III proposes the RRAM-based BCNN accelerator design, especially the pipeline design; Section IV uses the case studies of LeNet on MNIST and AlexNet on ImageNet to analyze recognition accuracy, area and energy efficiency; and Section V concludes the paper.

(This work was supported by 973 Project 2013CB329000, National Natural Science Foundation of China (No. 61622403, 61373026, 61261160501), and Brain Inspired Computing Research, Tsinghua University. We gratefully thank Dr. Xudong Fei from Huawei Co. for the discussion.)

II. PRELIMINARIES AND MOTIVATION

A. CNN

A typical CNN consists of a number of different kinds of layers that run sequentially, i.e. the output of the previous layer is the input of the next layer. The input/output of one layer is named "feature map" while the parameters of one layer are called "weights". In a standard CNN structure, cascaded Convolutional (Conv) layers (optionally followed by Neuron layers, Max Pooling layers, Normalization layers) are followed by one or more Fully-Connected layers [1].

Conv Layer can be expressed as in Eq. 1:

$$ f_{out}(x, y, z) = \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \sum_{k=0}^{C_{in}-1} f_{in}(x+i, y+j, k) \cdot c_z(i, j, k) \qquad (1) $$

where i, j, and k are the spatial coordinates of the three-dimensional (3-D) input feature map F_in with size H_in × W_in × C_in; x, y, and z are the coordinates of the output feature map F_out with size H_out × W_out × C_out; C_z is the z-th Conv kernel with size h × w × C_in; and there are C_out kernels in one Conv Layer. In this way, all the Conv kernel parameters in one layer form a 4-D blob with size (h, w, C_in, C_out). A sliding stride s is used to skip some pixels and reduce the computation amount, while zero padding is introduced when the convolution is processed at the edge of the feature maps and the pixels are not enough for one whole Conv kernel.
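To make Eq. 1 concrete, the following is a minimal NumPy sketch of one Conv Layer's forward pass with stride s and zero padding; the (H, W, C) array layout follows the notation above, while the function name and argument choices are illustrative and not taken from the paper.

```python
import numpy as np

def conv_layer_forward(f_in, kernels, stride=1, pad=0):
    """Direct implementation of Eq. 1 for one Conv Layer.

    f_in    : (H_in, W_in, C_in) input feature map
    kernels : (h, w, C_in, C_out) 4-D weight blob
    Returns : (H_out, W_out, C_out) output feature map
    """
    h, w, c_in, c_out = kernels.shape
    # Zero padding at the feature-map edges
    f_pad = np.pad(f_in, ((pad, pad), (pad, pad), (0, 0)))
    h_out = (f_pad.shape[0] - h) // stride + 1
    w_out = (f_pad.shape[1] - w) // stride + 1
    f_out = np.zeros((h_out, w_out, c_out))
    for x in range(h_out):
        for y in range(w_out):
            window = f_pad[x * stride:x * stride + h, y * stride:y * stride + w, :]
            for z in range(c_out):
                # Eq. 1: sum over i, j, k of f_in(x+i, y+j, k) * c_z(i, j, k)
                f_out[x, y, z] = np.sum(window * kernels[:, :, :, z])
    return f_out
```

With a 3×3 kernel, stride 1, and one pixel of zero padding, this reproduces the "same-size" input/output behaviour assumed later for the VGG-style Conv layers.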


Neuron Layer is attached after the Conv Layer and makes a nonlinear one-by-one mapping (y = f(x)). Binary Neurons are used in our BCNN system as proposed in BinaryNet [11]. The forward function can be expressed as in Eq. 2:

$$ y = \begin{cases} 1, & x > 0 \\ -1, & x \le 0 \end{cases} \qquad (2) $$

Max Pooling Layer is cascaded after the non-linear neurons. It picks the largest element in the neighboring area of the input feature map in order to reduce the data amount and keep the local invariance.

Fully-Connected (FC) Layer can be expressed as in Eq. 3:

$$ f_{out}(y) = \sum_{x=0}^{C_{in}-1} f_{in}(x) \cdot c(x, y) \qquad (3) $$

where x is the index of the 1-D input feature map vector F_in with length C_in; y is the index of the output feature map vector F_out with length C_out; and the 2-D weight matrix C is of size C_in × C_out.

Batch Normalization (BN) Layer [13], which solves the problem of internal covariate shift, has been introduced in most state-of-the-art network models, especially the binary network models, e.g. BinaryNet [11] and XNOR-Net [12]. The operation of the BN Layer can be abstracted into a linear one-to-one mapping, whose parameters are learned in the training process.
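As a quick illustration of how these layers compose in a BCNN forward pass, the sketch below chains Eq. 3, the BN affine mapping, and the binary neuron of Eq. 2 on ±1-encoded data. It is only an illustrative software model, not the accelerator's digital logic; all names are assumptions.

```python
import numpy as np

def binary_neuron(x):
    # Eq. 2: +1 for x > 0, -1 otherwise
    return np.where(x > 0, 1, -1)

def fc_layer(f_in, weights):
    # Eq. 3: 1-D input of length C_in, weight matrix of size C_in x C_out
    return f_in @ weights

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    # At inference time BN is a fixed linear one-to-one mapping
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# One FC -> BN -> binary neuron stage with binary (+1/-1) weights and inputs
rng = np.random.default_rng(0)
f_in = binary_neuron(rng.standard_normal(128))           # binary input features
weights = binary_neuron(rng.standard_normal((128, 64)))  # binary weight matrix
pre_act = batch_norm(fc_layer(f_in, weights),
                     gamma=1.0, beta=0.0, mean=0.0, var=1.0)
f_out = binary_neuron(pre_act)                            # binary output features
```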
Fig. 1: Structure of the RRAM-based Crossbar.

B. RRAM Device, Crossbar Array and Crossbar Interface

An RRAM device is a passive two-port element with multiple resistance states, and multiple devices can be used to build the crossbar structure. When the "matrix" is represented by the conductivity of the RRAM devices and the "vector" is represented by the input voltage signals, the RRAM crossbar is able to perform as an analog matrix-vector multiplier (MVM). Specifically, the relationship between the input and output signals can be expressed as in Eq. 4 [6]:

$$ i_{out}(k) = \sum_{j=0}^{N-1} g(k, j) \cdot v_{in}(j) \qquad (4) $$

where V_in is the vector of input voltages (indexed by j = 0, 1, ..., N−1), I_out is the vector of output currents (indexed by k = 0, 1, ..., M−1), and G is the conductivity matrix of the RRAM devices. Taking advantage of the natural "multiplication and merging" function of the crossbar structure, the RRAM crossbar can implement the Conv kernels and FC Layers in analog mode with high speed, small area, and low power [6].

For FC Layers, the weight matrices are directly mapped to the RRAM crossbars [5]. For Conv Layers, one Conv kernel is mapped to one RRAM column, and different columns in one crossbar correspond to different Conv kernels [6], as shown in Fig. 1.

C. Motivation

Compared with well-trained networks that use floating-point weight parameters and feature maps on CPU/GPU platforms, the RRAM devices and the crossbar interfaces can only support limited bit levels.

1) Limited Bit Levels of RRAM Devices: To the best of our knowledge, only 7-bit weights [9] are currently available for a single RRAM device. However, state-of-the-art fixed-point CNNs require 8- or 16-bit precision weights [4], [14]. As a result, multiple RRAM devices have to be used to represent one number [7], [8], and large energy overhead is introduced. Moreover, multi-bit devices suffer from more variation and reliability problems than single-bit devices [10], which decreases the recognition accuracy of the computing accelerator. Therefore, the precision of RRAM resistance levels limits both the energy efficiency and the accuracy of the RRAM-based computational system.

2) Limited Bit Levels of Crossbar Interfaces: Since the crossbars work in the analog mode, interfaces are needed for the transformation between the digital signals in the nearby computing units and the analog signals in the crossbar-based MVM. There are two kinds of interfaces in an RRAM-based computational system. On the one hand, the interfaces between the RRAM crossbar and the CPU, i.e. the input interface in the first layer and the output interface in the final layer, are required. On the other hand, the interfaces between RRAM crossbars in different layers are required in CNNs. This is because CNNs are not fully-connected networks: each RRAM crossbar needs to process multiple cycles with different inputs, and the temporary results of each cycle need to be buffered until all the neighboring results are obtained. The detailed function will be illustrated in Section III. An intuitive choice is to use DAC/ADCs as the interfaces, but huge overheads are introduced by high-precision ADC/DACs. Li [5] pointed out that 8-bit ADC/DACs contribute more than 85% of the area and power consumption of the whole RCS.

Therefore, it would contribute a lot to energy efficiency if a well-trained network model with low bit-level weight parameters and feature maps, especially binary ones, could be achieved.

D. Challenges of RRAM-based BCNN

Some recent papers have already shown that completely binary CNNs (BCNNs) are achievable if the 1-bit quantization is processed in training. Courbariaux [11] proposes a sampling method which trains the binary network together with the floating-point network, while Rastegari [12] proposes BinaryWeight by minimizing the binary quantization loss during training. Moreover, the weights and the feature maps are also binarized.

Based on these results, in this paper, we propose an RRAM crossbar-based BCNN accelerator, achieving higher energy efficiency compared with multi-bit CNNs. However, when the network scale increases, two main challenges limit the energy efficiency of the accelerator.

1) Splitting Interface Overhead: Splitting is required when the size of the Conv kernels is larger than the length of the crossbar column. State-of-the-art RRAM crossbars only achieve a column length of 512 [6]. A crossbar of such size is not able to hold some large Conv kernels, e.g. a Conv kernel with the size 4608 (= 3×3×512) in the VGG16 model. Therefore, the high-cost interfaces are still required, because the intermediate data in splitting need high precision. PRIME [7] and ISAAC [8] discussed matrix splitting methods for full-precision CNNs by using high-precision ADC/DACs, so the energy efficiency is still limited. Considering that BCNN provides the potential for low-precision crossbar interfaces, a BCNN-specific low-precision splitting structure is in demand.

2) Buffer Overhead: Since the RRAM crossbar uses multiple inputs in the same cycle, the processed data between layers can only be buffered by registers instead of RAMs. Therefore, thousands of registers and corresponding multiplexers are required for large networks. ISAAC [8] gives a rough design for pipelining the Conv operation of different layers, where weight duplication is introduced to balance the load of the pipeline, but a thorough discussion of the pipeline implementation is still lacking. Since only a few registers are used in each cycle, there exists parallel potential between layers to reduce the buffer size while boosting the processing speed. As a result, a pipeline design between layers is necessary for both CNN and BCNN accelerators.


Fig. 2: (a) Overall Structure of the RRAM-based BCNN Accelerator: “N Conv Layers + M FC Layers” with the Input Image and the
Output Recognition Result, and each Conv Layer Optionally Followed by the Pooling Layer; (b)-(d) The Dataflow of the “Conv Layer”,
“Pooling Layer”, and “FC Layer”; (e) The Convolver Circuit for one Conv/FC Layer on RRAM-based Platform; (f) The Conv Line Buffers.

III. RRAM-BASED BCNN ACCELERATOR DESIGN

As shown in Fig. 2 (a), the whole accelerator is made up of a series of Conv Layers cascaded by a series of FC Layers. The Pooling Layer module optionally follows the Conv Layer. The data paths for the Conv, Pooling, and FC Layers are respectively shown in Fig. 2 (b)-(d). Each layer consists of its own Input Buffer and Computing Circuit.

• Computing Circuit: For Conv and FC Layers, the Convolver Circuit is made up of the RRAM crossbar-based MVMs, as shown in Fig. 2 (e). Some digital peripheral units, including circuits for neurons and batch normalization, are also placed in front of or behind the crossbar groups. For Pooling Layers, the computing circuit can be easily implemented as a multi-input "OR" gate in the BCNN design.
• Input Buffer: For Conv and Pooling Layers, the sliding window operation exists. Therefore, the structure of the Line Buffer (LB) is introduced for intermediate data buffering and fetching, as shown in Fig. 2 (f). For FC Layers, regular buffers are used since nearby layers are fully connected.

In this section, we first discuss the design of the Convolver Circuit and matrix splitting in III-A; we then discuss the design of intermediate data buffering and the implementation of the pipeline in III-B.

A. Convolver Circuit: The Problem of Matrix Splitting

Considering both the large energy cost of the writing operation and the endurance limit of RRAM devices [15], reusing an RRAM crossbar through repeated read-and-write operations is not practical. Since high area density is an important advantage of RRAM, all the Conv kernels can be mapped onto the crossbars in the Convolver Circuit of the corresponding layer. In this way, each output channel is able to produce one output element in one processing cycle if enough data have been fed into this layer's Line Buffers by the former Conv Layer. However, matrix splitting is necessary for large Conv kernels, as discussed in Section II-D, and the same holds for large FC matrices.

1) Column Splitting: If the crossbar column count (M) is smaller than the Conv kernel count (C_out, the same as the output channel count) of this layer, then the C_out Conv kernels are split into X_out^(Conv) groups of RRAM crossbars, as shown in Eq. 5. Copies of the input feature map with the size of one Conv kernel (h · w · C_in) are sent to each group of crossbars.

$$ X_{in}^{(Conv)} = \left\lceil \frac{h \cdot w \cdot C_{in}}{N} \right\rceil, \qquad X_{out}^{(Conv)} = \left\lceil \frac{C_{out}}{M} \right\rceil \qquad (5) $$

2) Row Splitting: If the cross-point count (N) in one RRAM column is smaller than the Conv kernel size (h · w · C_in), then the elements of one Conv kernel are split into X_in^(Conv) groups of RRAM crossbars, also as shown in Eq. 5. Moreover, the input feature map is also split into X_in^(Conv) groups, and a partial sum is obtained from each group of crossbars. An adder tree needs to be cascaded after the crossbar groups in order to merge the X_in^(Conv) partial sums.
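As a worked example of Eq. 5, the sketch below computes the splitting factors for a VGG16-style 3×3×512 Conv layer with 512 output channels, assuming the (M, N) = (128, 128) crossbar size used later in Section IV; the function name is illustrative only.

```python
import math

def splitting_factors(h, w, c_in, c_out, m=128, n=128):
    """Eq. 5: number of crossbar groups needed along each dimension.

    n : cross-point count per column (rows of one crossbar)
    m : column count per crossbar
    """
    x_in = math.ceil(h * w * c_in / n)   # row-splitting groups
    x_out = math.ceil(c_out / m)         # column-splitting groups
    return x_in, x_out

# 3x3x512 kernels, 512 output channels (a large VGG16-style Conv layer)
x_in, x_out = splitting_factors(3, 3, 512, 512)
print(x_in, x_out)  # 36 row groups and 4 column groups -> 144 crossbar groups
```

Each of the X_in row groups produces a partial sum, which the adder tree merges before the 1-bit quantization discussed next.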
The intermediate data before the adder tree still use high precision. However, since the cascaded digital functions, i.e. the non-linear function and BN, are monotonically increasing functions, the 1-bit quantization can be merged with these functions by changing the threshold and the output data range. Therefore, the result after the addition can also be only 1 bit, which provides the potential for using lower-precision intermediate data in the addition. Based on this observation, we reduce the ADC precision to 4 bits, which saves a large amount of overhead, especially when the splitting amount is large.
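The threshold merging works because sign(BN(x)) only depends on whether x crosses a single point. Below is a minimal sketch of this observation under an assumed BN parameterization (gamma > 0 is taken for simplicity; the comparison flips when gamma < 0); it is an illustration, not the accelerator's exact digital datapath.

```python
import numpy as np

def folded_threshold(gamma, beta, mean, var, eps=1e-5):
    """Solve BN(x) = 0 for x, so that sign(BN(x)) == sign(x - threshold)."""
    return mean - beta * np.sqrt(var + eps) / gamma

def binarize_after_bn(merged_partial_sum, gamma, beta, mean, var):
    # Equivalent to binary_neuron(batch_norm(x, ...)) when gamma > 0:
    # the adder-tree output is only compared against one precomputed threshold,
    # so no full-precision value has to travel further downstream.
    t = folded_threshold(gamma, beta, mean, var)
    return np.where(merged_partial_sum > t, 1, -1)

x = np.array([-3.0, 0.5, 4.0])
print(binarize_after_bn(x, gamma=2.0, beta=1.0, mean=0.5, var=4.0))  # [-1  1  1]
```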
3) Signal Splitting: The resistance of an RRAM device is positive, i.e. it cannot represent negative values. Therefore, it is necessary to map one weight matrix onto a crossbar pair: one crossbar for the positive weights (+1) and the other for the negative ones (−1).
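A small sketch of this crossbar-pair mapping follows; the g_on/g_off conductance values are placeholders, and the output is modeled as the difference of the two column currents from Eq. 4.

```python
import numpy as np

G_ON, G_OFF = 1.0, 0.01   # placeholder high/low conductance levels

def to_crossbar_pair(w_binary):
    """Map a +1/-1 weight matrix onto a positive and a negative crossbar."""
    g_pos = np.where(w_binary > 0, G_ON, G_OFF)
    g_neg = np.where(w_binary < 0, G_ON, G_OFF)
    return g_pos, g_neg

def crossbar_pair_mvm(v_in, g_pos, g_neg):
    # Eq. 4 on each crossbar; subtracting the two column currents recovers
    # the signed matrix-vector product up to a constant scale.
    return g_pos.T @ v_in - g_neg.T @ v_in

w = np.sign(np.random.default_rng(1).standard_normal((128, 64)))  # +1/-1 weights
v = np.random.default_rng(2).standard_normal(128)
g_pos, g_neg = to_crossbar_pair(w)
approx = crossbar_pair_mvm(v, g_pos, g_neg) / (G_ON - G_OFF)
print(np.allclose(approx, w.T @ v))   # True: the pair reproduces the signed product
```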
Fig. 3: The Dataflow of Conv-Conv. The first line shows the Line Buffer's input data of the previous Conv Layer in each cycle, and the third line shows the input data of the next Conv Layer. The second and fourth lines show whether the Convolvers are Awake (A) or Sleep (S).

Fig. 4: The Dataflow of Conv-Pooling. The first line shows the Line Buffer's input data of the previous Conv Layer in each cycle; the third line shows the input data of the next Pooling Layer; the second and fourth lines show whether the Convolvers are Awake (A) or Sleep (S).

B. Line Buffer & Pipeline Implementation

The sliding window exists in the Conv Layers. Data dependency analysis shows that the Convolver Circuit can wake up (A) from sleep (S) once input data of the Conv kernel size have arrived. The structure of the Line Buffer is therefore introduced for the following reasons: first, far fewer registers are used for data buffering, since it is unnecessary to buffer the whole input feature map; second, with a Line Buffer introduced in every Conv/Pooling Layer, a pipeline can be implemented, which makes the forward process much faster than computing the Conv Layers one by one. The same holds for the Pooling Layers.

As the Pooling Layer optionally follows the Conv Layer, there exist two modes for the relationship between nearby layers: "Conv-Conv" and "Conv-Pooling-Conv". Here, we use the dataflow behavior of our CIFAR-10 on VGG11 experiment as a case study to show the line-buffer-based pipeline implementation.

1) Conv-Conv: For the Conv Layers in VGG11, the kernel size is set as 3 × 3, the stride is set as 1, and zero padding is introduced in order to keep the input and output feature maps the same size. When the feature map feeds in following the row-major order, the Line Buffer of each channel only needs (h − 1) · (W + p) + w registers. In the initial period of a layer, zero padding of length (W + p) and the first row x^(k)_{1,:} are sent sequentially into the k-th Conv Layer's Line Buffer before T_0, and in these cycles the Convolver Circuit of the k-th layer is in the sleep (S) mode. Finally, at cycle T_0, x^(k)_{2,1} is sent into the Line Buffer, as shown in the dataflow of Fig. 3. In the next cycle, the input field of the Line Buffer shown in Fig. 2(f) is filled with data, and therefore the k-th Conv Layer starts at time T_1. Additionally, at the end of the layer's computation, (W + p) cycles are needed for computing the last row, just like the initial cycles.
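As a quick check of the register count (h − 1) · (W + p) + w quoted above, the snippet below compares it against buffering the whole feature map; the concrete kernel and map sizes are illustrative assumptions, not figures from the paper.

```python
def line_buffer_registers(h, w, feature_width, p=1):
    """Registers per input channel for a row-major line buffer: (h-1)*(W+p) + w."""
    return (h - 1) * (feature_width + p) + w

# 3x3 kernel sliding over a 32-pixel-wide feature map, one column of zero padding
print(line_buffer_registers(3, 3, 32))   # 69 registers per channel
print(32 * 32)                           # vs. 1024 registers to hold the full map
```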
A main challenge for the Conv-Conv pipeline design is the sleep control for the "line feed" problem. When the computation of a row is finished, the input data need to be changed from the end of the current line to the front of the next line, which means at least (w − 1) data (usually we have w ≥ 3) in the next layer need to be prepared. However, for the line-buffer-based pipeline design, the input field shown in Fig. 2(f) is invalid during these preparing cycles, e.g. (x^(k)_{i−3,1}, 0, x^(k)_{i−2,W}; x^(k)_{i−2,1}, 0, x^(k)_{i−1,W}; x^(k)_{i−1,1}, 0, x^(k)_{i,W}). In these cycles, the Convolver of the k-th layer is also in the S mode, while the Line Buffer of the (k + 1)-th layer has no valid input. Fortunately, we find that the zero padding of the next Conv Layer can exploit exactly this cycle. In the next cycle, i.e. cycle T_{m(W+1)+2} (m = 0, 1, ...), the Convolver of the k-th layer returns to awake (A), while the Convolver of the (k + 1)-th layer begins to sleep (S) for its own line feed. In this way, the "line feed" problem is solved by utilizing the extra sleep cycle in each layer for zero padding, and the design works in fully pipelined parallelism without waiting. Based on this structure, we achieve the theoretically fewest cycles for Conv-Conv pipeline connections.

2) Conv-Pooling-Conv: For the Pooling Layers in VGG11, the kernel size is set as 2 × 2, the stride is set as 2, and there is no zero padding. As the stride is larger than 1, the pooling circuit works on one row only when every s rows are ready. In the Conv-Pooling pipeline, as shown in Fig. 4, the k-th Pool Circuit sleeps from cycle T_{W+3} to T_{2W+1}. For the awake row, the pooling circuit works once every s data are sent into the pooling Line Buffer; as shown in Fig. 4, the k-th Pool Circuit awakens for one cycle and sleeps for one cycle from cycle T_3 to T_{W+1}. Although the "line feed" problem also exists here, the sleep cycles can be hidden within the "sleep row", and zero padding is not introduced in the Pooling Layer, as shown in cycles T_1 and T_{W+2}. For the Pooling-Conv Line Buffer of the next layer, it is simply the turn of zero padding in this cycle, as in the Conv-Conv pipeline.

Finally, in the pipeline implementation, the total cycle count for one complete forward process is shown in Eq. 6:

$$ T_{pip} = (W^{(1)} + p)(H^{(1)} + 2p) + \sum_{i \in Conv,\, i > 1} (W^{(i)} + p) + \sum_{i \in Pool} 1 + \sum_{j \in FC} 1 \qquad (6) $$

Here, (W^{(1)} + p)(H^{(1)} + 2p) is the computation cycle count of the first Conv Layer. After that, whenever the cascaded layer is a Conv Layer, (W + p) cycles are needed for computing the last row; otherwise, only one extra cycle is needed to compute the last pixel of the next Pooling Layer or to perform an FC Layer. The pipelined cycle count is much smaller than that of the straightforward layer-by-layer design, whose cycle count is:

$$ \sum_{i \in Conv} (W^{(i)} + p)(H^{(i)} + 2p) + \sum_{i \in Pool} W^{(i+1)} H^{(i+1)} + \sum_{j \in FC} 1 \qquad (7) $$
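To see how much the pipeline shortens the forward pass, the sketch below evaluates Eq. 6 and Eq. 7 for an assumed VGG11-like stack on 32×32 inputs; the layer list and resulting cycle counts are illustrative, not measured results from the paper.

```python
def pipelined_cycles(conv_sizes, n_pool, n_fc, p=1):
    """Eq. 6: (W1+p)(H1+2p) for the first Conv Layer, (Wi+p) per later Conv Layer,
    plus one extra cycle per Pooling and per FC Layer."""
    (h1, w1), rest = conv_sizes[0], conv_sizes[1:]
    return (w1 + p) * (h1 + 2 * p) + sum(w + p for (_, w) in rest) + n_pool + n_fc

def layer_by_layer_cycles(conv_sizes, pool_out_sizes, n_fc, p=1):
    """Eq. 7: every Conv Layer pays its full (Wi+p)(Hi+2p) sweep, every Pooling Layer
    pays the size of its output feature map, plus one cycle per FC Layer."""
    conv = sum((w + p) * (h + 2 * p) for (h, w) in conv_sizes)
    pool = sum(h * w for (h, w) in pool_out_sizes)
    return conv + pool + n_fc

# Illustrative VGG11-like stack on 32x32 inputs: 8 Conv, 5 Pooling, 3 FC layers
conv_sizes = [(32, 32), (16, 16), (8, 8), (8, 8), (4, 4), (4, 4), (2, 2), (2, 2)]
pool_out_sizes = [(16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
print(pipelined_cycles(conv_sizes, 5, 3))                    # 1181 cycles (Eq. 6)
print(layer_by_layer_cycles(conv_sizes, pool_out_sizes, 3))  # 2036 cycles (Eq. 7)
```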
IV. EXPERIMENTAL RESULTS

A. Experiment Setup

In this section, the models of LeNet and AlexNet are used on the MNIST and ImageNet datasets, respectively. The multi-bit model is obtained by dynamically quantizing [4] the well-trained floating-point model into 8 bits, while the BCNN model is obtained by following the training algorithm of BinaryNet [11]. The single crossbar size is set as (M, N) = (128, 128). If one crossbar pair is not large enough to store all the parameters of one layer, parameter splitting is done as shown in Section III-A. For the multi-bit CNN RRAM-based accelerator, 8-bit RRAM devices and 8-bit interfaces are introduced. The BCNN system is implemented as proposed in Section III: the same bit-level RRAM devices are used as in the multi-bit CNN system, and the interface is binary when matrix splitting is not necessary, and 4 bits when it is.

TABLE I: Error Rate of LeNet on MNIST: Device Variation Effects Under Different Weight Bit-Levels

Weight Bit Level | RRAM Bit Level^a | Full Bit-level Mode, No Variation | Full Bit-level Mode, With Variation | Binary Mode, No Variation | Binary Mode, With Variation
8 bit | 7 bit | 0.58% | 0.58% | 0.73% | 0.74%
6 bit | 5 bit | 0.60% | 0.59% | 0.73% | 0.75%
4 bit | 3 bit | 0.80% | 1.21% | 0.73% | 0.75%
2 bit | 1 bit | 90.67% | 89.10% | 0.73% | 0.86%

^a The bit level of the RRAM devices is 1 bit less than that of the weight parameters because of the signal splitting.

TABLE II: Amount and Processing Count of Computing Units, Interfaces and Buffers

Module | Layer | Amount | Processing Count
RRAM cell | Conv | (h · w · C_in) · C_out · X_out · X_out | H_out · W_out
DAC | Conv | (h · w · C_in) · X_out | H_out · W_out
SA&ADC | Conv | C_out · X_in | H_out · W_out
Feature Map Buffer | Conv | h · w · C_in | H_out · W_out
Line Buffer | Conv | h · W_in · C_in | H_out · W_out
Line Buffer | Pooling | h · W_in · C_in | H_out · W_out
RRAM cell | FC | C_in · C_out · X_out · X_out | 1
DAC | FC | C_in · X_out | 1
SA&ADC | FC | C_out · X_in | 1
Feature Map Buffer | FC | C_in | 1

TABLE III: Area and Power Cost of Circuit Elements

Element | Area | Power (mW)
1T1R RRAM device | (1 + W/L) · 3F² | 0.052^b
0T1R RRAM device | 4F² | 0.06^b
8-bit DAC | 3096T^a [16] | 30 [17]
Sense Amplifier | 244T [16] | 0.25 [18]
8-bit ADC | 2550T + 1 kΩ (≈450T) [16] | 35 [19]
4-bit ADC | 72T [20] | 12 [20]
8-bit SUB | 256T | 2.5×10⁻⁶ ^c
1-bit ADC | 244T | 1.73 [21]
32-bit SRAM Cache | – | 0.064^c

^a T = W/L · F², where W/L = 3 and the technology node F = 45 nm.
^b The power consumption of an RRAM cell is estimated by V²_avg · g_avg, where g_avg = √(g_on · g_off) [22].
^c The energy consumption of digital arithmetic logic and memory access refers to the energy table under the 45 nm CMOS technology node [23]. The system clock is assumed to be 100 MHz, which is determined by the speed of the ADC/DACs and the latency of the RRAM crossbar [24].

TABLE IV: Energy and Area Estimation of Different RRAM-based Crossbar PEs

Database | Performance | CNN | BCNN | Saving
MNIST | Energy (uJ/img) | 18.39 | 13.55 | 26.3%
MNIST | Area (mm²) | 0.054 | 0.060 | -11.1%
ImageNet | Energy (uJ/img) | 5444.85 | 2275.34 | 58.2%
ImageNet | Area (mm²) | 21.25 | 9.19 | 56.8%

In this section, we first explore the effect of variation under different weight bit levels; then a comparison of system efficiency is made between BCNNs on RRAM and multi-bit CNNs on RRAM.

B. Accuracy: Effects of Device Variation Under Different Bit-Levels

Variation exists when mapping weight parameters to RRAM devices, since it is a conductance range (not a specific conductance value) that represents one fixed-point number. When one RRAM device is able to represent N bits, 2^N conductance ranges represent 2^N fixed-point weights respectively. For the k-th conductance range, g(k) represents the center conductance, and (g(k) − Δg, g(k) + Δg) represents the conductance range, i.e. the device variation δg ranges over (−Δg, Δg). According to previous physical measurement results [25], we assume that the variation range Δg is the same for each conductance range. When the RRAM device is used in the binary mode, only two conductance ranges are picked out of the 2^N ones. In this way, the expectation of (δg/g) can be smaller than in the case that all 2^N ranges are in use (which we name the full bit-level mode), thus introducing less computing error for the matrix-vector multiplication.
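One way to see this robustness argument is to count how often a perturbed cell falls outside its intended conductance range when two levels versus 2^N levels share the same conductance window. The Monte-Carlo sketch below follows the ±Δg model described above; the conductance window and Δg value are placeholders, not measured device parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
G_MIN, G_MAX, DELTA_G = 0.1, 1.0, 0.08   # placeholder conductance window and variation

def misread_rate(n_levels, n_cells=100_000):
    """Store random levels, perturb each cell by delta_g in (-Δg, +Δg), read back the
    nearest level, and report how often the stored value is recovered incorrectly."""
    levels = np.linspace(G_MIN, G_MAX, n_levels)      # center conductances g(k)
    stored = rng.integers(0, n_levels, size=n_cells)
    g_read = levels[stored] + rng.uniform(-DELTA_G, DELTA_G, size=n_cells)
    recovered = np.abs(g_read[:, None] - levels[None, :]).argmin(axis=1)
    return np.mean(recovered != stored)

print(misread_rate(2))        # binary mode: the two levels are far apart, 0.0 misreads
print(misread_rate(2 ** 3))   # full 3-bit mode: ~17% of cells land in a neighboring range
```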
LeNet on the MNIST dataset is used as a case study to show the effects of device variation under different weight bit levels. Without considering device variation, a precise mapping is made from quantized fixed-point weight parameters to RRAM conductances in the full bit-level mode; in this case, the increasing recognition error rate mainly results from the quantization error. In the binary mode, the recognition performance stays the same for RRAM of different bit levels when neglecting device variation, though the recognition error is a bit higher than that of the full bit-level mode in the case of 7-bit and 5-bit RRAM, as listed in Table I.

When considering device variation, the recognition performance in the binary mode shows better robustness: in binary mode, device variation introduces less than a 0.01% error rate increase in the case of 3-bit (or larger bit-level) RRAM, while in the full bit-level mode, the recognition performance with 3-bit RRAM becomes worse than that in binary mode due to the larger effect of device variation.

C. Area and Energy Estimation Under Different Bit-Levels

Network models of LeNet on MNIST and AlexNet on ImageNet are used in the area and energy estimation. Moreover, we also profile the area and energy distribution among different circuit elements and among different layers of AlexNet. In our estimation, the crossbar-based computing units and the buffers are considered, while the consumption of interconnections is neglected. The amount and the processing count of each module are listed in Table II. Because of the sliding window operation, modules in Conv layers process H_out · W_out times in one forward pass. The area and power consumption of each circuit element are listed in Table III.
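The estimation itself is a roll-up of Tables II and III: multiply each module's per-unit cost by its amount, and weight the energy by the processing count of its layer. A compressed sketch of that bookkeeping follows; the per-unit power numbers and the example layer are placeholders rather than the exact Table III values.

```python
import math

CLOCK_HZ = 100e6                                                 # system clock assumed in Table III
POWER_MW = {"rram_cell": 0.052, "dac": 30.0, "sa_adc": 35.25}    # per-unit power, placeholders

def conv_layer_energy_uj(h, w, c_in, c_out, h_out, w_out, m=128, n=128):
    """Table II-style bookkeeping for one Conv layer, processed H_out*W_out times."""
    x_in, x_out = math.ceil(h * w * c_in / n), math.ceil(c_out / m)
    amounts = {"rram_cell": h * w * c_in * c_out,   # weight cells (before pairing/padding)
               "dac": h * w * c_in * x_out,         # inputs copied to each column group
               "sa_adc": c_out * x_in}              # one output per kernel per row group
    power_w = sum(POWER_MW[k] * v for k, v in amounts.items()) * 1e-3
    return power_w * (h_out * w_out) / CLOCK_HZ * 1e6

# AlexNet-style conv5: 3x3x384 kernels, 256 output channels, 13x13 output map
print(conv_layer_energy_uj(3, 3, 384, 256, 13, 13))
```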
The area and energy estimation is shown in Table IV. The experimental results show that BCNN on RRAM saves 58.2% of energy and 56.8% of area consumption for AlexNet on ImageNet compared with the multi-bit CNN. For both binary and multi-bit CNNs, the output interface takes up the largest part of the energy and area consumption. The area and energy distribution is shown in Fig. 5. In terms of area distribution among all layers, the FC layers take up the largest part, since the FC layers hold most of the weight parameters of the whole CNN. In terms of energy distribution, the Conv layers take up the largest part, because the sliding window of each Conv layer has to sweep through the whole feature map over many processing counts, while FC layers are processed only once. Comparing the area and energy distribution between BCNN and the multi-bit CNN, the overhead of the input interface is mostly saved; meanwhile, the overhead of the output interface is saved when the bit level of the partial sums decreases in the case of matrix splitting.

V. CONCLUSION

In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process. Moreover, the special design for BCNN is discussed in detail, especially the matrix splitting problem and the pipeline implementation. The robustness of BCNN on RRAM under device variation is demonstrated. Experimental results show that BCNN introduces negligible recognition accuracy loss for LeNet on MNIST. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% energy consumption and 56.8% area compared with the multi-bit CNN structure.


[Fig. 5 panels: area and energy distribution across the AlexNet layers (conv1–conv5, fc6–fc8) for the RRAM-based multi-bit CNN and the RRAM-based BCNN, broken down into RRAM cells, input interface, output interface (8-bit for the multi-bit CNN, 4-bit for the BCNN), and other circuits.]
Fig. 5: Power and Area Distribution on AlexNet

REFERENCES

[1] K. Simonyan et al., "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[2] J. Fan et al., "Human tracking using convolutional neural networks," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1610–1623, 2010.
[3] A. Karpathy et al., "Deep visual-semantic alignments for generating image descriptions," in Computer Vision and Pattern Recognition, 2015.
[4] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016, pp. 26–35.
[5] B. Li et al., "Merging the interface: Power, area and accuracy co-optimization for RRAM crossbar-based mixed-signal computing system," in DAC, 2015, p. 13.
[6] L. Xia et al., "Selected by input: Energy efficient structure for RRAM-based convolutional neural network," in DAC, 2016.
[7] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, vol. 43, 2016.
[8] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. ISCA, 2016.
[9] F. Alibart et al., "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, p. 075201, 2012.
[10] R. Degraeve et al., "Causes and consequences of the stochastic aspect of filamentary RRAM," Microelectronic Engineering, vol. 147, pp. 171–175, 2015.
[11] M. Courbariaux et al., "Binarized neural network: Training deep neural networks with weights and activations constrained to +1 or −1," arXiv preprint arXiv:1602.02830, 2016.
[12] M. Rastegari et al., "XNOR-Net: ImageNet classification using binary convolutional neural networks," arXiv preprint arXiv:1603.05279, 2016.
[13] S. Ioffe et al., "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[14] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM SIGPLAN Notices, vol. 49, no. 4, 2014, pp. 269–284.
[15] Y. Y. Chen et al., "Understanding of the endurance failure in scaled HfO2-based 1T1R RRAM through vacancy mobility degradation," in IEDM, 2012, pp. 20–3.
[16] R. St. Amant et al., "General-purpose code acceleration with limited-precision analog computation," ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 505–516, 2014.
[17] J. Proesel et al., "An 8-bit 1.5 GS/s flash ADC using post-manufacturing statistical selection," in CICC, 2010, pp. 1–4.
[18] S. Gupta et al., "Simulation and analysis of sense amplifier in submicron technology."
[19] S. Y.-S. Chen et al., "A 10b 600MS/s multi-mode CMOS DAC for multiple Nyquist zone operation," in 2011 Symposium on VLSI Circuits – Digest of Technical Papers, 2011.
[20] S. S. Chauhan, S. Manabala, S. Bose, and R. Chandel, "A new approach to design low power CMOS flash A/D converter," International Journal of VLSI Design & Communication Systems (VLSICS), vol. 2, no. 2, 2011.
[21] Siddharth et al., "Comparative study of CMOS op-amp in 45nm and 180nm technology," Journal of Engineering Research and Applications, vol. 4, pp. 64–67, 2014.
[22] X. Dong et al., "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," TCAD, vol. 31, no. 7, pp. 994–1007, 2012.
[23] S. Han et al., "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] S.-S. Sheu et al., "A 4Mb embedded SLC resistive-RAM macro with 7.2ns read-write random access time and 160ns MLC-access capability," in ISSCC, 2011.
[25] S. R. Lee et al., "Multi-level switching of triple-layered TaOx RRAM with excellent reliability for storage class memory," in Digest of Technical Papers – Symposium on VLSI Technology, 2012, pp. 71–72.
