
Binary Convolutional Neural Network on RRAM


Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang
Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList),
Tsinghua University, Beijing, China
e-mail: [email protected]

Abstract—Recent progress in the machine learning field makes low bit-level Convolutional Neural Networks (CNNs), even CNNs with binary weights and binary neurons, achieve satisfying recognition accuracy on the ImageNet dataset. Binary CNNs (BCNNs) make it possible to introduce low bit-level RRAM devices and low bit-level ADC/DAC interfaces into RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before. However, some design challenges still exist: (1) how to perform matrix splitting when one crossbar is not large enough to hold all parameters of one layer; (2) how to design the pipeline to accelerate the whole CNN forward process.

In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process. Moreover, the special design for BCNN is discussed in detail, especially the matrix splitting problem and the pipeline implementation. In our experiment, BCNNs on RRAM show much smaller accuracy loss than multi-bit CNNs for LeNet on MNIST when considering device variation. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% energy consumption and 56.8% area compared with the multi-bit CNN structure.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved great performance in various recognition tasks, including image classification [1], video tracking [2] and natural language processing [3]. At the same time, they require larger computational intensity and higher bandwidth than traditional non-CNN models [4]. The emerging RRAM-based Computing System (RCS) has been considered a promising solution for future CNN accelerators [5]–[8], where the RRAM-based crossbar can not only store the weight parameters of CNN models but also be used as the matrix-vector multiplier. In this way, the energy cost for data transfer is reduced and less bandwidth is required. Moreover, thanks to the crossbar-level parallelism, it also reduces the running time complexity from O(n²) to O(1).

However, the current resistance precision of RRAM devices is limited [9], and the impact of write variation and reliability problems increases with the bit level of the RRAM device [10]. On the other hand, the interfaces between the analog RRAM-based crossbars and the digital peripheral units take up most of the area and power consumption, which makes the whole RRAM-based computing system less efficient than expected [5]. Therefore, the high precision of data and weights in state-of-the-art CNNs becomes the main challenge for RRAM-based implementation.

Recently, researchers in the field of machine learning have demonstrated that Binary CNNs (BCNNs) achieve satisfying recognition accuracy on the ImageNet dataset [11], [12]. BCNNs use binary weights and data when processing the forward propagation. This provides a promising solution to break the high-precision limits in current RRAM-based CNN accelerator design. Faster read-and-write operations and better energy efficiency can be achieved by exploiting the binary characteristics of BCNNs.

Some challenges still exist in the RRAM-based BCNN accelerator design when the network scale increases. First, the length of a crossbar column is not large enough to hold all the weight parameters of one Convolution (Conv) kernel in large BCNNs like VGG [1]. Therefore, the operation of matrix splitting is inevitable, and the high-cost interfaces are still required for the intermediate data in splitting. Second, the size of intermediate data between layers increases rapidly with the network scale and introduces large overhead.

In this paper, an RRAM crossbar-based BCNN accelerator is proposed. The contributions of this paper include:

• In our BCNN accelerator design, the matrix splitting problem when mapping weight parameters to RRAM is discussed in detail. Thanks to the line buffer introduced for intermediate data buffering, a pipeline strategy is proposed for system efficiency.
• The robustness of BCNN on RRAM under device variation is demonstrated. For LeNet on MNIST, the binary CNN achieves a 0.75% error rate on 3-bit RRAM devices in the case of device variation.
• Experimental results show that 58.2% of energy and 56.8% of area consumption are saved when using BCNN for AlexNet on ImageNet.

The rest of this paper is organized as follows: Section II introduces the related background and the motivation of our work; Section III proposes the RRAM-based BCNN accelerator design, especially the pipeline design; Section IV uses the case studies of LeNet on MNIST and AlexNet on ImageNet to analyze recognition accuracy, area and energy efficiency; and Section V concludes the paper.

(This work was supported by 973 Project 2013CB329000, National Natural Science Foundation of China (No. 61622403, 61373026, 61261160501), and Brain Inspired Computing Research, Tsinghua University. We gratefully thank Dr. Xudong Fei from Huawei Co. for the discussion.)

II. PRELIMINARIES AND MOTIVATION

A. CNN

A typical CNN consists of a number of different kinds of layers that run sequentially, i.e. the output of the previous layer is the input of the next layer. The input/output of one layer is named "feature map" while the parameters of one layer are called "weights". In a standard CNN structure, cascaded Convolutional (Conv) layers (optionally followed by Neuron layers, Max Pooling layers, Normalization layers) are followed by one or more Fully-Connected layers [1].

Conv Layer can be expressed as in Eq. 1:

$$ f_{out}(x, y, z) = \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \sum_{k=0}^{C_{in}-1} f_{in}(x+i, y+j, k) \cdot c_z(i, j, k) \qquad (1) $$

where i, j, and k are the spatial coordinates of the three-dimensional (3-D) input feature map F_in with size H_in × W_in × C_in; x, y, and z are the coordinates of the output feature map F_out with size H_out × W_out × C_out; C_z is the z-th Conv kernel with size h × w × C_in; and there are C_out kernels in one Conv Layer. In this way, all the Conv kernel parameters in one layer form a 4-D blob with size (h, w, C_in, C_out). A sliding stride s is used to skip some pixels and reduce the computation amount, while zero padding is introduced when the convolution is processed at the edge of the feature maps and the pixels are not enough for one whole Conv kernel.
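To make Eq. 1 concrete, the following is a minimal NumPy sketch of one Conv Layer's forward pass with stride s and zero padding; the (H, W, C) array layout follows the notation above, while the function name and argument choices are illustrative and not taken from the paper.

```python
import numpy as np

def conv_layer_forward(f_in, kernels, stride=1, pad=0):
    """Direct implementation of Eq. 1 for one Conv Layer.

    f_in    : (H_in, W_in, C_in) input feature map
    kernels : (h, w, C_in, C_out) 4-D weight blob
    Returns : (H_out, W_out, C_out) output feature map
    """
    h, w, c_in, c_out = kernels.shape
    # Zero padding at the feature-map edges
    f_pad = np.pad(f_in, ((pad, pad), (pad, pad), (0, 0)))
    h_out = (f_pad.shape[0] - h) // stride + 1
    w_out = (f_pad.shape[1] - w) // stride + 1
    f_out = np.zeros((h_out, w_out, c_out))
    for x in range(h_out):
        for y in range(w_out):
            window = f_pad[x * stride:x * stride + h, y * stride:y * stride + w, :]
            for z in range(c_out):
                # Eq. 1: sum over i, j, k of f_in(x+i, y+j, k) * c_z(i, j, k)
                f_out[x, y, z] = np.sum(window * kernels[:, :, :, z])
    return f_out
```

With a 3×3 kernel, stride 1, and one pixel of zero padding, this reproduces the "same-size" input/output behaviour assumed later for the VGG-style Conv layers.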


Neuron Layer is attached after the Conv Layer and makes a nonlinear one-by-one mapping (y = f(x)). Binary Neurons are used in our BCNN system as proposed in BinaryNet [11]. The forward function can be expressed as in Eq. 2:

$$ y = \begin{cases} 1, & x > 0 \\ -1, & x \le 0 \end{cases} \qquad (2) $$

Max Pooling Layer is cascaded after the non-linear neurons. It picks the largest element in the neighboring area of the input feature map in order to reduce the data amount and keep the local invariance.

Fully-Connected (FC) Layer can be expressed as in Eq. 3:

$$ f_{out}(y) = \sum_{x=0}^{C_{in}-1} f_{in}(x) \cdot c(x, y) \qquad (3) $$

where x is the index of the 1-D input feature map vector F_in with length C_in; y is the index of the output feature map vector F_out with length C_out; and the 2-D weight matrix C is of size C_in × C_out.

Batch Normalization (BN) Layer [13], which solves the problem of internal covariate shift, has been introduced in most state-of-the-art network models, especially the binary network models, e.g. BinaryNet [11] and XNOR-Net [12]. The operation of the BN Layer can be abstracted into a linear one-to-one mapping, whose parameters are learned in the training process.
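As a quick illustration of how these layers compose in a BCNN forward pass, the sketch below chains Eq. 3, the BN affine mapping, and the binary neuron of Eq. 2 on ±1-encoded data. It is only an illustrative software model, not the accelerator's digital logic; all names are assumptions.

```python
import numpy as np

def binary_neuron(x):
    # Eq. 2: +1 for x > 0, -1 otherwise
    return np.where(x > 0, 1, -1)

def fc_layer(f_in, weights):
    # Eq. 3: 1-D input of length C_in, weight matrix of size C_in x C_out
    return f_in @ weights

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    # At inference time BN is a fixed linear one-to-one mapping
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# One FC -> BN -> binary neuron stage with binary (+1/-1) weights and inputs
rng = np.random.default_rng(0)
f_in = binary_neuron(rng.standard_normal(128))           # binary input features
weights = binary_neuron(rng.standard_normal((128, 64)))  # binary weight matrix
pre_act = batch_norm(fc_layer(f_in, weights),
                     gamma=1.0, beta=0.0, mean=0.0, var=1.0)
f_out = binary_neuron(pre_act)                            # binary output features
```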
Fig. 1: Structure of the RRAM-based Crossbar.

B. RRAM Device, Crossbar Array and Crossbar Interface

An RRAM device is a passive two-port element with multiple resistance states, and multiple devices can be used to build the crossbar structure. When the "matrix" is represented by the conductivity of the RRAM devices and the "vector" is represented by the input voltage signals, the RRAM crossbar is able to perform as an analog matrix-vector multiplier (MVM). Specifically, the relationship between the input and output signals can be expressed as in Eq. 4 [6]:

$$ i_{out}(k) = \sum_{j=0}^{N-1} g(k, j) \cdot v_{in}(j) \qquad (4) $$

where V_in is the vector of input voltages (indexed by j = 0, 1, ..., N−1), I_out is the vector of output currents (indexed by k = 0, 1, ..., M−1), and G is the conductivity matrix of the RRAM devices. Taking advantage of the natural "multiplication and merging" function of the crossbar structure, the RRAM crossbar can implement the Conv kernels and FC Layers in analog mode with high speed, small area, and low power [6].

For FC Layers, the weight matrices are directly mapped to the RRAM crossbars [5]. For Conv Layers, one Conv kernel is mapped to one RRAM column, and different columns in one crossbar correspond to different Conv kernels [6], as shown in Fig. 1.

C. Motivation

Compared with well-trained networks that use floating-point weight parameters and feature maps on CPU/GPU platforms, the RRAM devices and the crossbar interfaces can only support limited bit levels.

1) Limited Bit Levels of RRAM Devices: To the best of our knowledge, only 7-bit weights [9] are currently available for a single RRAM device. However, state-of-the-art fixed-point CNNs require 8- or 16-bit precision weights [4], [14]. As a result, multiple RRAM devices have to be used to represent one number [7], [8], and large energy overhead is introduced. Moreover, multi-bit devices suffer from more variation and reliability problems than single-bit devices [10], which decreases the recognition accuracy of the computing accelerator. Therefore, the precision of RRAM resistance levels limits both the energy efficiency and the accuracy of the RRAM-based computational system.

2) Limited Bit Levels of Crossbar Interfaces: Since the crossbars work in the analog mode, interfaces are needed for the transformation between the digital signals in the nearby computing units and the analog signals in the crossbar-based MVM. There are two kinds of interfaces in an RRAM-based computational system. On the one hand, the interfaces between the RRAM crossbar and the CPU, i.e. the input interface in the first layer and the output interface in the final layer, are required. On the other hand, the interfaces between RRAM crossbars in different layers are required in CNNs. This is because CNNs are not fully-connected networks: each RRAM crossbar needs to process multiple cycles with different inputs, and the temporary results of each cycle need to be buffered until all the neighboring results are obtained. The detailed function will be illustrated in Section III. An intuitive choice is to use DAC/ADCs as the interfaces, but huge overheads are introduced by high-precision ADC/DACs. Li [5] pointed out that 8-bit ADC/DACs contribute more than 85% of the area and power consumption of the whole RCS.

Therefore, it would contribute a lot to energy efficiency if a well-trained network model with low bit-level weight parameters and feature maps, especially binary ones, could be achieved.

D. Challenges of RRAM-based BCNN

Some recent papers have already shown that completely binary CNNs (BCNNs) are achievable if the 1-bit quantization is processed in training. Courbariaux [11] proposes a sampling method which trains the binary network together with the floating-point network, while Rastegari [12] proposes BinaryWeight by minimizing the binary quantization loss during training. Moreover, the weights and the feature maps are also binarized.

Based on these results, in this paper, we propose an RRAM crossbar-based BCNN accelerator, achieving higher energy efficiency compared with multi-bit CNNs. However, when the network scale increases, two main challenges limit the energy efficiency of the accelerator.

1) Splitting Interface Overhead: Splitting is required when the size of the Conv kernels is larger than the length of the crossbar column. State-of-the-art RRAM crossbars only achieve a column length of 512 [6]. A crossbar of such size is not able to hold some large Conv kernels, e.g. a Conv kernel with the size 4608 (= 3×3×512) in the VGG16 model. Therefore, the high-cost interfaces are still required, because the intermediate data in splitting need high precision. PRIME [7] and ISAAC [8] discussed matrix splitting methods for full-precision CNNs by using high-precision ADC/DACs, so the energy efficiency is still limited. Considering that BCNN provides the potential for low-precision crossbar interfaces, a BCNN-specific low-precision splitting structure is in demand.

2) Buffer Overhead: Since the RRAM crossbar uses multiple inputs in the same cycle, the processed data between layers can only be buffered by registers instead of RAMs. Therefore, thousands of registers and corresponding multiplexers are required for large networks. ISAAC [8] gives a rough design for pipelining the Conv operation of different layers, where weight duplication is introduced to balance the load of the pipeline, but a thorough discussion of the pipeline implementation is still lacking. Since only a few registers are used in each cycle, there exists parallel potential between layers to reduce the buffer size while boosting the processing speed. As a result, a pipeline design between layers is necessary for both CNN and BCNN accelerators.


Fig. 2: (a) Overall Structure of the RRAM-based BCNN Accelerator: “N Conv Layers + M FC Layers” with the Input Image and the
Output Recognition Result, and each Conv Layer Optionally Followed by the Pooling Layer; (b)-(d) The Dataflow of the “Conv Layer”,
“Pooling Layer”, and “FC Layer”; (e) The Convolver Circuit for one Conv/FC Layer on RRAM-based Platform; (f) The Conv Line Buffers.

III. RRAM-BASED BCNN ACCELERATOR DESIGN

As shown in Fig. 2 (a), the whole accelerator is made up of a series of Conv Layers cascaded by a series of FC Layers. The Pooling Layer module optionally follows the Conv Layer. The data paths for the Conv, Pooling, and FC Layers are respectively shown in Fig. 2 (b)-(d). Each layer consists of its own Input Buffer and Computing Circuit.

• Computing Circuit: For Conv and FC Layers, the Convolver Circuit is made up of the RRAM crossbar-based MVMs, as shown in Fig. 2 (e). Some digital peripheral units, including circuits for neurons and batch normalization, are also placed in front of or behind the crossbar groups. For Pooling Layers, the computing circuit can be easily implemented as a multi-input "OR" gate in the BCNN design.
• Input Buffer: For Conv and Pooling Layers, the sliding window operation exists. Therefore, the structure of the Line Buffer (LB) is introduced for intermediate data buffering and fetching, as shown in Fig. 2 (f). For FC Layers, regular buffers are used since nearby layers are fully connected.

In this section, we first discuss the design of the Convolver Circuit and matrix splitting in III-A; we then discuss the design of intermediate data buffering and the implementation of the pipeline in III-B.

A. Convolver Circuit: The Problem of Matrix Splitting

Considering both the large energy cost of the writing operation and the endurance limit of RRAM devices [15], reusing an RRAM crossbar through repeated read-and-write operations is not practical. Since high area density is an important advantage of RRAM, all the Conv kernels can be mapped onto the crossbars in the Convolver Circuit of the corresponding layer. In this way, each output channel is able to produce one output element in one processing cycle if enough data have been fed into this layer's Line Buffers by the former Conv Layer. However, matrix splitting is necessary for large Conv kernels, as discussed in Section II-D, and the same holds for large FC matrices.

1) Column Splitting: If the crossbar column count (M) is smaller than the Conv kernel count (C_out, the same as the output channel count) of this layer, then the C_out Conv kernels are split into X_out^(Conv) groups of RRAM crossbars, as shown in Eq. 5. Copies of the input feature map with the size of one Conv kernel (h · w · C_in) are sent to each group of crossbars.

$$ X_{in}^{(Conv)} = \left\lceil \frac{h \cdot w \cdot C_{in}}{N} \right\rceil, \qquad X_{out}^{(Conv)} = \left\lceil \frac{C_{out}}{M} \right\rceil \qquad (5) $$

2) Row Splitting: If the cross-point count (N) in one RRAM column is smaller than the Conv kernel size (h · w · C_in), then the elements of one Conv kernel are split into X_in^(Conv) groups of RRAM crossbars, also as shown in Eq. 5. Moreover, the input feature map is also split into X_in^(Conv) groups, and a partial sum is obtained from each group of crossbars. An adder tree needs to be cascaded after the crossbar groups in order to merge the X_in^(Conv) partial sums.
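As a worked example of Eq. 5, the sketch below computes the splitting factors for a VGG16-style 3×3×512 Conv layer with 512 output channels, assuming the (M, N) = (128, 128) crossbar size used later in Section IV; the function name is illustrative only.

```python
import math

def splitting_factors(h, w, c_in, c_out, m=128, n=128):
    """Eq. 5: number of crossbar groups needed along each dimension.

    n : cross-point count per column (rows of one crossbar)
    m : column count per crossbar
    """
    x_in = math.ceil(h * w * c_in / n)   # row-splitting groups
    x_out = math.ceil(c_out / m)         # column-splitting groups
    return x_in, x_out

# 3x3x512 kernels, 512 output channels (a large VGG16-style Conv layer)
x_in, x_out = splitting_factors(3, 3, 512, 512)
print(x_in, x_out)  # 36 row groups and 4 column groups -> 144 crossbar groups
```

Each of the X_in row groups produces a partial sum, which the adder tree merges before the 1-bit quantization discussed next.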
The intermediate data before the adder tree still use high precision. However, since the cascaded digital functions, i.e. the non-linear function and BN, are monotonically increasing functions, the 1-bit quantization can be merged with these functions by changing the threshold and the output data range. Therefore, the result after the addition can also be only 1 bit, which provides the potential for using lower-precision intermediate data in the addition. Based on this observation, we reduce the ADC precision to 4 bits, which saves a large amount of overhead, especially when the splitting amount is large.
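The threshold merging works because sign(BN(x)) only depends on whether x crosses a single point. Below is a minimal sketch of this observation under an assumed BN parameterization (gamma > 0 is taken for simplicity; the comparison flips when gamma < 0); it is an illustration, not the accelerator's exact digital datapath.

```python
import numpy as np

def folded_threshold(gamma, beta, mean, var, eps=1e-5):
    """Solve BN(x) = 0 for x, so that sign(BN(x)) == sign(x - threshold)."""
    return mean - beta * np.sqrt(var + eps) / gamma

def binarize_after_bn(merged_partial_sum, gamma, beta, mean, var):
    # Equivalent to binary_neuron(batch_norm(x, ...)) when gamma > 0:
    # the adder-tree output is only compared against one precomputed threshold,
    # so no full-precision value has to travel further downstream.
    t = folded_threshold(gamma, beta, mean, var)
    return np.where(merged_partial_sum > t, 1, -1)

x = np.array([-3.0, 0.5, 4.0])
print(binarize_after_bn(x, gamma=2.0, beta=1.0, mean=0.5, var=4.0))  # [-1  1  1]
```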
3) Signal Splitting: The resistance of an RRAM device is positive, i.e. it cannot represent negative values. Therefore, it is necessary to map one weight matrix onto a crossbar pair: one crossbar for the positive weights (+1) and the other for the negative ones (−1).
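A small sketch of this crossbar-pair mapping follows; the g_on/g_off conductance values are placeholders, and the output is modeled as the difference of the two column currents from Eq. 4.

```python
import numpy as np

G_ON, G_OFF = 1.0, 0.01   # placeholder high/low conductance levels

def to_crossbar_pair(w_binary):
    """Map a +1/-1 weight matrix onto a positive and a negative crossbar."""
    g_pos = np.where(w_binary > 0, G_ON, G_OFF)
    g_neg = np.where(w_binary < 0, G_ON, G_OFF)
    return g_pos, g_neg

def crossbar_pair_mvm(v_in, g_pos, g_neg):
    # Eq. 4 on each crossbar; subtracting the two column currents recovers
    # the signed matrix-vector product up to a constant scale.
    return g_pos.T @ v_in - g_neg.T @ v_in

w = np.sign(np.random.default_rng(1).standard_normal((128, 64)))  # +1/-1 weights
v = np.random.default_rng(2).standard_normal(128)
g_pos, g_neg = to_crossbar_pair(w)
approx = crossbar_pair_mvm(v, g_pos, g_neg) / (G_ON - G_OFF)
print(np.allclose(approx, w.T @ v))   # True: the pair reproduces the signed product
```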
Fig. 3: The Dataflow of Conv-Conv. The first line shows the Line Buffer's input data of the previous Conv Layer in each cycle, and the third line shows the input data of the next Conv Layer. The second and fourth lines show whether the Convolvers are Awake (A) or Sleep (S).

Fig. 4: The Dataflow of Conv-Pooling. The first line shows the Line Buffer's input data of the previous Conv Layer in each cycle; the third line shows the input data of the next Pooling Layer; the second and fourth lines show whether the Convolvers are Awake (A) or Sleep (S).

B. Line Buffer & Pipeline Implementation

The sliding window exists in the Conv Layers. Data dependency analysis shows that the Convolver Circuit can wake up (A) from sleep (S) once input data of the Conv kernel size have arrived. The structure of the Line Buffer is therefore introduced for the following reasons: first, far fewer registers are used for data buffering, since it is unnecessary to buffer the whole input feature map; second, with a Line Buffer introduced in every Conv/Pooling Layer, a pipeline can be implemented, which makes the forward process much faster than computing the Conv Layers one by one. The same holds for the Pooling Layers.

As the Pooling Layer optionally follows the Conv Layer, there exist two modes for the relationship between nearby layers: "Conv-Conv" and "Conv-Pooling-Conv". Here, we use the dataflow behavior of our CIFAR-10 on VGG11 experiment as a case study to show the line-buffer-based pipeline implementation.

1) Conv-Conv: For the Conv Layers in VGG11, the kernel size is set as 3 × 3, the stride is set as 1, and zero padding is introduced in order to keep the input and output feature maps the same size. When the feature map feeds in following the row-major order, the Line Buffer of each channel only needs (h − 1) · (W + p) + w registers. In the initial period of a layer, zero padding of length (W + p) and the first row x^(k)_{1,:} are sent sequentially into the k-th Conv Layer's Line Buffer before T_0, and in these cycles the Convolver Circuit of the k-th layer is in the sleep (S) mode. Finally, at cycle T_0, x^(k)_{2,1} is sent into the Line Buffer, as shown in the dataflow of Fig. 3. In the next cycle, the input field of the Line Buffer shown in Fig. 2(f) is filled with data, and therefore the k-th Conv Layer starts at time T_1. Additionally, at the end of the layer's computation, (W + p) cycles are needed for computing the last row, just like the initial cycles.
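As a quick check of the register count (h − 1) · (W + p) + w quoted above, the snippet below compares it against buffering the whole feature map; the concrete kernel and map sizes are illustrative assumptions, not figures from the paper.

```python
def line_buffer_registers(h, w, feature_width, p=1):
    """Registers per input channel for a row-major line buffer: (h-1)*(W+p) + w."""
    return (h - 1) * (feature_width + p) + w

# 3x3 kernel sliding over a 32-pixel-wide feature map, one column of zero padding
print(line_buffer_registers(3, 3, 32))   # 69 registers per channel
print(32 * 32)                           # vs. 1024 registers to hold the full map
```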
A main challenge for the Conv-Conv pipeline design is the sleep control for the "line feed" problem. When the computation of a row is finished, the input data need to be changed from the end of the current line to the front of the next line, which means at least (w − 1) data (usually we have w ≥ 3) in the next layer need to be prepared. However, for the line-buffer-based pipeline design, the input field shown in Fig. 2(f) is invalid during these preparing cycles, e.g. (x^(k)_{i−3,1}, 0, x^(k)_{i−2,W}; x^(k)_{i−2,1}, 0, x^(k)_{i−1,W}; x^(k)_{i−1,1}, 0, x^(k)_{i,W}). In these cycles, the Convolver of the k-th layer is also in the S mode, while the Line Buffer of the (k + 1)-th layer has no valid input. Fortunately, we find that the zero padding of the next Conv Layer can exploit exactly this cycle. In the next cycle, i.e. cycle T_{m(W+1)+2} (m = 0, 1, ...), the Convolver of the k-th layer returns to awake (A), while the Convolver of the (k + 1)-th layer begins to sleep (S) for its own line feed. In this way, the "line feed" problem is solved by utilizing the extra sleep cycle in each layer for zero padding, and the design works in fully pipelined parallelism without waiting. Based on this structure, we achieve the theoretically fewest cycles for Conv-Conv pipeline connections.

2) Conv-Pooling-Conv: For the Pooling Layers in VGG11, the kernel size is set as 2 × 2, the stride is set as 2, and there is no zero padding. As the stride is larger than 1, the pooling circuit works on one row only when every s rows are ready. In the Conv-Pooling pipeline, as shown in Fig. 4, the k-th Pool Circuit sleeps from cycle T_{W+3} to T_{2W+1}. For the awake row, the pooling circuit works once every s data are sent into the pooling Line Buffer; as shown in Fig. 4, the k-th Pool Circuit awakens for one cycle and sleeps for one cycle from cycle T_3 to T_{W+1}. Although the "line feed" problem also exists here, the sleep cycles can be hidden within the "sleep row", and zero padding is not introduced in the Pooling Layer, as shown in cycles T_1 and T_{W+2}. For the Pooling-Conv Line Buffer of the next layer, it is simply the turn of zero padding in this cycle, as in the Conv-Conv pipeline.

Finally, in the pipeline implementation, the total cycle count for one complete forward process is shown in Eq. 6:

$$ T_{pip} = (W^{(1)} + p)(H^{(1)} + 2p) + \sum_{i \in Conv,\, i > 1} (W^{(i)} + p) + \sum_{i \in Pool} 1 + \sum_{j \in FC} 1 \qquad (6) $$

Here, (W^{(1)} + p)(H^{(1)} + 2p) is the computation cycle count of the first Conv Layer. After that, whenever the cascaded layer is a Conv Layer, (W + p) cycles are needed for computing the last row; otherwise, only one extra cycle is needed to compute the last pixel of the next Pooling Layer or to perform an FC Layer. The pipelined cycle count is much smaller than that of the straightforward layer-by-layer design, whose cycle count is:

$$ \sum_{i \in Conv} (W^{(i)} + p)(H^{(i)} + 2p) + \sum_{i \in Pool} W^{(i+1)} H^{(i+1)} + \sum_{j \in FC} 1 \qquad (7) $$
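To see how much the pipeline shortens the forward pass, the sketch below evaluates Eq. 6 and Eq. 7 for an assumed VGG11-like stack on 32×32 inputs; the layer list and resulting cycle counts are illustrative, not measured results from the paper.

```python
def pipelined_cycles(conv_sizes, n_pool, n_fc, p=1):
    """Eq. 6: (W1+p)(H1+2p) for the first Conv Layer, (Wi+p) per later Conv Layer,
    plus one extra cycle per Pooling and per FC Layer."""
    (h1, w1), rest = conv_sizes[0], conv_sizes[1:]
    return (w1 + p) * (h1 + 2 * p) + sum(w + p for (_, w) in rest) + n_pool + n_fc

def layer_by_layer_cycles(conv_sizes, pool_out_sizes, n_fc, p=1):
    """Eq. 7: every Conv Layer pays its full (Wi+p)(Hi+2p) sweep, every Pooling Layer
    pays the size of its output feature map, plus one cycle per FC Layer."""
    conv = sum((w + p) * (h + 2 * p) for (h, w) in conv_sizes)
    pool = sum(h * w for (h, w) in pool_out_sizes)
    return conv + pool + n_fc

# Illustrative VGG11-like stack on 32x32 inputs: 8 Conv, 5 Pooling, 3 FC layers
conv_sizes = [(32, 32), (16, 16), (8, 8), (8, 8), (4, 4), (4, 4), (2, 2), (2, 2)]
pool_out_sizes = [(16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
print(pipelined_cycles(conv_sizes, 5, 3))                    # 1181 cycles (Eq. 6)
print(layer_by_layer_cycles(conv_sizes, pool_out_sizes, 3))  # 2036 cycles (Eq. 7)
```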
IV. EXPERIMENTAL RESULTS

A. Experiment Setup

In this section, the models of LeNet and AlexNet are used on the MNIST and ImageNet datasets, respectively. The multi-bit model is obtained by dynamically quantizing [4] the well-trained floating-point model into 8 bits, while the BCNN model is obtained by following the training algorithm of BinaryNet [11]. The single crossbar size is set as (M, N) = (128, 128). If one crossbar pair is not large enough to store all the parameters of one layer, parameter splitting is done as shown in Section III-A. For the multi-bit CNN RRAM-based accelerator, 8-bit RRAM devices and 8-bit interfaces are introduced. The BCNN system is implemented as proposed in Section III: the same bit-level RRAM devices are used as in the multi-bit CNN system, and the interface is binary when matrix splitting is not necessary, and 4 bits when it is.

TABLE I: Error Rate of LeNet on MNIST: Device Variation Effects Under Different Weight Bit-Levels

Weight Bit Level | RRAM Bit Level^a | Full Bit-level Mode, No Variation | Full Bit-level Mode, With Variation | Binary Mode, No Variation | Binary Mode, With Variation
8 bit | 7 bit | 0.58% | 0.58% | 0.73% | 0.74%
6 bit | 5 bit | 0.60% | 0.59% | 0.73% | 0.75%
4 bit | 3 bit | 0.80% | 1.21% | 0.73% | 0.75%
2 bit | 1 bit | 90.67% | 89.10% | 0.73% | 0.86%

^a The bit level of the RRAM devices is 1 bit less than that of the weight parameters because of the signal splitting.

TABLE II: Amount and Processing Count of Computing Units, Interfaces and Buffers

Module | Layer | Amount | Processing Count
RRAM cell | Conv | (h · w · C_in) · C_out · X_out · X_out | H_out · W_out
DAC | Conv | (h · w · C_in) · X_out | H_out · W_out
SA&ADC | Conv | C_out · X_in | H_out · W_out
Feature Map Buffer | Conv | h · w · C_in | H_out · W_out
Line Buffer | Conv | h · W_in · C_in | H_out · W_out
Line Buffer | Pooling | h · W_in · C_in | H_out · W_out
RRAM cell | FC | C_in · C_out · X_out · X_out | 1
DAC | FC | C_in · X_out | 1
SA&ADC | FC | C_out · X_in | 1
Feature Map Buffer | FC | C_in | 1

TABLE III: Area and Power Cost of Circuit Elements

Element | Area | Power (mW)
1T1R RRAM device | (1 + W/L) · 3F² | 0.052^b
0T1R RRAM device | 4F² | 0.06^b
8-bit DAC | 3096T^a [16] | 30 [17]
Sense Amplifier | 244T [16] | 0.25 [18]
8-bit ADC | 2550T + 1 kΩ (≈450T) [16] | 35 [19]
4-bit ADC | 72T [20] | 12 [20]
8-bit SUB | 256T | 2.5×10⁻⁶ ^c
1-bit ADC | 244T | 1.73 [21]
32-bit SRAM Cache | – | 0.064^c

^a T = W/L · F², where W/L = 3 and the technology node F = 45 nm.
^b The power consumption of an RRAM cell is estimated by V²_avg · g_avg, where g_avg = √(g_on · g_off) [22].
^c The energy consumption of digital arithmetic logic and memory access refers to the energy table under the 45 nm CMOS technology node [23]. The system clock is assumed to be 100 MHz, which is determined by the speed of the ADC/DACs and the latency of the RRAM crossbar [24].

TABLE IV: Energy and Area Estimation of Different RRAM-based Crossbar PEs

Database | Performance | CNN | BCNN | Saving
MNIST | Energy (uJ/img) | 18.39 | 13.55 | 26.3%
MNIST | Area (mm²) | 0.054 | 0.060 | -11.1%
ImageNet | Energy (uJ/img) | 5444.85 | 2275.34 | 58.2%
ImageNet | Area (mm²) | 21.25 | 9.19 | 56.8%

In this section, we first explore the effect of variation under different weight bit levels; then a comparison of system efficiency is made between BCNNs on RRAM and multi-bit CNNs on RRAM.

B. Accuracy: Effects of Device Variation Under Different Bit-Levels

Variation exists when mapping weight parameters to RRAM devices, since it is a conductance range (not a specific conductance value) that represents one fixed-point number. When one RRAM device is able to represent N bits, 2^N conductance ranges represent 2^N fixed-point weights respectively. For the k-th conductance range, g(k) represents the center conductance, and (g(k) − Δg, g(k) + Δg) represents the conductance range, i.e. the device variation δg ranges over (−Δg, Δg). According to previous physical measurement results [25], we assume that the variation range Δg is the same for each conductance range. When the RRAM device is used in the binary mode, only two conductance ranges are picked out of the 2^N ones. In this way, the expectation of (δg/g) can be smaller than in the case that all 2^N ranges are in use (which we name the full bit-level mode), thus introducing less computing error for the matrix-vector multiplication.
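One way to see this robustness argument is to count how often a perturbed cell falls outside its intended conductance range when two levels versus 2^N levels share the same conductance window. The Monte-Carlo sketch below follows the ±Δg model described above; the conductance window and Δg value are placeholders, not measured device parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
G_MIN, G_MAX, DELTA_G = 0.1, 1.0, 0.08   # placeholder conductance window and variation

def misread_rate(n_levels, n_cells=100_000):
    """Store random levels, perturb each cell by delta_g in (-Δg, +Δg), read back the
    nearest level, and report how often the stored value is recovered incorrectly."""
    levels = np.linspace(G_MIN, G_MAX, n_levels)      # center conductances g(k)
    stored = rng.integers(0, n_levels, size=n_cells)
    g_read = levels[stored] + rng.uniform(-DELTA_G, DELTA_G, size=n_cells)
    recovered = np.abs(g_read[:, None] - levels[None, :]).argmin(axis=1)
    return np.mean(recovered != stored)

print(misread_rate(2))        # binary mode: the two levels are far apart, 0.0 misreads
print(misread_rate(2 ** 3))   # full 3-bit mode: ~17% of cells land in a neighboring range
```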
LeNet on the MNIST dataset is used as a case study to show the effects of device variation under different weight bit levels. Without considering device variation, a precise mapping is made from quantized fixed-point weight parameters to RRAM conductances in the full bit-level mode; in this case, the increasing recognition error rate mainly results from the quantization error. In the binary mode, the recognition performance stays the same for RRAM of different bit levels when neglecting device variation, though the recognition error is a bit higher than that of the full bit-level mode in the case of 7-bit and 5-bit RRAM, as listed in Table I.

When considering device variation, the recognition performance in the binary mode shows better robustness: in binary mode, device variation introduces less than a 0.01% error rate increase in the case of 3-bit (or larger bit-level) RRAM, while in the full bit-level mode, the recognition performance with 3-bit RRAM becomes worse than that in binary mode due to the larger effect of device variation.

C. Area and Energy Estimation Under Different Bit-Levels

Network models of LeNet on MNIST and AlexNet on ImageNet are used in the area and energy estimation. Moreover, we also profile the area and energy distribution among different circuit elements and among different layers of AlexNet. In our estimation, the crossbar-based computing units and the buffers are considered, while the consumption of interconnections is neglected. The amount and the processing count of each module are listed in Table II. Because of the sliding window operation, modules in Conv layers process H_out · W_out times in one forward pass. The area and power consumption of each circuit element are listed in Table III.
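The estimation itself is a roll-up of Tables II and III: multiply each module's per-unit cost by its amount, and weight the energy by the processing count of its layer. A compressed sketch of that bookkeeping follows; the per-unit power numbers and the example layer are placeholders rather than the exact Table III values.

```python
import math

CLOCK_HZ = 100e6                                                 # system clock assumed in Table III
POWER_MW = {"rram_cell": 0.052, "dac": 30.0, "sa_adc": 35.25}    # per-unit power, placeholders

def conv_layer_energy_uj(h, w, c_in, c_out, h_out, w_out, m=128, n=128):
    """Table II-style bookkeeping for one Conv layer, processed H_out*W_out times."""
    x_in, x_out = math.ceil(h * w * c_in / n), math.ceil(c_out / m)
    amounts = {"rram_cell": h * w * c_in * c_out,   # weight cells (before pairing/padding)
               "dac": h * w * c_in * x_out,         # inputs copied to each column group
               "sa_adc": c_out * x_in}              # one output per kernel per row group
    power_w = sum(POWER_MW[k] * v for k, v in amounts.items()) * 1e-3
    return power_w * (h_out * w_out) / CLOCK_HZ * 1e6

# AlexNet-style conv5: 3x3x384 kernels, 256 output channels, 13x13 output map
print(conv_layer_energy_uj(3, 3, 384, 256, 13, 13))
```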
The area and energy estimation is shown in Table IV. The experimental results show that BCNN on RRAM saves 58.2% of energy and 56.8% of area consumption for AlexNet on ImageNet compared with the multi-bit CNN. For both binary and multi-bit CNNs, the output interface takes up the largest part of the energy and area consumption. The area and energy distribution is shown in Fig. 5. In terms of area distribution among all layers, the FC layers take up the largest part, since the FC layers hold most of the weight parameters of the whole CNN. In terms of energy distribution, the Conv layers take up the largest part, because the sliding window of each Conv layer has to sweep through the whole feature map over many processing counts, while FC layers are processed only once. Comparing the area and energy distribution between BCNN and the multi-bit CNN, the overhead of the input interface is mostly saved; meanwhile, the overhead of the output interface is saved when the bit level of the partial sums decreases in the case of matrix splitting.

V. CONCLUSION

In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process. Moreover, the special design for BCNN is discussed in detail, especially the matrix splitting problem and the pipeline implementation. The robustness of BCNN on RRAM under device variation is demonstrated. Experimental results show that BCNN introduces negligible recognition accuracy loss for LeNet on MNIST. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% energy consumption and 56.8% area compared with the multi-bit CNN structure.


[Fig. 5 panels: area and energy distribution across the AlexNet layers (conv1–conv5, fc6–fc8) for the RRAM-based multi-bit CNN and the RRAM-based BCNN, broken down into RRAM cells, input interface, output interface (8-bit for the multi-bit CNN, 4-bit for the BCNN), and other circuits.]
Fig. 5: Power and Area Distribution on AlexNet

REFERENCES

[1] K. Simonyan et al., "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[2] J. Fan et al., "Human tracking using convolutional neural networks," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1610–1623, 2010.
[3] A. Karpathy et al., "Deep visual-semantic alignments for generating image descriptions," in Computer Vision and Pattern Recognition, 2015.
[4] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016, pp. 26–35.
[5] B. Li et al., "Merging the interface: Power, area and accuracy co-optimization for RRAM crossbar-based mixed-signal computing system," in DAC, 2015, p. 13.
[6] L. Xia et al., "Selected by input: Energy efficient structure for RRAM-based convolutional neural network," in DAC, 2016.
[7] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, vol. 43, 2016.
[8] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. ISCA, 2016.
[9] F. Alibart et al., "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, p. 075201, 2012.
[10] R. Degraeve et al., "Causes and consequences of the stochastic aspect of filamentary RRAM," Microelectronic Engineering, vol. 147, pp. 171–175, 2015.
[11] M. Courbariaux et al., "Binarized neural network: Training deep neural networks with weights and activations constrained to +1 or −1," arXiv preprint arXiv:1602.02830, 2016.
[12] M. Rastegari et al., "XNOR-Net: ImageNet classification using binary convolutional neural networks," arXiv preprint arXiv:1603.05279, 2016.
[13] S. Ioffe et al., "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[14] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM SIGPLAN Notices, vol. 49, no. 4, 2014, pp. 269–284.
[15] Y. Y. Chen et al., "Understanding of the endurance failure in scaled HfO2-based 1T1R RRAM through vacancy mobility degradation," in IEDM, 2012, pp. 20–3.
[16] R. St. Amant et al., "General-purpose code acceleration with limited-precision analog computation," ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 505–516, 2014.
[17] J. Proesel et al., "An 8-bit 1.5 GS/s flash ADC using post-manufacturing statistical selection," in CICC, 2010, pp. 1–4.
[18] S. Gupta et al., "Simulation and analysis of sense amplifier in submicron technology."
[19] S. Y.-S. Chen et al., "A 10b 600MS/s multi-mode CMOS DAC for multiple Nyquist zone operation," in 2011 Symposium on VLSI Circuits – Digest of Technical Papers, 2011.
[20] S. S. Chauhan, S. Manabala, S. Bose, and R. Chandel, "A new approach to design low power CMOS flash A/D converter," International Journal of VLSI Design & Communication Systems (VLSICS), vol. 2, no. 2, 2011.
[21] Siddharth et al., "Comparative study of CMOS op-amp in 45nm and 180nm technology," Journal of Engineering Research and Applications, vol. 4, pp. 64–67, 2014.
[22] X. Dong et al., "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," TCAD, vol. 31, no. 7, pp. 994–1007, 2012.
[23] S. Han et al., "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] S.-S. Sheu et al., "A 4Mb embedded SLC resistive-RAM macro with 7.2ns read-write random access time and 160ns MLC-access capability," in ISSCC, 2011.
[25] S. R. Lee et al., "Multi-level switching of triple-layered TaOx RRAM with excellent reliability for storage class memory," in Digest of Technical Papers – Symposium on VLSI Technology, 2012, pp. 71–72.
