Stripes: Bit-Serial Deep Neural Network Computing

Patrick Judd, Jorge Albericio, and Andreas Moshovos

The authors are with The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto, Toronto, ON M5S 3H7, Canada. E-mail: [email protected], [email protected], [email protected].
Manuscript received 10 Mar. 2016; accepted 7 Apr. 2016. Date of publication 1 Aug. 2016; date of current version 26 June 2017.
Digital Object Identifier no. 10.1109/LCA.2016.2597140

Abstract—The numerical representation precision required by the computations performed by Deep Neural Networks (DNNs) varies across networks and between layers of the same network. This observation motivates a precision-based approach to acceleration that takes into account both the computational structure and the required numerical precision representation. This work presents Stripes (STR), a hardware accelerator that uses bit-serial computations to improve energy efficiency and performance. Experimental measurements over a set of state-of-the-art DNNs for image classification show that STR improves performance over a state-of-the-art accelerator from 1.35× to 5.33× and by 2.24× on average. STR's area and power overhead are estimated at 5 percent and 12 percent respectively. STR is 2.00× more energy efficient than the baseline.

Index Terms—Hardware acceleration, deep learning, deep neural networks, convolution, numerical representation, serial computing

1 INTRODUCTION

Deep neural networks (DNNs) are the state-of-the-art technique in many recognition tasks, spanning object [1] to speech recognition [2]. Deep Neural Networks comprise a feed-forward arrangement of layers, each exhibiting high computational demands. They also exhibit a high degree of parallelism, which is commonly exploited with the use of Graphics Processing Units (GPUs). However, the high computation demands of DNNs and the need for higher energy efficiency motivated special-purpose architectures such as the state-of-the-art DaDianNao (DN) [3], whose power efficiency is up to 330× better than a GPU. As long as additional parallelism can be found, both DN and GPU performance can be improved by introducing additional compute units. However, improving performance requires a proportional increase in units and thus at least a proportional increase in power. As power tends to be the limiting factor in modern high-performance designs, it is desirable to achieve better energy efficiency and thus performance under given power constraints.

This work presents Stripes (STR), a DNN performance improvement technique that: 1) is complementary to existing techniques that exploit parallelism across computations, and 2) offers better energy efficiency. STR goes beyond parallelism across computations and exploits the data value representation requirements of DNNs. STR is motivated by recent work showing that the precision required by DNNs varies significantly not only across networks but also across the layers of the same network [4], [5]. Most existing implementations rely on a one-size-fits-all approach, using the worst-case numerical precision for all values. For example, most software implementations use 32-bit floating-point [6], [7], while accelerators and some recent GPUs use 16-bit fixed-point [3], [8], [9].

In STR, execution time scales linearly with the length of the numerical precision needed by each layer. We present STR as an extension to the state-of-the-art accelerator DN. Since DN uses a 16-bit fixed-point representation, STR would ideally improve performance at each layer by 16/p, where p is the layer's required precision length in bits. The idea behind STR is simple: use bit-serial computations and compensate for the increase in computation latency by exploiting the abundant parallelism offered by DNN layers. As an added benefit, using bit-serial computations eliminates the need for multipliers and enables additional energy/performance vs. accuracy trade-offs (e.g., on a battery-operated device a user may choose less accurate recognition in exchange for longer uptime). This work demonstrates that STR has the potential to improve performance with better energy scaling, and focuses solely on image classification, that is, recognizing the type of object depicted in an image.

Prior work has exploited reduced precision by turning off upper bit paths, improving energy but not performance [10]. STR is most similar to using serial multiplication on 2D convolution to improve energy [11]. However, STR targets 3D convolution and does not use a lookup table of precomputed results; since the filters in DNNs are much larger, this precomputation is intractable. Bit-serial computation has been used for neural networks in [12], but for a circuit with fixed synaptic weights.

The rest of the paper is organized as follows: Section 2 corroborates the per-layer precision requirement variability of DNNs, Section 3 reviews the DN design and presents Stripes, and Section 4 demonstrates STR's benefits experimentally. Finally, Section 5 summarizes the limitations of our study.

2 MOTIVATION: PRECISION VARIABILITY

Numerical precision requirements vary significantly across networks and across layers within a network [4], [5]. Table 1 reports the fixed-point representation needed for each convolutional layer to maintain the network's classification accuracy of the baseline 16-bit implementation. We extend the approach of Judd et al. [4] by considering a different fixed-point format and including three additional networks [13], [14]. The precision needed varies from as much as 14 bits (layer 1, GoogLeNet) to as little as 2 bits (layer 1, LeNet). We will focus on convolutional layers since they account for 90 percent of the processing time in DNNs [15].

DN's performance can be improved by 5.6× when scaling the system from 4 to 64 nodes, but with a 12.3× increase in power [3]. This result demonstrates that improving performance by exploiting cross-computational parallelism can result in a disproportionate increase in power. Since power is the main limiting factor in modern high-performance designs, once the power budget is reached, improving performance is only possible by increasing energy efficiency.

Stripes exploits this precision variability to do less work per neuron and thus improve energy efficiency. Compared to DN, which uses 16-bit neurons, Stripes incorporates units whose execution time is, ideally, p/16 of DN's when using a neuron representation of p bits. The Ideal column in Table 1 reports this ideal speedup over DN.
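To make the ideal-speedup arithmetic concrete, the short sketch below computes 16/p for each layer and a cycle-weighted overall figure. It is purely illustrative: the layer names, precisions, and baseline cycle counts are assumptions, not the values reported in Table 1; only the 16-bit baseline width is taken from DN.

```python
# Illustrative only: ideal speedup of bit-serial execution over a 16-bit
# bit-parallel baseline. Precisions and cycle counts are hypothetical,
# NOT the measured values of Table 1.
BASELINE_BITS = 16

layers = [
    # (layer name, required precision p in bits, baseline execution cycles)
    ("conv1", 10, 1_000_000),
    ("conv2", 8, 3_000_000),
    ("conv3", 12, 2_000_000),
]

baseline_cycles = sum(cycles for _, _, cycles in layers)
bit_serial_cycles = 0.0
for name, p, cycles in layers:
    print(f"{name}: p = {p:2d} bits -> ideal speedup {BASELINE_BITS / p:.2f}x")
    # Ideally, a bit-serial layer takes p/16 of its baseline execution time.
    bit_serial_cycles += cycles * p / BASELINE_BITS

print(f"overall ideal speedup: {baseline_cycles / bit_serial_cycles:.2f}x")
```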
3 STRIPES: A BIT-SERIAL DNN ACCELERATOR

This section first details the computation performed by convolutional layers and how the inner products can be transformed into a series of additions in a straightforward way. Second, we introduce the baseline system, the state-of-the-art DaDianNao accelerator [3]. Third, we present the Stripes accelerator, where the compute time of a particular layer is directly proportional to the corresponding required precision.

3.1 Bit-Serial Convolutional Layer Computation

The input to a convolutional layer is a 3D array of neurons. The layer applies N 3D filters using a constant stride S to produce an output 3D array of neurons. The input neuron array contains Nx × Ny × Ni real numbers, or neurons. The layer applies Nn filters,
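To spell out how such an inner product becomes a series of additions (the operation depicted in Fig. 1), write each p-bit neuron n_i through its binary expansion, with n_i^{(b)} denoting bit b of n_i (the notation here is ours, not the paper's):

\sum_i n_i \, w_i \;=\; \sum_i \Bigg( \sum_{b=0}^{p-1} n_i^{(b)} 2^b \Bigg) w_i \;=\; \sum_{b=0}^{p-1} 2^b \Bigg( \sum_i n_i^{(b)} w_i \Bigg).

Each of the p inner sums simply adds the weights whose corresponding neuron bit is one, so the whole inner product reduces to p rounds of additions followed by shifts, one round per bit of precision.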
TABLE 1
Per Convolutional Layer Neuron Precision Needed to Maintain Accuracy of the Baseline
Fig. 1. Bit-serial inner product.

3.3 Stripes

Fig. 3 shows an STR unit that: 1) offers the same computation throughput as a DN unit, and 2) needs p cycles to process a neuron represented in p bits. Since STR uses bit-serial computation, in the
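As a behavioral sketch of the p-cycle claim above, the following model of a single bit-serial lane is a speculative illustration, not the STR unit of Fig. 3 (which this excerpt does not reproduce): weights stay bit-parallel, one bit of each input neuron is consumed per cycle starting with the least significant bit, and after p cycles the accumulator holds the inner product. The 16-input width is chosen only for illustration.

```python
class BitSerialLane:
    """Behavioral sketch (not RTL) of a hypothetical bit-serial inner-product lane."""

    def __init__(self, weights):
        self.weights = list(weights)  # bit-parallel weights, one per input neuron
        self.acc = 0                  # shift-accumulated partial inner product
        self.cycle = 0                # neuron bit position consumed this cycle

    def step(self, neuron_bits):
        """Consume one bit (0 or 1) from each input neuron, LSB first."""
        selected = sum(w for w, b in zip(self.weights, neuron_bits) if b)
        self.acc += selected << self.cycle  # bit b of a neuron contributes w * 2^b
        self.cycle += 1


def run_lane(neurons, weights, p):
    """Process p-bit neurons in exactly p cycles and return the inner product."""
    lane = BitSerialLane(weights)
    for bit in range(p):
        lane.step([(n >> bit) & 1 for n in neurons])
    return lane.acc


neurons = list(range(16))                           # unsigned values that fit in p = 4 bits
weights = [(-1) ** i * (i + 1) for i in range(16)]  # arbitrary bit-parallel weights
assert run_lane(neurons, weights, p=4) == sum(n * w for n, w in zip(neurons, weights))
```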
5 STUDY LIMITATIONS
We have demonstrated the potential performance improvements
and estimated the energy efficiency characteristics of STR. The
limitations of this study include: 1) we use approximate energy and area models; 2) we do not compare with an approach whose data lanes can compute different datatypes, such as one two-input 16-bit multiplication or two two-input 8-bit multiplications; 3) efficiently processing fully-connected layers requires a slight modification of the present design; and 4) support for signed neuron values requires additional hardware.
Future work will address these limitations. It is unlikely that a design that supports only some datatypes would be competitive since, as per the results of Section 2, few layers would benefit from 8-bit or 4-bit precisions. Furthermore, such a unit would have to incorporate functionality to support processing different numbers of output neurons on-the-fly. We further believe that the area and energy models we used are pessimistic: we estimated energy and area simply as the sum of the individual adders. Accordingly, we expect that further area and energy efficiency gains are to be had in an actual design.
REFERENCES
[1] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies
for accurate object detection and semantic segmentation,” CoRR, pp. 580–
587, 2014.
[2] A. Y. Hannun, et al., “Deep speech: Scaling up end-to-end speech
recognition,” CoRR, 2014.
[3] Y. Chen, et al., “DaDianNao: A machine-learning supercomputer,” in Proc.
47th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.
[4] P. Judd, J. Albericio, T. H. Hetherington, T. M. Aamodt, N. D. E. Jerger,
R. Urtasun, and A. Moshovos, “Reduced-precision strategies for bounded
memory in deep neural nets,” CoRR, vol. abs/1511.05236, 2015, http://
arxiv.org/abs/1511.05236
[5] S. Anwar, K. Hwang, and W. Sung, “Fixed point optimization of deep con-
volutional neural networks for object recognition,” in Proc. IEEE Int. Conf.
Acoust. Speech Signal Process., Apr. 2015, pp. 1131–1135.
[6] I. Buck, “NVIDIA’s Next-Gen Pascal GPU Architecture to Provide 10X
Speedup for Deep Learning Apps,” 2015. [Online]. Available: http://blogs.
nvidia.com/blog/2015/03/17/pascal/