Stripes: Bit-Serial Deep Neural Network Computing

Patrick Judd, Jorge Albericio, and Andreas Moshovos

The authors are with The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto, Toronto, ON M5S 3H7, Canada. E-mail: [email protected], [email protected], [email protected].
Manuscript received 10 Mar. 2016; accepted 7 Apr. 2016. Date of publication 1 Aug. 2016; date of current version 26 June 2017.
Digital Object Identifier no. 10.1109/LCA.2016.2597140

Abstract—The numerical representation precision required by the computations performed by Deep Neural Networks (DNNs) varies across networks and between layers of the same network. This observation motivates a precision-based approach to acceleration that takes into account both the computational structure and the required numerical precision representation. This work presents Stripes (STR), a hardware accelerator that uses bit-serial computations to improve energy efficiency and performance. Experimental measurements over a set of state-of-the-art DNNs for image classification show that STR improves performance over a state-of-the-art accelerator from 1.35× to 5.33× and by 2.24× on average. STR's area and power overhead are estimated at 5 percent and 12 percent respectively. STR is 2.00× more energy efficient than the baseline.

Index Terms—Hardware acceleration, deep learning, deep neural networks, convolution, numerical representation, serial computing

1 INTRODUCTION

Deep neural networks (DNNs) are the state-of-the-art technique in many recognition tasks, spanning object [1] to speech recognition [2]. Deep Neural Networks comprise a feed-forward arrangement of layers, each exhibiting high computational demands. They also exhibit a high degree of parallelism, which is commonly exploited with the use of Graphics Processing Units (GPUs). However, the high computation demands of DNNs and the need for higher energy efficiency motivated special-purpose architectures such as the state-of-the-art DaDianNao (DN) [3], whose power efficiency is up to 330× better than a GPU. As long as additional parallelism can be found, both DN and GPU performance can be improved by introducing additional compute units. However, improving performance requires a proportional increase in units and thus at least a proportional increase in power. As power tends to be the limiting factor in modern high-performance designs, it is desirable to achieve better energy efficiency and thus performance under given power constraints.

This work presents Stripes (STR), a DNN performance improvement technique that: 1) is complementary to existing techniques that exploit parallelism across computations, and 2) offers better energy efficiency. STR goes beyond parallelism across computations and exploits the data value representation requirements of DNNs. STR is motivated by recent work showing that the precision required by DNNs varies significantly not only across networks but also across the layers of the same network [4], [5]. Most existing implementations rely on a one-size-fits-all approach, using the worst-case numerical precision for all values. For example, most software implementations use 32-bit floating-point [6], [7], while accelerators and some recent GPUs use 16-bit fixed-point [3], [8], [9].

In STR, execution time scales linearly with the length of the numerical precision needed by each layer. We present STR as an extension to the state-of-the-art accelerator DN. Since DN uses a 16-bit fixed-point representation, STR would ideally improve performance at each layer by 16/p, where p is the layer's required precision length in bits. The idea behind STR is simple: use bit-serial computations and compensate for the increase in computation latency by exploiting the abundant parallelism offered by DNN layers. As an added benefit, using bit-serial computations eliminates the need for multipliers and enables additional energy/performance vs. accuracy trade-offs (e.g., on a battery-operated device a user may choose less accurate recognition in exchange for longer uptime). This work demonstrates that STR has the potential to improve performance with better energy scaling, and focuses solely on image classification, that is, recognizing the type of object depicted in an image.

Prior work has exploited reduced precision by turning off upper bit paths, improving energy but not performance [10]. STR is most similar to using serial multiplication on 2D convolution to improve energy [11]. However, STR targets 3D convolution and does not use a lookup table of precomputed results; since the filters in DNNs are much larger, this precomputation is intractable. Bit-serial computation has been used for neural networks in [12], but for a circuit with fixed synaptic weights.

The rest of the paper is organized as follows: Section 2 corroborates the per-layer precision requirement variability of DNNs, Section 3 reviews the DN design and presents Stripes, and Section 4 demonstrates STR's benefits experimentally. Finally, Section 5 summarizes the limitations of our study.

2 MOTIVATION: PRECISION VARIABILITY

Numerical precision requirements vary significantly across networks and across layers within a network [4], [5]. Table 1 reports the fixed-point representation needed for each convolutional layer to maintain the network's classification accuracy of the baseline 16-bit implementation. We extend the approach of Judd et al. [4] by considering a different fixed-point format and including three additional networks [13], [14]. The precision needed varies from as much as 14 bits (layer 1, GoogLeNet) to as little as 2 bits (layer 1, LeNet). We will focus on convolutional layers since they account for 90 percent of the processing time in DNNs [15].

DN's performance can be improved by 5.6× when scaling the system from 4 to 64 nodes, but with a 12.3× increase in power [3]. This result demonstrates that improving performance by exploiting cross-computational parallelism can result in a disproportionate increase in power. Since power is the main limiting factor in modern high-performance designs, once the power budget is reached, improving performance is only possible by increasing energy efficiency.

Stripes exploits this precision variability to do less work per neuron and thus improve energy efficiency. Compared to DN, which uses 16-bit neurons, Stripes incorporates units whose execution time is, ideally, p/16 of DN's when using a neuron representation of p bits. The Ideal column in Table 1 reports this ideal speedup over DN.
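To make the ideal-speedup arithmetic concrete, the short sketch below computes 16/p for each layer and a cycle-weighted overall figure. It is purely illustrative: the layer names, precisions, and baseline cycle counts are assumptions, not the values reported in Table 1; only the 16-bit baseline width is taken from DN.

```python
# Illustrative only: ideal speedup of bit-serial execution over a 16-bit
# bit-parallel baseline. Precisions and cycle counts are hypothetical,
# NOT the measured values of Table 1.
BASELINE_BITS = 16

layers = [
    # (layer name, required precision p in bits, baseline execution cycles)
    ("conv1", 10, 1_000_000),
    ("conv2", 8, 3_000_000),
    ("conv3", 12, 2_000_000),
]

baseline_cycles = sum(cycles for _, _, cycles in layers)
bit_serial_cycles = 0.0
for name, p, cycles in layers:
    print(f"{name}: p = {p:2d} bits -> ideal speedup {BASELINE_BITS / p:.2f}x")
    # Ideally, a bit-serial layer takes p/16 of its baseline execution time.
    bit_serial_cycles += cycles * p / BASELINE_BITS

print(f"overall ideal speedup: {baseline_cycles / bit_serial_cycles:.2f}x")
```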
3 STRIPES: A BIT-SERIAL DNN ACCELERATOR

This section first details the computation performed by convolutional layers and how the inner products can be transformed into a series of additions in a straightforward way. Second, we introduce the baseline system, the state-of-the-art DaDianNao accelerator [3]. Third, we present the Stripes accelerator, where the compute time of a particular layer is directly proportional to the corresponding required precision.

3.1 Bit-Serial Convolutional Layer Computation

The input to a convolutional layer is a 3D array of neurons. The layer applies N 3D filters using a constant stride S to produce an output 3D array of neurons. The input neuron array contains Nx × Ny × Ni real numbers, or neurons. The layer applies Nn filters,
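To spell out how such an inner product becomes a series of additions (the operation depicted in Fig. 1), write each p-bit neuron n_i through its binary expansion, with n_i^{(b)} denoting bit b of n_i (the notation here is ours, not the paper's):

\sum_i n_i \, w_i \;=\; \sum_i \Bigg( \sum_{b=0}^{p-1} n_i^{(b)} 2^b \Bigg) w_i \;=\; \sum_{b=0}^{p-1} 2^b \Bigg( \sum_i n_i^{(b)} w_i \Bigg).

Each of the p inner sums simply adds the weights whose corresponding neuron bit is one, so the whole inner product reduces to p rounds of additions followed by shifts, one round per bit of precision.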
TABLE 1
Per Convolutional Layer Neuron Precision Needed to Maintain Accuracy of the Baseline
Fig. 1. Bit-serial inner product.

3.3 Stripes

Fig. 3 shows an STR unit that: 1) offers the same computation throughput as a DN unit, and 2) needs p cycles to process a neuron represented in p bits. Since STR uses bit-serial computation, in the
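As a behavioral sketch of the p-cycle claim above, the following model of a single bit-serial lane is a speculative illustration, not the STR unit of Fig. 3 (which this excerpt does not reproduce): weights stay bit-parallel, one bit of each input neuron is consumed per cycle starting with the least significant bit, and after p cycles the accumulator holds the inner product. The 16-input width is chosen only for illustration.

```python
class BitSerialLane:
    """Behavioral sketch (not RTL) of a hypothetical bit-serial inner-product lane."""

    def __init__(self, weights):
        self.weights = list(weights)  # bit-parallel weights, one per input neuron
        self.acc = 0                  # shift-accumulated partial inner product
        self.cycle = 0                # neuron bit position consumed this cycle

    def step(self, neuron_bits):
        """Consume one bit (0 or 1) from each input neuron, LSB first."""
        selected = sum(w for w, b in zip(self.weights, neuron_bits) if b)
        self.acc += selected << self.cycle  # bit b of a neuron contributes w * 2^b
        self.cycle += 1


def run_lane(neurons, weights, p):
    """Process p-bit neurons in exactly p cycles and return the inner product."""
    lane = BitSerialLane(weights)
    for bit in range(p):
        lane.step([(n >> bit) & 1 for n in neurons])
    return lane.acc


neurons = list(range(16))                           # unsigned values that fit in p = 4 bits
weights = [(-1) ** i * (i + 1) for i in range(16)]  # arbitrary bit-parallel weights
assert run_lane(neurons, weights, p=4) == sum(n * w for n, w in zip(neurons, weights))
```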
5 STUDY LIMITATIONS
We have demonstrated the potential performance improvements
and estimated the energy efficiency characteristics of STR. The
limitations of this study include: 1) we use approximate energy and area models; 2) we do not compare with an approach whose data lanes can compute different datatypes, such as one two-input 16-bit multiplication or two two-input 8-bit multiplications; 3) efficiently processing fully-connected layers requires a slight modification of the present design; and 4) support for signed neuron values requires additional hardware.
Future work will address these limitations. It is unlikely that a design that supports only some datatypes would be competitive since, as per the results of Section 2, few layers would benefit from 8-bit or 4-bit precisions. Furthermore, such a unit would have to incorporate functionality to support processing different numbers of output neurons on-the-fly. We further believe that the area and energy models we used are pessimistic: we estimated energy and area simply as the sum of the individual adders. Accordingly, we expect that further area and energy efficiency gains are to be had in an actual design.
REFERENCES
[1] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies
for accurate object detection and semantic segmentation,” CoRR, pp. 580–
587, 2014.
[2] A. Y. Hannun, et al., “Deep speech: Scaling up end-to-end speech
recognition,” CoRR, 2014.
[3] Y. Chen, et al., “DaDianNao: A machine-learning supercomputer,” in Proc.
47th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.
[4] P. Judd, J. Albericio, T. H. Hetherington, T. M. Aamodt, N. D. E. Jerger,
R. Urtasun, and A. Moshovos, “Reduced-precision strategies for bounded
memory in deep neural nets,” CoRR, vol. abs/1511.05236, 2015, http://
arxiv.org/abs/1511.05236
[5] S. Anwar, K. Hwang, and W. Sung, “Fixed point optimization of deep con-
volutional neural networks for object recognition,” in Proc. IEEE Int. Conf.
Acoust. Speech Signal Process., Apr. 2015, pp. 1131–1135.
[6] I. Buck, “NVIDIA’s Next-Gen Pascal GPU Architecture to Provide 10X
Speedup for Deep Learning Apps,” 2015. [Online]. Available: http://blogs.
nvidia.com/blog/2015/03/17/pascal/