dimension (lines 2-4 in Listing 3), and accesses to the arrays (lines 11 and 12 in Listing 2) have been converted to the appropriate array offsets (lines 11 and 13 in Listing 3). Loop bounds have been generated from the arguments of the range function, and uses of the shape attribute have been fixed to the static size of the array. Finally, notice that pragmas have been placed in the locations required by each tool: HLS interface pragmas have been placed inside of the function, and SDSoC pragmas have been placed outside, just prior to the function definition.
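As a concrete illustration, an annotated input in the spirit of Listing 2 might look like the following sketch (the annotation spelling, names, and sizes here are illustrative, not the exact syntax of the listing):

    import numpy as np

    # Illustrative sPyC input: PEP 484-style hints carry the element type
    # and the static array dimensions that the C translation needs.
    def mmult(a: 'float32[32][32]', b: 'float32[32][32]',
              c: 'float32[32][32]') -> None:
        for i in range(32):                  # range() args become C loop bounds
            for j in range(32):
                acc = 0.0
                for k in range(a.shape[1]):  # shape reads fixed to static sizes
                    acc += a[i][k] * b[k][j]
                c[i][j] = acc

    a = np.ones((32, 32), dtype=np.float32)
    b = np.ones((32, 32), dtype=np.float32)
    c = np.zeros((32, 32), dtype=np.float32)
    mmult(a, b, c)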
C. Pyramid Implementation Flow

Once a Python function has been translated into C, it can be implemented in HDL using an HLS tool. The Pyramid tool generates scripts for SDSoC to drive HLS and generate a complete system design, including the drivers for transferring data and controlling the accelerator. In this flow there is no main function, so we use the SDSoC library flow to produce a static library. After accelerating a C function through HLS, SDSoC rewrites the source code to call a new driver function that interacts with the accelerator rather than the original C function.

SDSoC provides several features essential for this work: contiguous memory management, high-performance data movers and drivers, and full system generation. Memory allocators and data mover drivers are provided as a static library that is normally integrated into the final software binary during linking. Within this library are the memory management routines and data structures that manage the memory reserved via the Linux CMA. In the dynamic PYNQ environment, users need to be able to allocate memory without using SDSoC, and thus PYNQ provides a dynamic version of this normally static library. To integrate SDSoC-generated accelerators with this dynamic library, our flow diverges from the regular SDSoC flow: after a regular SDSoC build completes, we regenerate the binary without including SDSoC's static library. Later, when the Python module is loaded, the regular Linux shared library framework loads the dynamic library.

Pyramid can be called from the command line like:

$ pyramid -func foo my_app.py -source my_acc.cpp -header my_acc.h

Notice that users still specify the Python application source code and the accelerated function. This is used in the invocation of SDSoC to specify the top-level accelerator function whose implementation is in the file given by the -source <file.cpp> option. The primary output of Pyramid is a script that drives the SDSoC implementation flow; this is the same process that happens when an Eclipse-based IDE generates a makefile using CDT. Users manually execute the generated script after Pyramid completes. In the Python-based flow, the source and header files are generated by sPyC; in the C/HDL-based flow, users specify their own custom source files.

D. Pylon Wrapper Generator

The Python interpreter and run-time are written in C and can therefore be extended to execute custom user programs by linking against the run-time. However, using the Python C API requires writing another wrapper to interface custom C code with Python's internal class representation. Pylon generates this wrapper file automatically by analyzing the function argument and return type annotations.

Pylon can be called from the command line like:

$ pylon -func foo my_app.py

Pylon generates C code to interface the accelerator driver function with Python data types and classes. It also produces a script that executes the compilation step, producing a shared object. We use this capability to link in the custom driver code produced by SDSoC so that an accelerator can be used from Python. This shared object is loaded by the Python interpreter when the module containing the accelerator driver function is imported. When the function is executed in Python, the wrapper sets up input/output data pointers to the physical memory inside of the Python data structures and then calls the SDSoC-produced driver function to use the hardware accelerator.
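Hypothetically, using the generated module from Python is then just an import and a call (the module and function names below are illustrative):

    import numpy as np
    import my_app_hw   # stands in for the shared object built by Pylon's script

    a = np.ones((32, 32), dtype=np.float32)
    b = np.ones((32, 32), dtype=np.float32)
    c = np.zeros((32, 32), dtype=np.float32)

    # The import loads the shared object; the call runs the generated wrapper,
    # which hands physical data pointers to the SDSoC-produced driver.
    my_app_hw.foo(a, b, c)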
E. Pyrite Rewriter

Pyrite refactors the original Python code to use hardware accelerators. Given a Python module containing a function that interacts with an accelerator, users must modify their application code to import this module and call the function from the module at the appropriate location. They must also take care to choose appropriate allocators to initialize their arrays. The standard NumPy ndarray allocator creates arrays in virtual memory, which requires a Scatter-Gather DMA to transfer the data. PYNQ provides access to the SDSoC contiguous allocator, which is compatible with the Simple DMA and Zero-copy data movers. In Section IV-C we experimentally evaluate the trade-offs of using virtual versus contiguous memory with different DMAs.

Pyrite can be called from the command line like:

$ pyrite -func foo my_app.py

Pyrite refactors the user's original Python code to import the hardware module and replaces calls to the original Python function with calls to the driver function in the new module. It can also replace calls to ndarray constructors with calls to a helper function that uses the contiguous allocator; this feature is optional and experimental.
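A sketch of the rewrite, with the same illustrative module name as above (the helper name is also illustrative):

    import numpy as np
    import my_app_hw   # import inserted by Pyrite

    # Original code, kept as comments for contrast:
    #     a = np.ndarray(shape=(32, 32), dtype=np.float32)   # virtual memory
    #     foo(a, b, c)                                       # pure-Python foo
    # After Pyrite: the optional constructor rewrite allocates physically
    # contiguous buffers, and the call is redirected to the driver function.
    a = my_app_hw.contiguous_array(shape=(32, 32), dtype=np.float32)
    b = my_app_hw.contiguous_array(shape=(32, 32), dtype=np.float32)
    c = my_app_hw.contiguous_array(shape=(32, 32), dtype=np.float32)
    my_app_hw.foo(a, b, c)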
F. Current Capabilities & Future Work

We have developed the Hot & Spicy tool suite from scratch to bridge the gap in supporting applications written in the high-level Python language with FPGA accelerators. As such, we have initially focused on the most necessary support and features. In this section, we describe future extensions in three areas: Python-to-C translation, application-level static analyses, and application to additional domains outside the embedded space.

The sPyC translator currently supports a limited set of Python syntax; this can be extended with more development effort. Vivado HLS supports only a restricted subset of C/C++ for synthesis, which also restricts the capabilities of sPyC. Currently there is support for parsing a single Python file and accelerating a single function. In order to target real-world codebases, we will need to add support for multiple files, imported-module tracking, and accelerating multiple functions. Xilinx has recently released the xfOpenCV library [21], which provides HLS-suitable C code for many OpenCV functions. Integrating these into Python with our tools would require adding support for OpenCV's cv::Mat data structure in the generation of Python C API wrappers; currently, wrappers can only be generated for primitive arrays.

To generate high-performance systems, users will want to take advantage of SDSoC's ability to directly connect multiple accelerators together in a dataflow style, reducing the amount of data movement between the processor and the fabric. This will require application-level static analyses that perform dependency tracking to determine which functions can be directly connected. This is an advanced feature, but one that cannot be enabled without the strong foundation of tools presented in this work. Additionally, other analyses can be added to automatically insert the pragmas necessary for configuring interfaces. Similarly, our tools can support HLS pragma insertion for automatically accelerating functions.

Our work is highly dependent on system design tools from Xilinx, namely SDSoC in the embedded space targeting SoC-based FPGAs. There is currently nothing as mature as SDSoC with a C/C++ front end available from any other source, vendor or academic. As new tools emerge, our work can be applied to those as well. Another application of our tools is in supporting Python with the OpenCL-based vendor tools and PCIe platforms, which enable higher performance on larger devices and access to cloud-based applications, where Python is used more heavily than in embedded systems.
IV. EXPERIMENTS & RESULTS

While developing the Hot & Spicy tools, we created a suite of unit tests to validate each tool individually, as well as tests for the integration of the various tools into the Python-based and C/HDL-based flows. We present four experiments that evaluate the performance of the tools and the overheads they incur. First, we evaluate sPyC by analyzing the time it takes to translate various functions. Then we evaluate the overhead of using a dynamic library versus the default SDSoC static library. Next, we evaluate the overhead of calling an accelerator from Python versus calling it from a regular C program. Finally, we present the acceleration of a pure-Python Canny edge detection application using existing source code from widely available open-source resources.

We evaluated the applications by running through the implementation flow using the Hot & Spicy tool suite and the 2017.2 version of the Xilinx SDSoC tool (which includes 2017.2 versions of Vivado, HLS, and SDK). We targeted the PYNQ board, which has a Zynq-7000 7020 dual ARM Cortex-A9 processor with integrated FPGA fabric and a processor clock of 650 MHz. We ran the accelerators at 100 MHz for simplicity, even though many of the cores can achieve higher performance; we leave system customization to maximize performance to future work. We used a modified version of the PYNQ 2.0 Ubuntu image with the SDSoC library updated from version 2016.1 to version 2017.2 to be compatible with the latest tools.
A. Evaluation of sPyC Translations

To evaluate the performance of sPyC in translating Python to C, we used a series of functions to gauge how capable the tool is of supporting real-world code. We translated an empty function (with no body), a simple for loop, a set of conditional statements, a matrix multiplication function with triply nested for loops, the Canny edge detection code, and artificial functions with bodies containing from 1,000 to 10,000 lines of code. Table I shows the results of translating these functions to C using sPyC on a PC with a 2.2 GHz Intel Xeon E5-2630 and 512 GB of memory (note that only up to 16 GB of memory was ever used). For each test, we recorded the number of lines of code in the entire source file, not just the function that was translated, since the whole file is parsed by the tool. We also recorded the number of lines of C code in the output file and the time required both for sPyC to translate Python to C and for Vivado HLS to synthesize C to HDL. As the table shows, sPyC runs much faster than Vivado HLS and can still handle large, real-world functions.

Table I: Results of sPyC source-to-source translation tests

Test Name  | Lines of Code (Python) | Lines of Code (C) | Translate time [seconds] | HLS time [seconds]
Empty      | 3      | 7      | 0.066 | 56.47
For        | 4      | 9      | 0.073 | 55.17
IfElifElse | 7      | 14     | 0.064 | 53.69
MMult      | 61     | 31     | 0.070 | 67.07
Canny      | 250    | 107    | 0.078 | 249.15
Huge1k     | 1,000  | 1,503  | 0.111 | 57.90
Huge10k    | 10,000 | 15,003 | 0.554 | 151.16
B. Evaluation of a Centralized Library

At runtime, Python buffers are allocated on the fly and are accessed by SDSoC's DMA drivers. This required making the SDSoC library a dynamic shared object rather than a static library. The caveat of doing so is the potential additional overhead of each call into the library. We evaluated this overhead by executing various C/C++ programs and collecting the time it took to call into the library.

In this experiment, we used a 32x32 floating-point matrix multiplication application with the same source code across all tests. We varied pragmas to direct SDSoC to generate systems with different data movers, coherency, and memory management in order to exercise many of the library functions for completeness. We collected the time it took to call into the library and set up the DMA to transfer data. For each test, we compiled two executables: one statically linked, and one linked against the dynamic shared library. Each test executes the accelerator 1,024 times, and we averaged the runtimes for each function. Table II shows the characteristics of each test and the additional overhead of the dynamic library.

Table II: Overhead of using a dynamic versus static library on DMA setup time

Test Name       | DMA Type       | Coherency    | Contiguity     | Scenario       | Static Setup Time [µs] | Dynamic Setup Time [µs] | Overhead [µs]
mmult           | Simple         | Coherent     | Contiguous     | Standard       | 1.70   | 1.91   | 0.21
mmult_sg        | Scatter-Gather | Coherent     | Contiguous     | Standard       | 8.06   | 8.40   | 0.34
mmult_hp        | Simple         | Non-Coherent | Contiguous     | Cache Flush    | 18.01  | 19.20  | 1.19
mmult_sg_hp     | Scatter-Gather | Non-Coherent | Contiguous     | Cache Flush    | 22.47  | 23.20  | 0.73
mmult_malloc    | Scatter-Gather | Coherent     | Non-Contiguous | Page Pinning   | 198.16 | 201.77 | 3.61
mmult_hp_malloc | Scatter-Gather | Non-Coherent | Non-Contiguous | Page Pin+Flush | 217.83 | 218.86 | 1.03

Notice that every test requires more time to set up the DMA through the dynamic library than through the static library. Overall, these overheads are very reasonable for the benefit of using a dynamic library, since the overhead does not depend on the amount of data being transferred. Also notice that the DMA setup time varies widely depending on which DMA is used, how memory is allocated, and whether the DMA is connected to a coherent port. Users will need to keep this in mind when determining which pragmas to use in their code.
C. Evaluation of Python Overheads

Calling C/C++ routines from Python incurs additional overhead compared to calling the same routines from a C/C++ application. This experiment used the C/HDL-based flow, where we modified the custom C accelerator code with different pragmas while the Python code remained unchanged. We analyzed these overheads for the same matrix multiplication applications described previously, but at a much coarser level: in this experiment, we capture the time it takes to call the accelerator function as a whole. This includes setting up the DMAs, starting the accelerator, and waiting for it to complete, using the dynamic library. Table III shows the amount of time it took to call an accelerator from Python versus from C/C++. These tests use the exact same bitstream and drivers analyzed in the previous experiment with the dynamic library.

Table III: Overhead of calling an accelerator from Python for different DMA scenarios and their relative performance

Test Name       | C [µs] | Python [µs] | Overhead [µs] | Relative Perf.
mmult           | 33.48  | 56.00  | 22.52 | 1.0x
mmult_sg        | 62.41  | 104.00 | 41.59 | 1.8x
mmult_hp        | 71.41  | 107.00 | 35.59 | 2.1x
mmult_sg_hp     | 104.97 | 152.00 | 47.03 | 3.1x
mmult_malloc    | 613.74 | 673.00 | 59.26 | 18.3x
mmult_hp_malloc | 661.07 | 709.00 | 47.93 | 19.7x

In each of these tests the same accelerator is used, but each has a different data transfer configuration. Using a Scatter-Gather DMA (mmult_sg) versus a Simple DMA (mmult) results in 1.8x longer execution of the function. Forgoing coherency (mmult_hp) results in 2.1x longer execution, and using non-contiguous memory (mmult_malloc) results in 18.3x longer execution. Yet across these tests, the overhead of calling the function from Python averages only 42.3 microseconds. This is because data is not copied from the Python domain to the C/C++ domain; instead, the same memory is accessed and only addresses are passed.
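This whole-call time is straightforward to measure from the host. A sketch of such a measurement, reusing the illustrative my_app_hw module from the Pylon example and averaging over 1,024 calls as in the previous experiment:

    import time
    import numpy as np
    import my_app_hw   # illustrative accelerator module

    a = np.ones((32, 32), dtype=np.float32)
    b = np.ones((32, 32), dtype=np.float32)
    c = np.zeros((32, 32), dtype=np.float32)

    N = 1024
    t0 = time.perf_counter()
    for _ in range(N):
        my_app_hw.foo(a, b, c)   # whole call: DMA setup, start, wait
    mean_us = (time.perf_counter() - t0) / N * 1e6
    print(f'mean Python-side call time: {mean_us:.2f} us')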
These results show large performance benefits from transferring contiguous memory rather than virtual memory. Another important factor is the performance of the allocation itself. Allocating a 4 KB buffer in Python using the non-contiguous allocator takes approximately 32 microseconds, compared to 283 microseconds for a contiguous buffer. Even with this additional latency, in many cases higher performance is achieved using contiguous memory by carefully managing allocations and reusing buffers.
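Such a comparison can be scripted in a few lines. The sketch below assumes the PYNQ 2.x Xlnk interface to the contiguous allocator; absolute timings will vary by platform:

    import timeit
    import numpy as np
    from pynq import Xlnk   # assumed PYNQ 2.x allocator API

    xlnk = Xlnk()
    reps = 100

    # 4 KB buffers: 1,024 float32 elements each.
    t_virtual = timeit.timeit(
        lambda: np.zeros(1024, dtype=np.float32), number=reps) / reps
    t_contig = timeit.timeit(
        lambda: xlnk.cma_array(shape=(1024,), dtype=np.float32),
        number=reps) / reps
    print(f'virtual: {t_virtual * 1e6:.0f} us, contiguous: {t_contig * 1e6:.0f} us')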
D. Canny Edge Detection

For this experiment, we sought to accelerate a complex real-world application in Python, with the goal of using an existing implementation written in the style of Python rather than C/C++ rewritten in Python. To evaluate our tools, we chose an open-source Canny edge detection application designed and written in Python [15]. First, we took the unmodified source code and tweaked it to use syntax supported by sPyC; this required modifying 11 of its 279 lines of code, adding the typing annotations described in Section III-B. We then attempted to implement the function as-is via HLS, but placement of the design in the small 7020 device failed because the algorithm as written used too many copies of the current frame, which consumed too many BRAMs. To make the design fit, we fused the various loops together to operate on line buffers rather than whole frames. Finally, we added directives to improve performance.
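The restructuring follows the classic line-buffer pattern. The sketch below is our simplification (a single 3x3 stencil rather than the full Canny pipeline): only the three most recent rows are kept live instead of whole intermediate frames, which is what eliminated the BRAM overuse:

    import numpy as np

    def filter3x3(frame: 'float32[240][320]', kernel: 'float32[3][3]',
                  out: 'float32[240][320]') -> None:
        # Line buffer: 3 rows instead of a full 240x320 intermediate frame.
        rows = np.zeros((3, 320), dtype=np.float32)
        for y in range(240):
            rows[0] = rows[1]        # shift the window down one row
            rows[1] = rows[2]
            rows[2] = frame[y]
            if y < 2:
                continue             # window not yet full
            for x in range(1, 319):
                acc = 0.0
                for ky in range(3):
                    for kx in range(3):
                        acc += kernel[ky][kx] * rows[ky][x - 1 + kx]
                out[y - 1][x] = acc  # border pixels are left untouched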
Table IV shows the performance results of running the Canny edge detection algorithm on the PYNQ board for a 320x240 image. Original Python is the unmodified open-source implementation, and Refactored Python is the version modified to meet the sPyC syntax and refactored to fit on the device, but executed in software. Unoptimized HLS uses the refactored code implemented in hardware with no performance pragmas (pipeline, unroll, etc.). The Pipelined HLS version has some loops pipelined, and Partitioned HLS has both pipelining and some arrays partitioned. We also implemented a version of the application that used the OpenCV library function in software for comparison.

Table IV: Canny edge detection performance results

Test Name         | Performance | Speedup
Original Python   | 48.14 sec   | 1.0x
Refactored Python | 139.28 sec  | 0.3x
Unoptimized HLS   | 58.68 ms    | 820.0x
Pipelined HLS     | 12.22 ms    | 3,939.0x
Partitioned HLS   | 1.23 ms     | 39,137.0x
OpenCV            | 7.19 ms     | 6,695.0x

These results show that modifying source code to perform well in HLS is normally counter-intuitive to improving software performance: the refactored code ran significantly slower in Python. Yet implementing that refactored code in hardware gave an 820x speedup over the original Python implementation. After adding some HLS pragmas (pipeline, array partition, etc.) without modifying the algorithm, we achieved a 39,137x speedup, which was also 6x faster than the OpenCV version. These results show that it is possible, from Python, to manipulate an algorithm to achieve performance on an FPGA using our tools.

Table V shows the post-place-and-route resource results of implementing the various systems. The Original Python design failed to implement because too many BRAMs could not be placed on the device (151.5 required of the 140 available), and as such its results are post-synthesis only. After refactoring the code, the Unoptimized HLS system was implemented successfully. Notice that the Pipelined HLS example uses fewer BRAMs than the Unoptimized HLS: some arrays were implemented in flip-flops instead of BRAMs, and as a result the pipelined design used 5,415 more flip-flops. The final Partitioned HLS design used more BRAMs as arrays were partitioned, and more LUTs and flip-flops were used to generate the muxes interfacing all of these memories. In addition, the memory partitioning allowed more DSPs to be fed.

Table V: Canny edge detection resource results

Test Name        | LUTs [53,200] | FFs [106,400] | DSPs [220] | 36k BRAMs [140]
Original Python* | 23,246 | 31,076 | 108 | 151.5
Unoptimized HLS  | 11,722 | 15,596 | 20  | 27.5
Pipelined HLS    | 17,061 | 21,011 | 20  | 26.5
Partitioned HLS  | 18,250 | 22,356 | 30  | 28.5
*Note: Original Python results are post-synthesis, as the design failed placement.

V. CONCLUSION

We presented Hot & Spicy, a tool suite for integrating FPGA accelerators into Python applications. We evaluated the capabilities of the tools and showed that the overheads of accessing accelerators from Python are minimal. In this paper, we evaluated the tools by directly accelerating Python code for a Canny edge detection application and achieved a 39,137x speedup over the initial software design and a 6x speedup compared to a high-performance, hand-tuned OpenCV implementation. In the future, we will evaluate integrating existing C/C++ HLS accelerators, such as the xfOpenCV library, and existing RTL into Python applications.

The tools are open source and available online at https://spicy.isi.edu, with all of the examples and tests mentioned in this work.
ACKNOWLEDGMENT

This work was supported by the National Aeronautics and Space Administration (NASA) under grant 80NSSC17K0286. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of NASA or the U.S. Government.

The authors would like to thank the Xilinx University Program for their donation of boards and licenses. Thanks to both the PYNQ and SDSoC teams at Xilinx for their support.
REFERENCES

[1] D. Beazley. SWIG: An Easy to Use Tool for Integrating Scripting Languages with C and C++. USENIX Tcl/Tk Workshop, July 1996.

[2] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(31), 2011.

[3] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Czajkowski. LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Systems. International Symposium on Field Programmable Gate Arrays, Feb. 2011.

[4] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 2011.

[12] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui, and J. Carrillo. SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2016.

[13] E. Logaras, O. G. Hazapis, and E. S. Manolakos. Python to Accelerate Embedded SoC Design: A Case Study for Systems Biology. ACM Transactions on Embedded Computer Systems, 13(4), Mar. 2014.

[14] A. Rigo and S. Pedroni. PyPy's Approach to Virtual Machine Construction. Computational Science & Discovery, Oct. 2006.

[15] Rosetta Code. Canny Edge Detector. https://rosettacode.org/wiki/Canny_edge_detector#Python.

[16] D. Sheffield, M. Anderson, and K. Keutzer. Automatic Generation of Application-specific Accelerators for FPGAs from Python Loop Nests. International Conference on Field Programmable Logic, Oct. 2012.

[17] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2), 2011.

[18] G. van Rossum, J. Lehtosalo, and Ł. Langa. PEP 484 – Type Hints. Index of Python Enhancement Proposals, Sept. 2014. https://www.python.org/dev/peps/pep-0484/.

[19] Xilinx. DS083: Virtex-II Pro and Virtex-II Pro X Platform FPGAs. Jan. 2002.

[20] Xilinx. PYNQ: Python Productivity for Zynq. 2017. http://www.pynq.io/.

[21] Xilinx. Xilinx xfOpenCV Library. June 2017. https://github.com/Xilinx/xfopencv.