
Hot & Spicy: Improving Productivity with Python and HLS for FPGAs

Sam Skalicky, Joshua Monson, Andrew Schmidt, Matthew French


Information Sciences Institute, University of Southern California, Arlington, VA
{skalicky,jmonson,aschmidt,mfrench}@isi.edu

Abstract—We present Hot & Spicy, an open-source infrastructure and tool suite for integrating FPGA accelerators in Python applications, provided entirely as Python source code and available at https://spicy.isi.edu. This suite of tools eases the packaging, integration, and binding of accelerators and their C/C++-based drivers so they are callable from a Python application. The Hot & Spicy tools can: (1) translate Python functions to HLS-suitable C functions, (2) generate Python C wrapper bindings, (3) automate the FPGA EDA tool flow, and (4) retarget Python source code to use accelerated libraries. For FPGA experts, this enables increased productivity and supports research on each stage of the flow by providing a framework in which to integrate additional compilers and optimizations. For everyone else, it enables fast, consistent acceleration of applications on FPGAs. We describe the design principles and flows for supporting high-level Python abstractions in an FPGA development flow. Then we evaluate the overheads of calling C/C++ routines from Python. Lastly, we show the results of accelerating a kernel in a Python image processing application, achieving a 39,137x speedup over the original Python implementation and a 6x speedup over a high-performance, hand-optimized OpenCV library implementation.

[Figure 1: Approaches for Python app implementation]

I. INTRODUCTION

Since the introduction of the Virtex II Pro with PowerPC processors 15 years ago [19], FPGAs have evolved into more sophisticated heterogeneous processing devices with full ARM-based processor subsystems, hardened floating-point units, and memory controllers. The FPGA community has been down this road before, when first incorporating embedded processors into FPGA devices. This time, however, FPGAs with ARM processors are capable of booting full Linux systems without first requiring the FPGA logic to be programmed. This enables developers to first run software applications on an FPGA, written in C or, more recently, Python, without writing any HDL. Moreover, tools and frameworks now more reliably exist to provide high-level synthesis (HLS) support to aid in the creation of accelerators from higher-level languages [3][4]. As a result, these devices are attracting the attention of non-traditional FPGA developers: software developers. These HLS tools still rely on C/C++, which is suitable for conventional embedded platforms, but they do not yet support more modern programming languages.

In this work, we consider a use case that is increasing in interest and popularity: Python on FPGAs. Since 2014, Python has grown in popularity from #4 on IEEE Spectrum's Programming Language Survey to #1 in 2017 [7]. In addition, in 2017 Xilinx released PYNQ [20] as an open-source project to provide Python productivity on Zynq devices. While PYNQ has been a catalyst for this work, one limitation that became apparent is the lack of Python HLS support. Specifically, if an application exists in Python, someone must manually profile it and identify suitable functions for acceleration, then either use conventional FPGA development tool flows or convert the application to C/C++ and use modern HLS tools to perform the FPGA implementation, as illustrated in the top two approaches in Figure 1.

Instead, this work simplifies the translation process by leveraging the decades of research in HLS [3][4] and more recent open-source projects for Python development on Zynq (PYNQ), building higher-level productivity and run-time tools for Python applications. Our approach, shown at the bottom of Figure 1, provides an end-to-end development flow that automates the acceleration of existing Python applications on FPGAs. Ultimately, we are asking a simple question: "Can we support accelerating portions of a pure Python application in FPGAs?" This includes support not only for new FPGA users, but also for advanced users. The development flow needs to extend from Python to the bitstream for acceleration, while also integrating the bitstream, drivers, and runtime support so the accelerator can be used from Python.

We have developed custom tools to quickly and easily integrate FPGA-based accelerators into Python applications for internal use, which we are releasing to the research community as an open-source project.
One of our goals is to increase the adoption of FPGAs for acceleration and improve results for non-experts. Another goal is to provide concepts and design principles to support higher-level abstractions in an FPGA development flow, enabling researchers to focus on their contributions rather than on infrastructure development.

Our design flow is made up of four independent tools that together form a comprehensive design flow. The Synthesis of Python-to-C (sPyC, pronounced "spicy") tool supports source-to-source translation of functions from Python to HLS-suitable C/C++. The Python Linker (Pylon) tool generates wrapper bindings that link the Python application to C/C++ accelerator drivers. The Python rewriter (Pyrite) tool refactors the original source code to use the accelerators via the generated wrappers. The Pyramid tool produces scripts to drive the EDA tool flow.

Our work is architected such that these tools work together seamlessly to support a simple flow that can accelerate functions from a Python application. They also support advanced use cases, such as integrating an existing accelerator written in C/C++ (via HLS) or custom RTL into a Python application. This enables users to get something working quickly and to easily integrate existing EDA flows.

The contributions of this work are:

1) An open-source Python to HLS-suitable C translator
2) Open-source tools to support integrating FPGA-based accelerators into Python applications
3) An analysis of the overheads of using Python to control FPGA-based accelerators compared to C/C++
4) Experimental results showing realizable speedups for a Python-implemented algorithm accelerated via source-to-source translation to C and implemented via HLS

The rest of this paper is organized as follows. Section II discusses the relevant related work. Section III presents an overview of the flow and describes the tools. Section IV presents the experiments and results. Section V summarizes our contributions.

II. RELATED WORK

In this work, we present a suite of tools to integrate support for FPGAs in Python. We support the following operations: translating Python to C, automating the EDA flow, and generating Python C API wrapper bindings. There are many existing tools for translating Python code to C; however, none of them produce HLS-suitable C code. Cython [2] translates Python code to C but still makes calls into the CPython interpreter and standard libraries. Shed Skin [8] is experimental, only translates a restricted subset of Python to C++, and does not support the NumPy [17] package. Numba [5] compiles Python directly to machine code with its own LLVM-based compiler rather than producing C/C++ code. Pythran [10] also translates Python to C++ but uses other high-performance libraries (OpenMP, Boost, STL) in the background for speed. Lastly, PyPy [14] has a JIT compiler that translates Python into C code, but it still runs in an interpreted fashion and cannot generate complete C files.

There are also tools for generating Python C API wrapper bindings; the most widely used is SWIG [1], which generates STL arrays and vectors for Python lists and arrays. Current HLS tools, such as the Vivado HLS tool used in this work, do not support STL types on function interfaces.

Although there are a few examples of HLS support for Python, they are not suitable for translating pure Python to HDL. The Polyphony tool [11] requires writing a stylized, hardware-suitable Python syntax. Another Python compiler [22] was designed to synthesize to a custom multi-core processor architecture and execute custom microinstructions rather than traditional hardware logic. The Three Fingered Jack tool [16] transforms a restricted set of Python loop nests to HDL but only supports a 32-bit data width. None of these tools support user control of interface types (AXI, FIFO, BRAM, etc.), making it difficult to integrate the generated HDL into complete hardware systems. MyHDL allows hardware description in the Python language [6], but it is just another HDL (like VHDL or Verilog) and is not suitable for migrating a regular Python application to hardware.

High-level synthesis (HLS) tools bridge a very complex gap from high-level languages to low-level hardware description. Implementing a quality HLS tool that produces production-grade HDL is not an easy task, and supporting the complexities of an entire language is a long process. The Vivado HLS tool is one example of the best currently available, and yet it is only capable of synthesizing a subset of the C/C++ language to high-quality HDL. In this paper, we refer to this subset of the C/C++ language as "HLS-suitable C". Taking advantage of these capabilities is the motivation for the sPyC translator tool.

Currently there are no FPGA-based design tools that support applications written in the Python language. SysPy [13] generates scripts to simplify hardware development using Python as the scripting language; it does not add any support for Python-based applications. The PYNQ [20] project provides a regular Linux experience on FPGA-based SoCs, in addition to libraries and APIs that are available from Python. The PYNQ overlay methodology presents a design pattern for using FPGA accelerators in Python, but there are no tools that automate any of the manual and error-prone steps necessary to design an overlay or integrate it into an application. The Software-Defined System-on-Chip (SDSoC) Development Environment [12] provides tools and automation to ease the design of heterogeneous hardware/software systems by automating the hardware system creation and driver interfacing of accelerators for C/C++ applications, but it has no Python language front-end. There is currently no integration between PYNQ and SDSoC. In this paper, we integrate PYNQ and SDSoC using our suite of tools.

III. THE HOT & SPICY APPROACH

The Hot & Spicy tool suite is designed to support accelerating portions of a Python application in FPGAs. In any application, certain functions represent the critical path or bottleneck, and accelerating these functions provides the most benefit to overall application performance. Given an application written entirely in Python, our tools support integrating accelerators into that application.

[Figure 2: Implementation flows using the Hot & Spicy tools.]
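The flows in Figure 2 chain the four tools in sequence. As a sketch only — the command-line shapes follow the per-tool usage examples given later in this section, and the intermediate file names handed to Pyramid are our assumptions, since in the Python-based flow sPyC generates them — the Python-based flow amounts to:

```python
# Sketch of the Python-based flow as a pipeline of the four tools.
# The foo.cpp/foo.h names passed to Pyramid are assumed, not specified
# by the paper; in the real flow sPyC produces these files.
app, func = 'my_app.py', 'foo'
steps = [
    ['spyc', app, '-func', func],                        # Python -> HLS-suitable C
    ['pyramid', '-func', func, app,
     '-source', f'{func}.cpp', '-header', f'{func}.h'],  # drive the SDSoC/EDA flow
    ['pylon', '-func', func, app],                       # generate Python C API wrapper
    ['pyrite', '-func', func, app],                      # rewrite app to use the accelerator
]
for cmd in steps:
    print(' '.join(cmd))
```

Each step consumes the outputs of the previous one; in the C/HDL-based flow the first step is simply skipped.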
A. Supported Flows

Our tools support various design flows for integrating accelerators into a Python application. These accelerators can be implemented as Python functions, C/C++ functions, or HDL IPs. Figure 2 shows a high-level view of the tools used in the various flows.

The Python-based flow operates by translating a single Python function to C so that a high-level synthesis (HLS) tool can generate an accelerator IP. Then, scripts to drive the vendor tool implementation flow are produced, the design is implemented, and a bitstream is generated. Next, the Python C API wrapper is generated to call the accelerator driver function in C; this wrapper is compiled to produce a Python module that can be imported into a user's application. Finally, the original application is rewritten to use the accelerator by importing the new module.

The C/HDL-based flow operates very similarly to the Python-based flow, except that no Python code is translated to C. Instead, users provide existing C/C++ code destined for synthesis with an HLS tool, or custom HDL code. Then the vendor tools perform implementation and generate the bitstream. Our tools still generate the Python C API wrapper, and a Python module is produced. In this flow, the sPyC translator is not used.

B. sPyC Translator

Just as HLS tools increased productivity for HDL developers by transforming C implementations to HDL, our Python-to-C source-to-source translator, sPyC, improves productivity for hardware developers looking to accelerate functions written in Python with HLS. sPyC generates synthesizable C/C++ code targeting the Vivado HLS tool. The goal when implementing sPyC was to translate pure Python syntax, with the guiding principle that any sPyC-specific syntax requirement must not modify the original behavior of the Python code. Python provides a built-in ast package that gives access to the same routines used by the Python interpreter to parse source code. We use this module as our front-end to extract the abstract syntax tree (AST) of the program, so that we can guarantee we are not adding any custom syntax.

In addition to regular Python syntax, sPyC requires that users provide type annotations on function arguments and return values [18]. There is some type inference support for primitive variables, but users can also define specific types using variable annotations [9]. These annotations are supported in Python versions 3.6 and up. Listing 1 shows an example matrix multiplication function implemented in the traditional triple-for-loop style, and Listing 2 shows the same code with annotations to specify types (lines 1-3, 9) and docstrings for pragmas (lines 4-5, 8). sPyC supports a subset of Python syntax inside the function being translated. However, this does not restrict the rest of the user's application; all Python syntax is supported in the rest of the source file.

The supported Python syntax within the accelerated function still includes all the structures necessary for complex computation. It supports the arithmetic operators +, -, *, /, and %, the conditionals if, elif, and else, both for and while loops, and the built-in range() function. Variables must be primitive types like char, short, int, long, float, and double, or NumPy ndarrays [17]. Additionally, the shape attribute on ndarrays is supported so that it can be used in expressions. NumPy arithmetic functions like round, power, arctan, and sqrt are supported by calling the appropriate math.h function. Calling custom user-defined functions in the function body is supported by also translating these sub-functions. There are no fundamental technical limitations behind supporting only what is described above; rather, these are the currently supported features, and development is on-going. Section III-F describes future work goals.

sPyC can be called from the command line like:

  $ spyc my_app.py -func foo

This calls the translator to process the file my_app.py and translate the function foo from Python to C. Listing 3 shows the result of translating the matrix multiplication code in Listing 2 to HLS-suitable C code.

Listing 1: Original Python code

 1: def mmult(a,
 2:           b,
 3:           c):
 4:     for i in range(a.shape[0]):
 5:         for j in range(a.shape[1]):
 6:             term = 0
 7:             for k in range(b.shape[0]):
 8:                 term += a[i][k] * b[k][j]
 9:             c[i][j] = term

Listing 2: Annotated Python code

 1: def mmult(a:np.ndarray((2,2),'int32'),
 2:           b:np.ndarray((2,2),'int32'),
 3:           c:np.ndarray((2,2),'int32')):
 4:     '''#pragma SDS access_pattern(a:SEQUENTIAL)
 5:     #pragma HLS interface axis port=b'''
 6:     for i in range(a.shape[0]):
 7:         for j in range(a.shape[1]):
 8:             '''#pragma HLS unroll'''
 9:             term:'int32' = 0
10:             for k in range(b.shape[0]):
11:                 term += a[i][k] * b[k][j]
12:             c[i][j] = term

Listing 3: Generated C code

 1: #pragma SDS access_pattern(a:SEQUENTIAL)
 2: void mmult(int32_t a[4],
 3:            int32_t b[4],
 4:            int32_t c[4]) {
 5:   #pragma HLS interface axis port=b
 6:   for(int i=0; i<2; i+=1) {
 7:     for(int j=0; j<2; j+=1) {
 8:       #pragma HLS unroll
 9:       int32_t term=0;
10:       for(int k=0; k<2; k+=1) {
11:         term += a[i*2+k] * b[k*2+j];
12:       }
13:       c[i*2+j] = term;
14:     }
15:   }
16: }
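Because sPyC's front-end is Python's built-in ast module, the annotations in Listing 2 are ordinary Python and can be read straight off the parse tree. A minimal sketch of that front-end step (illustrative only, not sPyC's actual code):

```python
import ast
import textwrap

# The annotated function from Listing 2 (pragma docstrings omitted
# for brevity); it parses as plain Python 3.6+ syntax.
SRC = textwrap.dedent("""
    def mmult(a: np.ndarray((2,2),'int32'),
              b: np.ndarray((2,2),'int32'),
              c: np.ndarray((2,2),'int32')):
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                term: 'int32' = 0
                for k in range(b.shape[0]):
                    term += a[i][k] * b[k][j]
                c[i][j] = term
""")

def arg_annotations(source, name):
    """Return {argument: annotation source text} for function `name` --
    the shape/type information a Python-to-C translator needs before
    it can emit typed C declarations like those in Listing 3."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return {a.arg: ast.unparse(a.annotation)
                    for a in node.args.args if a.annotation}
    return {}

anns = arg_annotations(SRC, 'mmult')
print(anns['a'])  # np.ndarray((2, 2), 'int32')
```

Because the annotations are legal Python, the annotated function still runs unchanged under the standard interpreter, which is exactly the behavior-preservation guarantee described above.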

Notice that the dimensionality of the arrays on the function arguments (lines 1-3 in Listing 2) has been transformed into a single dimension (lines 2-4 in Listing 3), and accesses to the arrays (lines 11, 12 in Listing 2) have been converted to the appropriate array offsets (lines 11, 13 in Listing 3). Also, the loop bounds have been generated based on the arguments of the range function. Uses of the shape attribute have been fixed to the static size of the array. Finally, notice that pragmas have been placed in their appropriate locations as required by each tool: HLS interface pragmas have been placed inside the function, and SDSoC pragmas have been placed outside, just prior to the function definition.

C. Pyramid Implementation Flow

Once a Python function has been translated into C, it can be implemented in HDL using an HLS tool. The Pyramid tool generates scripts for SDSoC to drive HLS and generate a complete system design, including the drivers for transferring data and controlling the accelerator. In this flow there is no main function, so we use the SDSoC library flow to produce a static library. After accelerating a C function through HLS, SDSoC rewrites the source code to call a new driver function that interacts with the accelerator, rather than the original C function.

SDSoC provides some important features necessary for this work: contiguous memory management, high-performance data movers and drivers, and full system generation. Memory allocators and data mover drivers are provided as a static library that is normally integrated into the final software binary during linking. Within this library are memory management routines and data structures that manage the memory reserved via the Linux CMA. In the dynamic PYNQ environment, users need to be able to allocate memory without using SDSoC, and thus PYNQ provides a dynamic version of this normally static library. In order to integrate SDSoC-generated accelerators with this dynamic library, our flow diverges from the regular SDSoC flow: after a regular SDSoC build completes, we regenerate the binary without including SDSoC's static library. Later, when the Python module is loaded, the regular Linux shared library framework loads the dynamic library.

Pyramid can be called from the command line like:

  $ pyramid -func foo my_app.py -source my_acc.cpp -header my_acc.h

Notice that users still specify the Python application source code and the accelerated function. This is used in the invocation of SDSoC to specify the top-level accelerator function, whose implementation is in the file given by the -source <file.cpp> option. The primary output of Pyramid is a script that drives the SDSoC implementation flow; this is the same process that happens when an Eclipse-based IDE generates a makefile using CDT. Users manually execute the generated script after Pyramid completes. In the Python-based flow, the source and header files are generated by sPyC. In the C/HDL-based flow, users specify their own custom source files.

D. Pylon Wrapper Generator

The Python interpreter and run-time are written in C and can therefore be extended to execute custom user programs that link against the run-time. However, using the Python C API requires writing another wrapper to interface custom C code with Python's internal class representation. Pylon generates this wrapper file automatically by analyzing the function argument and return type annotations.

Pylon can be called from the command line like:

  $ pylon -func foo my_app.py

Pylon generates C code to interface the accelerator driver function with Python data types and classes. It also produces a script to execute the compilation step, producing a shared object. We use this capability to link in the custom driver code produced by SDSoC so that an accelerator can be used from Python. This shared object is loaded by the Python interpreter when importing the module containing the accelerator driver function. When the function is executed in Python, this wrapper sets up input/output data pointers to the physical memory inside the Python data structures and then calls the SDSoC-produced driver function to use the hardware accelerator.
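The information Pylon works from — argument names, element types, and array sizes taken from the annotations — is enough to reconstruct the C prototype it must bind to. A toy illustration of that mapping (the type table here is ours, not Pylon's actual implementation):

```python
def c_prototype(name, args):
    """Build a C function prototype from (arg, element_count,
    element_type) triples, mirroring how Listing 2's annotations
    correspond to Listing 3's signature. The Python-to-C type
    table below is illustrative only."""
    ctype = {'int32': 'int32_t', 'int64': 'int64_t',
             'float32': 'float', 'float64': 'double'}
    params = ', '.join(f'{ctype[t]} {a}[{n}]' for a, n, t in args)
    return f'void {name}({params});'

# The 2x2 int32 arrays of Listing 2 flatten to 4-element C arrays.
proto = c_prototype('mmult', [('a', 4, 'int32'),
                              ('b', 4, 'int32'),
                              ('c', 4, 'int32')])
print(proto)  # void mmult(int32_t a[4], int32_t b[4], int32_t c[4]);
```

The generated wrapper then only has to extract the underlying buffers from the Python objects and pass their addresses to a function with this signature.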
E. Pyrite Rewriter

Pyrite refactors the original Python code to use hardware accelerators. Given a Python module containing a function that interacts with an accelerator, users must modify their application code to import this module and call the function from the module at the appropriate location. Additionally, they must take care to choose appropriate allocators to initialize their arrays. The standard NumPy ndarray allocator creates arrays in virtual memory, which requires a Scatter-Gather DMA to transfer the data. PYNQ provides access to the SDSoC contiguous allocator, which is compatible with the Simple DMA and Zero-copy data movers. In Section IV-C we experimentally evaluate the trade-offs of using virtual versus contiguous memory with different DMAs.

Pyrite can be called from the command line like:

  $ pyrite -func foo my_app.py

Pyrite refactors the user's original Python code to import the hardware module and replaces calls to the original Python function with calls to the driver function in the new module. It can also replace calls to ndarray constructors with calls to a helper function that uses the contiguous allocator; this feature is optional and experimental.

F. Current Capabilities & Future Work

We have developed the Hot & Spicy tool suite from scratch to bridge the gap in supporting applications written in the high-level Python language using FPGA accelerators. As such, we have initially focused on the most necessary support and features. In this section, we describe future extensions in three areas: Python-to-C translation, application-level static analyses, and application to additional domains outside of the embedded space.

The sPyC translator currently supports a limited set of Python syntax; however, this can be extended with more development effort. Vivado HLS only supports a restricted set of C/C++ for synthesis, which in turn restricts the capabilities of sPyC. Currently there is support for parsing a single Python file and accelerating a single function. In order to target real-world codebases, we will need to add support for multiple files, imported module tracking, and accelerating multiple functions. Xilinx has recently released the xfOpenCV library [21], which provides HLS-suitable C code for many OpenCV functions. Integrating these into Python with our tools would require adding support for OpenCV's cv::Mat data structure in the generation of Python C API wrappers; currently, wrappers can only be generated for primitive arrays.

To generate high-performance systems, users will want to take advantage of SDSoC's ability to directly connect multiple accelerators together in a dataflow style to reduce the amount of data movement between the processor and the fabric. This will require application-level static analyses that perform dependency tracking to determine which functions can be directly connected. This is an advanced feature, but one that cannot be enabled without the strong foundation of tools presented in this work. Additionally, other analyses can be added to automatically insert the pragmas necessary for configuring interfaces. Similarly, our tools can support HLS pragma insertion for automatically accelerating functions.

Our work is highly dependent on system design tools from Xilinx, namely SDSoC, in the embedded space targeting SoC-based FPGAs. There is currently nothing as mature as SDSoC with a C/C++ front end available from any other source (vendor or academic). As new tools emerge, our work can be applied to those as well. Another application of our tools is in supporting Python with the OpenCL-based vendor tools and PCIe platforms, which enable higher performance on larger devices and access to cloud-based applications, where Python is used even more heavily than in embedded systems.

IV. EXPERIMENTS & RESULTS

While developing the Hot & Spicy tools, we created a suite of unit tests to validate each tool individually, as well as tests for the integration of the various tools into the Python-based and C/HDL-based flows. We present four experiments that evaluate the performance of the tools and the overheads that are incurred. First, we evaluate sPyC by analyzing the time it takes to translate various functions. Then we evaluate the overhead of using a dynamic library versus the default SDSoC static library. Next, we evaluate the overhead of calling an accelerator from Python versus calling it from a regular C program. Finally, we present the acceleration of a pure Python Canny edge detection application using existing source code from widely available open-source resources.

We evaluated the applications by running through the implementation flow using the Hot & Spicy tool suite and the 2017.2 version of the Xilinx SDSoC tool (which includes the 2017.2 versions of Vivado, Vivado HLS, and SDK). We targeted the PYNQ board, which has a Zynq-7000 7020 with a dual-core ARM Cortex-A9 processor running at 650MHz and integrated FPGA fabric.

Table I: Results of sPyC source-to-source translation tests

Test Name    Python LoC    C LoC    Translate time [seconds]    HLS time [seconds]
Empty                 3         7                       0.066                 56.47
For                   4         9                       0.073                 55.17
IfElifElse            7        14                       0.064                 53.69
MMult                61        31                       0.070                 67.07
Canny               250       107                       0.078                249.15
Huge1k            1,000     1,503                       0.111                 57.90
Huge10k          10,000    15,003                       0.554                151.16
Table II: Overhead of using dynamic versus static library on DMA setup time

Test Name        DMA Type        Coherency     Contiguity      Scenario        Static Setup [µs]  Dynamic Setup [µs]  Overhead [µs]
mmult            Simple          Coherent      Contiguous      Standard                     1.70               1.91           0.21
mmult_sg         Scatter-Gather  Coherent      Contiguous      Standard                     8.06               8.40           0.34
mmult_hp         Simple          Non-Coherent  Contiguous      Cache Flush                 18.01              19.20           1.19
mmult_sg_hp      Scatter-Gather  Non-Coherent  Contiguous      Cache Flush                 22.47              23.20           0.73
mmult_malloc     Scatter-Gather  Coherent      Non-Contiguous  Page Pinning               198.16             201.77           3.61
mmult_hp_malloc  Scatter-Gather  Non-Coherent  Non-Contiguous  Page Pin+Flush             217.83             218.86           1.03
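The Overhead column in Table II is simply the dynamic-library setup time minus the static-library setup time, as a quick arithmetic check over the table's values confirms:

```python
# (static setup, dynamic setup, reported overhead) in microseconds,
# copied from Table II.
rows = {
    'mmult':           (1.70,   1.91,   0.21),
    'mmult_sg':        (8.06,   8.40,   0.34),
    'mmult_hp':        (18.01,  19.20,  1.19),
    'mmult_sg_hp':     (22.47,  23.20,  0.73),
    'mmult_malloc':    (198.16, 201.77, 3.61),
    'mmult_hp_malloc': (217.83, 218.86, 1.03),
}
for name, (static_us, dynamic_us, overhead_us) in rows.items():
    # Each reported overhead matches dynamic - static to 0.01 µs.
    assert round(dynamic_us - static_us, 2) == overhead_us, name
print('all overheads = dynamic - static')
```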

We ran the accelerators at 100MHz for simplicity, even though many of the cores can achieve higher performance; we leave system customization to maximize performance to future work. We used a modified version of the PYNQ 2.0 Ubuntu image with the SDSoC library updated from version 2016.1 to version 2017.2 to be compatible with the latest tools.

A. Evaluation of sPyC Translations

To evaluate the performance of sPyC in translating Python to C, we used a series of functions to see how well our tool supports real-world code. We translated an empty function (with no body), a simple for loop, a set of conditional statements, a matrix multiplication function with triple for loops, the Canny edge detection code, and artificial functions with bodies containing from 1,000 to 10,000 lines of code. Table I shows the results of translating these functions to C using sPyC on a PC with a 2.2GHz Intel Xeon E5-2630 and 512GB of memory (note that only up to 16GB of memory was ever used). For each test, we recorded the number of lines of code in the entire source file, not just the function that was translated, since the whole file is parsed by the tool. We also recorded the number of lines of C code in the output file and the time necessary both for sPyC to translate Python to C and for Vivado HLS to synthesize C to HDL. As the table shows, sPyC runs much faster than Vivado HLS and can handle large, real-world functions.

B. Evaluation of a Centralized Library

At runtime, Python buffers are allocated on-the-fly and are accessed by SDSoC's DMA drivers. This required making the SDSoC library a dynamic shared object rather than a static library. The caveat of doing this is the potential additional overhead of each call into the library. We evaluated this overhead by executing various C/C++ programs and collecting the time it took to call into the library.

In this experiment, we used a 32x32 floating-point matrix multiplication application with the same source code in every configuration. We varied pragmas to direct SDSoC to generate systems with different data movers, coherency, and memory management in order to exercise many of the library functions. We collected the time it took to call into the library and set up the DMA to transfer data. For each test, we compiled two executables: one statically linked, and another linked against the dynamic shared library. Each test executes the accelerator 1024 times, and we averaged the runtimes for each function. Table II shows the characteristics of each test and the additional overhead of the dynamic library.

Notice that all tests require more time to set up the DMA with the dynamic library than with the static library. Overall, these overheads are very reasonable given the benefit of using a dynamic library, since the overhead does not depend on the amount of data being transferred. Also notice that the time it takes to set up the DMA varies widely depending on which DMA is used, how memory is allocated, and whether the DMA is connected to a coherent port. Users will need to keep this in mind when determining which pragmas to use in their code.

C. Evaluation of Python Overheads

Calling C/C++ routines from Python incurs additional overhead compared to calling the same routines from a C/C++ application. This experiment used the C/HDL-based flow, where we modified the custom C accelerator code with different pragmas while the Python code remained unchanged. We analyzed these overheads for the same matrix multiplication applications described previously, but at a much coarser level. In this experiment, we capture the time it takes to call the same accelerator function as a whole. This includes setting up the DMAs, starting the accelerator, and waiting for it to complete, using the dynamic library. Table III shows the amount of time it took to call an accelerator from Python versus from C/C++. These tests use the exact same bitstreams and drivers analyzed in the previous experiment with the dynamic library.

In each of these tests the same accelerator is used, but each has a different data transfer configuration. Using a Scatter-Gather DMA (mmult_sg) versus a Simple DMA (mmult) results in 1.8x longer execution of the function.

Table III: Overhead of calling an accelerator from Python for different DMA scenarios and their relative performance

Test Name          C [µs]   Python [µs]   Overhead [µs]   Relative Perf.
mmult               33.48         56.00           22.52             1.0x
mmult_sg            62.41        104.00           41.59             1.8x
mmult_hp            71.41        107.00           35.59             2.1x
mmult_sg_hp        104.97        152.00           47.03             3.1x
mmult_malloc       613.74        673.00           59.26            18.3x
mmult_hp_malloc    661.07        709.00           47.93            19.7x
ecution and using non-contiguous memory (mmult malloc) Table V: Canny edge detection resource results
results in a 18.3x longer execution. Yet, for each of these LUTs FFs DSPs 36kBRAMs
Test Name
tests the overhead of calling the function from Python [53,200] [106,400] [220] [140]
Original Python* 23,246 31,076 108 151.5
averages only 42.3 microseconds. This is because data is Unoptimized HLS 11,722 15,596 20 27.5
not copied from the Python domain to the C/C++ domain, Pipelined HLS 17,061 21,011 20 26.5
instead the same memory is accessed and only addresses are Partitioned HLS 18,250 22,356 30 28.5
*Note - Original Python results are post-synthesis as it failed placement
passed.
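The per-call overhead methodology above can be sketched in plain Python. The following is a minimal illustration, not the paper's actual harness: it times repeated calls into a dynamic shared library through ctypes, using libc's getpid as a stand-in for the SDSoC accelerator entry point (whose real name and signature are not given in this excerpt), and compares that against a trivial native Python call to approximate the foreign-call overhead.

```python
# Sketch of measuring per-call overhead into a dynamic shared library
# from Python, in the spirit of the Table III experiment. libc's getpid
# stands in for the accelerator call; the real SDSoC library name and
# entry points are assumptions not shown in this excerpt.
import ctypes
import ctypes.util
import time

# On POSIX systems this loads the C library (find_library may return
# None, in which case CDLL(None) exposes the process's own symbols).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

def time_calls(fn, n=10000):
    """Return the average wall-clock time per call of fn(), in microseconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1e6

# A trivial foreign call versus a trivial Python call: the difference
# approximates the ctypes call and argument-marshalling overhead.
foreign_us = time_calls(libc.getpid)
native_us = time_calls(lambda: 0)
print(f"foreign: {foreign_us:.2f} us, native: {native_us:.2f} us")
```

On the target board, the same timing pattern would wrap the library's buffer setup, accelerator start, and wait-for-completion calls instead of getpid.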
These results showed large performance benefits from transferring contiguous memory over virtual memory. Another important factor is the performance of the allocation itself. Allocating a 4KB buffer in Python with a non-contiguous allocator takes approximately 32 microseconds, compared to 283 microseconds for a contiguous buffer. Even with the additional latency, in many cases higher performance is achieved using contiguous memory by carefully managing allocations and reusing buffers.

D. Canny Edge Detection

For this experiment, we sought to accelerate a complex real-world application in Python, with the goal of using an existing implementation in the style of Python rather than C/C++ rewritten in Python. To evaluate our tools, we chose an open-source Canny edge detection application designed and written in Python [15]. First, we took the unmodified source code and tweaked it to use syntax supported by sPyC. This required modifying 11 lines of code out of 279, adding the annotations necessary for typing as described in Section III-B. Then we attempted to implement the function as-is via HLS, but placement of the design on the small 7020 device failed because the algorithm as written used too many copies of the current frame, which consumed too many BRAMs. To get the design to fit, we combined the various loops to operate on line buffers rather than whole frames. Finally, we added directives to improve performance.

Table IV shows the performance results of running the Canny edge detection algorithm on the PYNQ board for a 320x240 image. The Original Python is the unmodified open-source implementation, and the Refactored Python is the version modified to meet the sPyC syntax and refactored to fit on the device, but executed in software. The Unoptimized HLS uses the refactored code implemented in hardware with no performance pragmas (pipeline, unroll, etc.). The Pipelined HLS version has some loops pipelined, and the Partitioned HLS has both pipelining and some arrays partitioned. We also implemented a version of the application that used the OpenCV library function in software for comparison.

Table IV: Canny edge detection performance results

    Test Name           Performance   Speedup
    Original Python       48.14 sec      1.0x
    Refactored Python    139.28 sec      0.3x
    Unoptimized HLS       58.68 ms     820.0x
    Pipelined HLS         12.22 ms   3,939.0x
    Partitioned HLS        1.23 ms  39,137.0x
    OpenCV                 7.19 ms   6,695.0x

These results show that modifying source code to perform well in HLS is often counter-intuitive to improving software performance: the refactored code ran significantly slower in Python. Yet implementing that refactored code in hardware gave an 820x speedup over the original Python implementation. After adding some HLS pragmas (pipeline, array partition, etc.) without modifying the algorithm, we achieved a 39,137x speedup, which was also 6x faster than the OpenCV version. These results show that it is possible, from Python, to manipulate an algorithm to achieve FPGA performance using our tools.

Table V shows the post-place-and-route resource results of implementing the various systems. The Original Python version failed to implement because too many BRAMs could not be placed on the device (151.5 required out of 140 available); as such, its results are post-synthesis only. After refactoring the code, the Unoptimized HLS system was implemented successfully. Notice that the Pipelined HLS example uses fewer BRAMs than the Unoptimized HLS. This is because some arrays were implemented in flip-flops instead of BRAMs; as a result, the pipelined design used 5,415 more flip-flops. The final Partitioned HLS design used more BRAMs as arrays were partitioned, and more LUTs and flip-flops were used to generate muxes to interface all of these memories. In addition, more DSPs could be fed thanks to the memory partitioning.

V. CONCLUSION

We presented Hot & Spicy, a tool suite for integrating FPGA accelerators into Python applications. We evaluated the capabilities of the tools and showed that the overheads of accessing accelerators from Python are minimal. In this paper, we evaluated the tools by directly accelerating Python code for a Canny edge detection application, achieving a 39,137x speedup over the initial software design and a 6x speedup compared to a high-performance, hand-tuned OpenCV implementation. In the future, we will evaluate integrating existing C/C++ HLS accelerators, such as the xfOpenCV library, and existing RTL into Python applications.
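To make the BRAM-saving refactoring from the Canny experiment concrete, the sketch below shows the line-buffer idiom in Python: instead of keeping whole-frame copies (which an HLS tool maps to many frame-sized BRAM arrays), only a small sliding window of rows is kept live. This is our illustration of the technique, not the paper's actual refactored kernel, using a simple 3x3 box blur in place of the Canny stages.

```python
# Illustrative line-buffer refactoring (not the paper's code): keep a
# 3-row sliding window instead of frame-sized intermediate copies, so
# only small buffers need to be mapped to on-chip memory.
def blur3x3_linebuffer(frame):
    """3x3 box blur over interior pixels using a 3-row line buffer."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    # Line buffer: only three rows are ever held, not a whole frame copy.
    rows = [frame[0], frame[1], frame[2]]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += rows[dy][x - 1 + dx]
            out[y][x] = acc // 9
        if y + 2 < h:
            # Slide the window down one row: drop the oldest, load the next.
            rows = [rows[1], rows[2], frame[y + 2]]
    return out
```

In the hardware version, each such small buffer can then be pipelined or partitioned with HLS pragmas, which is what separates the Unoptimized, Pipelined, and Partitioned HLS rows in Tables IV and V.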
The tools are open source and available online at https://spicy.isi.edu with all of the examples and tests mentioned in this work.

ACKNOWLEDGMENT

This work was supported by the National Aeronautics and Space Administration (NASA) under grant 80NSSC17K0286. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of NASA or the U.S. Government.

The authors would like to thank the Xilinx University Program for their donation of boards and licenses. Thanks to both the PYNQ and SDSoC teams at Xilinx for their support.

REFERENCES

[1] D. Beazley. SWIG: An Easy to use tool for Integrating Scripting Languages with C and C++. USENIX Tcl/Tk Workshop, July 1996.

[2] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith. Cython: The Best of Both Worlds. Computing in Science & Engineering, 13(31), 2011.

[3] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Czajkowski. LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Systems. International Symposium on Field Programmable Gate Arrays, Feb. 2011.

[4] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 2011.

[5] Continuum Analytics. Numba. Dec. 2017. http://numba.pydata.org/.

[6] J. Decaluwe. MyHDL: A Python-based hardware description language. Linux Journal, 2004(127), Nov. 2004.

[7] N. Diakopoulos and S. Cass. The 2017 Top Programming Languages. IEEE Spectrum, July 2017.

[8] M. Dufour. ShedSkin: An experimental (restricted) Python-to-C++ Compiler. June 2013. https://shedskin.github.io/.

[9] R. Gonzalez, P. House, I. Levkivskyi, L. Roach, and G. van Rossum. PEP 526 – Syntax for Variable Annotations. Index of Python Enhancement Proposals, Aug. 2016. https://www.python.org/dev/peps/pep-0526/.

[10] S. Guelton, P. Brunet, M. Amini, A. Merlini, X. Corbillon, and A. Raynaud. Pythran: Enabling Static Optimization of Scientific Python Programs. Object-oriented Programming Systems, Languages, and Applications, 8(1), 2015.

[11] H. Kataoka and R. Suzuki. Polyphony: A Python-Based High-Level Synthesis Compiler. Workshop on Open Source Supercomputing, June 2017.

[12] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui, and J. Carrillo. SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2016.

[13] E. Logaras, O. G. Hazapis, and E. S. Manolakos. Python to Accelerate Embedded SoC Design: A Case Study for Systems Biology. ACM Transactions on Embedded Computer Systems, 13(4), Mar. 2014.

[14] A. Rigo and S. Pedroni. PyPy's approach to virtual machine construction. Computational Science & Discovery, Oct. 2006.

[15] Rosetta Code. Canny Edge Detector. https://rosettacode.org/wiki/Canny_edge_detector#Python.

[16] D. Sheffield, M. Anderson, and K. Keutzer. Automatic Generation of Application-specific Accelerators for FPGAs from Python Loop Nests. International Conference on Field Programmable Logic, Oct. 2012.

[17] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2), 2011.

[18] G. van Rossum, J. Lehtosalo, and Ł. Langa. PEP 484 – Type Hints. Index of Python Enhancement Proposals, Sept. 2014. https://www.python.org/dev/peps/pep-0484/.

[19] Xilinx. DS083: Virtex-II Pro and Virtex-II Pro X Platform FPGAs. Jan. 2002.

[20] Xilinx. PYNQ: Python Productivity for Zynq. 2017. http://www.pynq.io/.

[21] Xilinx. Xilinx xfOpenCV Library. June 2017. https://github.com/Xilinx/xfopencv.

[22] R. Cieszewski, K. Pozniak, and R. Romaniuk. Python based high-level synthesis compiler. Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments, 9290(3A), 2014.
