
Mapping and Execution of Nested Loops on Processor Arrays: CGRAs vs. TCPAs
Dominik Walter, Marita Halm, Daniel Seidel, Indrayudh Ghosh,
Christian Heidorn, Frank Hannig, Jürgen Teich
[email protected]
Hardware/Software Co-Design, Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany
arXiv:2502.12062v1 [cs.AR] 17 Feb 2025

Abstract—Increasing demands for computing power also propel the need for energy-efficient SoC accelerator architectures. One class of such accelerators are so-called processor arrays, which typically integrate a two-dimensional mesh of interconnected processing elements (PEs). Such arrays are specifically designed to accelerate the execution of multidimensional nested loops by exploiting the intrinsic parallelism of loops. Moreover, for mapping a given loop nest application, two opposed mapping methods have emerged: operation-centric and iteration-centric. Both differ in the granularity of the mapping. The operation-centric approach maps individual operations to the PEs of the array, while the iteration-centric approach maps entire tiles of iterations to each PE. The operation-centric approach is applied predominantly for processor arrays often referred to as Coarse-Grained Reconfigurable Arrays (CGRAs), while processor arrays supporting an iteration-centric approach are referred to as Tightly-Coupled Processor Arrays (TCPAs) in the following. This work provides a comprehensive comparison of both approaches and related architectures by evaluating their respective benefits and trade-offs. We analyzed five toolchains for loop mapping on CGRAs and TCPAs, evaluating both qualitative factors (e.g., intuitiveness, flexibility) and quantitative metrics (power, performance, area). As a result, it is shown that for an equal number of PEs, both architectures offer distinct advantages: the simpler structured CGRAs offer a better area cost (6.26×) and a lower power consumption (1.69×) when implemented on an FPGA, while TCPAs dominate in terms of achievable performance and efficiency, outperforming CGRAs in all tested benchmarks by up to 19×. This suggests that CGRAs are particularly suitable for area-constrained applications, while TCPAs excel in performance-critical scenarios by exploiting multicycle operations as well as multi-level parallelism in higher-dimensional loop nests.

Index Terms—Loop accelerators, CGRA, TCPA

I. INTRODUCTION

The escalating demand for computing power has driven the evolution of diverse compute architectures, each tailored to maximize performance, efficiency, or scalability. A prominent approach is organizing many processing elements (PEs) into a two-dimensional grid, known as a processor array. This class can be further distinguished between arrays of complex, full-fledged processor cores and arrays in which the PEs act more as compute units with minimal programmability. Processors of the first class are commonly known as manycores [1, 2] or Massively Parallel Processor Arrays (MPPAs) [3] and target general-purpose applications written in typical parallel programming models such as, for example, OpenCL. In this work, we focus on the alternative: arrays of tiny PEs that utilize a more fine-grained mapping approach. Unlike GPUs that exploit vectorized data processing over multiple programs in SPMD mode, or application-specific processors such as tensor cores and AI accelerators, these arrays offer individually programmable and locally register-to-register communicating PEs to efficiently support the execution of loop nests with loop-carried dependencies. Therefore, processor arrays are particularly well-suited for executing multidimensional nested loops in parallel, such as matrix computations and linear algebra problems in general. Such applications contain a substantial amount of inherent parallelism that can be exploited across multiple PEs. However, the process of automatically mapping such loops onto an array is complex and requires sophisticated mapping strategies. Recent years have seen a tremendous boost in the development of processor arrays specifically designed to target loop nests, leading to a diversity of architectures, mapping approaches, and toolchains. Out of this research, two principally different mapping philosophies have emerged:

1) Operation-centric mapping: A multidimensional loop nest is executed sequentially except for the innermost loop. The operations and data dependencies of the loop body are captured in a data flow graph (DFG). Each PE is then assigned a set of operations (nodes in the DFG), while data dependencies (edges) are mapped onto the interconnections between PEs. A resulting sequence of operations per PE is then generated, loaded, and repeatedly executed in such a way that the execution of iterations of the innermost loop can overlap in time.

2) Iteration-centric mapping: The mapping granularity extends beyond individual operations to entire tiles of iterations. Here, the multidimensional iteration space is divided into congruent tiles of iterations, with all iterations within a tile being assigned for execution to a single PE. Hence, parallelism can potentially be exploited in multiple loop dimensions while preserving the inherent data locality of loops, as illustrated by the sketch following this list.
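To make the contrast concrete, the following minimal sketch (our own illustration; the 2 × 2 array, tile shape, and the four-node loop body are assumptions, not taken from any specific architecture) shows the same N × N × N matrix-multiplication loop nest from both perspectives: operation-centric mapping distributes the handful of loop-body operations over the PEs, whereas iteration-centric mapping distributes whole tiles of iterations.

```python
# Illustrative sketch only: the same loop nest seen at the two mapping granularities.
import itertools

N = 4                     # assumed problem size
PE_GRID = (2, 2)          # assumed 2 x 2 processor array

# Operation-centric view: only the loop *body* is mapped. Every iteration runs
# the same small data flow graph, here for C[i0][i1] += A[i0][i2] * B[i2][i1].
loop_body_dfg = {
    "ld_a": [],                # load A[i0][i2]
    "ld_b": [],                # load B[i2][i1]
    "mul":  ["ld_a", "ld_b"],  # partial product
    "acc":  ["mul"],           # accumulate into C[i0][i1]
}
# A mapper would bind these nodes to PEs and overlap successive innermost
# iterations in time (software pipelining with some initiation interval II).

# Iteration-centric view: whole tiles of iterations are mapped. The i0 x i1
# plane is split into 2 x 2 tiles, one tile per PE; each PE runs all iterations
# of its tile sequentially while all PEs run in parallel.
tile = (N // PE_GRID[0], N // PE_GRID[1])
assignment = {}            # PE index -> iterations (i0, i1, i2) executed by it
for i0, i1, i2 in itertools.product(range(N), repeat=3):
    pe = (i0 // tile[0], i1 // tile[1])
    assignment.setdefault(pe, []).append((i0, i1, i2))

print(len(loop_body_dfg), "loop-body operations to distribute (operation-centric)")
for pe, iters in sorted(assignment.items()):
    print(f"PE{pe} executes {len(iters)} iterations (iteration-centric)")
```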
Architectures supporting operation-centric mappings are commonly referred to as Coarse-Grained Reconfigurable Arrays (CGRAs) [4, 5], while those following an iteration-centric mapping are called Tightly-Coupled Processor Arrays (TCPAs) [6, 7, 8] in the following. Both approaches target the acceleration of multidimensional loop programs, with authors claiming high speedups, low hardware costs, and low power consumption.

Figure 1. Operation-centric mapping approach of Coarse-Grained Reconfigurable Arrays. On the left, a simplified data flow graph (DFG) of a matrix multiplication is shown. The nodes representing operations are grouped into indices computation (blue), address computation (brown), memory access (purple), and finally multiply and accumulate operations (red). Edges denote the data dependencies. This DFG is mapped onto the 4 × 4 CGRA architecture shown on the right. Each PE contains a functional unit (FU), e.g., an ALU, a local register file, switches, and a configuration memory, adapted from [10].

This paper aims to evaluate the pros and cons of both mapping strategies and related architectures by providing a comparative analysis of CGRAs and TCPAs. Although both classes of architectures have been extensively studied, this paper is, to the best of our knowledge, the first to provide an in-depth comparison of both approaches by:

• Describing the essentials of the architecture classes, mapping approaches, and available toolchains for CGRAs and TCPAs in detail in Section II and Section III.
• Examining qualitative factors, i.e., intuitiveness, robustness, correctness, scalability, flexibility, and limitations, of available CGRA and TCPA toolchains in Section IV.
• Conducting a quantitative comparison of power, performance, and area (PPA) for both architecture classes in Section V for a set of loop nest benchmarks.
• Discussing resulting trade-offs and limitations of both classes of architectures and mapping approaches in Section VI and Section VII.

II. COARSE-GRAINED RECONFIGURABLE ARRAYS

Coarse-grained reconfigurable arrays (CGRAs) were first presented in the 1990s, and since then, many CGRA architectures have been developed by both industry and academia, see, e.g., [4, 5]. Wijtvliet et al. [4] state that most architectures are academic projects, but they have also found commercial usage [9]. In the following, we first introduce the architecture of CGRAs in Section II-A and discuss the mapping approach in Section II-B.

A. Architecture

According to [11], a typical CGRA architecture consists of a network of interconnected PEs arranged in a two-dimensional grid, as shown in Figure 1 (right). To keep the hardware simple and modular, each PE contains one functional unit (FU), a set of local registers potentially arranged in a register file, a crossbar switch, and an instruction memory. The FU usually supports different arithmetic, logic, and memory operations at the word level. The local registers are used as temporary data storage for intermediate results. The crossbar connects the PE with its adjacent neighbors, enabling data transfer between neighboring PEs in a single cycle. The operation performed by the FU and the routing of the crossbar can be configured at the granularity of clock cycles. The instruction memory can store a sequence of predetermined per-cycle configurations to execute one loop iteration. For example, a configuration can specify that the FU of a PE performs an addition in one cycle and that the crossbar forwards the result to a neighboring PE in the north, east, south, or west direction. In the next cycle, the PE can perform a multiplication, storing the result in a local register. Such a predetermined sequence of configurations is repeated for each iteration of a given loop nest. Most CGRAs also support conditional execution by predication, i.e., the execution of some instructions is masked by a predication bit that was set by a conditional instruction. According to Figure 1, typically only a subset of PEs has direct access to an attached on-chip scratchpad memory (SPM) that can buffer input and output locally. Moreover, because only neighboring PEs can read or write data within one clock cycle, transferring data to a PE further away usually requires multiple cycles, while the intermediate PEs are then occupied for communication. HyCUBE [10, 12] alleviates these issues by a reconfigurable interconnect with single-cycle multi-hop connections.

B. Mapping

CGRAs are designed to accelerate nested loops by utilizing an operation-centric mapping approach. The operations and data dependencies of the loop body of a multidimensional nested loop, specified, e.g., by a C/C++ program, are captured in a DFG (V, E) in which a node vi ∈ V represents an operation, and an edge (vi, vj) ∈ E represents a dependency between nodes vi and vj. The mapping process of such a DFG onto a CGRA can be summarized as a) binding, i.e., assigning each node vi a PE β(vi), b) scheduling, i.e., assigning each node vi a start time τ(vi), and c) routing, i.e., assigning each edge (vi, vj) a route connecting the PEs β(vi) and β(vj) such that the data arrives exactly at the right cycle at the FU of the target PE. This is achieved by an allocation of ri,j register slots, i.e., a register at a certain time, that must satisfy τ(vi) + di + ri,j = τ(vj), where di denotes the latency of the node (operation) vi. The DFG describes the data dependencies of just one single loop iteration that is mapped and scheduled on the CGRA, in which node vi of loop iteration i2 is to be computed at time τ(vi) + II · i2, where II denotes the initiation interval. Since it is possible that there is a node vi that is planned after the next iteration has already been started, i.e., τ(vi) ≥ II, the execution of multiple iterations may overlap in time. Overall, the mapping process aims to minimize the II, as it directly reflects the overall latency of the resulting loop nest execution. While there exists a wide variety of approaches and toolchains in the literature, we give in the following a general overview of the DFGs of multidimensional loop programs. Specific toolchains will be introduced later in Section II-C.
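As a hedged illustration of these three sub-problems (the DFG, binding, schedule, routing latencies, and II below are invented toy values, not output of any of the toolchains discussed later), the following sketch checks whether a candidate operation-centric mapping is consistent: every edge must satisfy τ(vi) + di + ri,j = τ(vj), and because iteration i2 starts at τ(vi) + II · i2, no PE may be asked to start two operations in the same cycle modulo II.

```python
# Minimal sketch of the constraints behind operation-centric mapping; the DFG,
# binding, schedule, and latencies below are made-up illustrative values.
from collections import defaultdict

dfg_edges = [("add", "cmp"), ("cmp", "sel"), ("add", "mul"), ("mul", "acc")]
latency   = {"add": 1, "cmp": 1, "sel": 1, "mul": 1, "acc": 1}   # d_i per node
binding   = {"add": (0, 0), "cmp": (0, 1), "sel": (0, 0), "mul": (1, 0), "acc": (1, 1)}
schedule  = {"add": 0, "cmp": 1, "sel": 2, "mul": 1, "acc": 2}   # tau(v_i)
routing   = {("add", "cmp"): 0, ("cmp", "sel"): 0, ("add", "mul"): 0, ("mul", "acc"): 0}
II = 3

def mapping_is_valid(edges, latency, binding, schedule, routing, II):
    # 1) Data must arrive exactly when the consumer starts:
    #    tau(v_i) + d_i + r_ij == tau(v_j) for every edge (v_i, v_j).
    for (vi, vj) in edges:
        if schedule[vi] + latency[vi] + routing[(vi, vj)] != schedule[vj]:
            return False
    # 2) Modulo resource constraint: node v of iteration k starts at
    #    tau(v) + II * k, so two nodes on the same PE must not collide mod II.
    slots = defaultdict(set)
    for v, pe in binding.items():
        slot = schedule[v] % II
        if slot in slots[pe]:
            return False
        slots[pe].add(slot)
    return True

print("mapping valid:", mapping_is_valid(dfg_edges, latency, binding, schedule, routing, II))
```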

Example. Consider Figure 1. Shown is a simplified but representative DFG of a typical 3-dimensional loop nest for computing a matrix-matrix multiplication. Each node denotes one operation, and the edges show the dependencies between the operations. For the execution of each iteration of the 3-dimensional loop nest with index vector (i0, i1, i2), four types of computations are involved: (a) Determination of the current loop indices—the corresponding operations are shown in Figure 1 on the left. Each loop index computation requires three operations, e.g., to compute the innermost loop (index i2), a Sel, Add, and Cmp operation is needed. Sel is a multiplex operation that uses the result of the Cmp instruction to either forward the output of the Add operation or zero. The Cmp compares the result of the Add operation, which increments the current loop index, against a predefined constant, here, the loop bound. Note that the data dependencies towards the Sel operations are inter-iteration dependencies. This effectively implements a cyclic accumulator. Furthermore, the result of the Cmp operation can also be used as an addend for the Add operation of the next loop index. Therefore, the second level (index i1) is only incremented once the first one reaches its loop bound, implementing a two-dimensional loop counter. This can be repeated for a third outer dimension (index i0) as shown in Figure 1. (b) Then, once the loop indices have been properly determined for the current iteration (i0, i1, i2), the addresses of the matrix elements that are to be accessed in this iteration must be computed. This is done by multiplying the loop indices with fixed strides and adding the results together. (c) Afterward, the computed addresses are used to load the inputs and store the output (see the memory access section in Figure 1). A restriction here is that, in contrast to the other operations, the corresponding Load and Store operations cannot be executed on all PEs, but only on those PEs that have access to the SPM, which are, as shown in Figure 1, only the border PEs. (d) Only then can the Mul and Add operations, forming the only computational part of the loop nest, i.e., one partial product, be computed before the result is written back by a Store operation. By studying the above DFG, we can already observe some interesting properties of CGRAs and related operation-centric mappings. First, note that the DFG contains multiple performance-constraining cycles, e.g., Sel → Add → Cmp → Sel, inside the indices computation. As a consequence, the Sel operation of the next iteration cannot be started before the Cmp and Add operations of the previous iteration are completed. Thus, the cycle length determines a minimal possible II, called the recurrence minimum initiation interval (RecMII). Also, the minimal possible II may be further limited by a resource minimum initiation interval (ResMII). For example, given a CGRA with 9 PEs, the actual minimal possible II is 3, because with II = 2, each iteration would only allow for 9 · 2 = 18 nodes to be scheduled. This is due to the massive overhead for computing the indices and addresses: for the considered loop nest example, i.e., a loop body consisting of only a single MAC computation, the resulting DFG consists of a total of 22 nodes to be executed on the PEs per iteration. Note that this example is still highly simplified; real-world DFGs are much more complex, but similar in structure. The merits and drawbacks of such operation-centric mapping approaches are analyzed in Section IV and Section V.
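The two lower bounds can be computed directly from the DFG; the sketch below (a toy five-node stand-in for the 22-node graph of Figure 1, purely illustrative) derives ResMII from the node count and PE count and RecMII from the longest dependence cycle.

```python
# Toy sketch of the two initiation-interval lower bounds (ResMII, RecMII).
# The graph is a small stand-in for the index-computation cycle Sel->Add->Cmp->Sel.
import math, itertools

nodes   = {"sel": 1, "add": 1, "cmp": 1, "mul": 1, "acc": 1}        # latency per node
edges   = [("sel", "add"), ("add", "cmp"), ("cmp", "sel"),          # the cycle
           ("add", "mul"), ("mul", "acc")]
num_pes = 9

def res_mii(nodes, num_pes):
    # Each PE can start at most one operation per cycle of the II window.
    return math.ceil(len(nodes) / num_pes)

def rec_mii(nodes, edges):
    # Latency of the longest simple dependence cycle (brute force is fine here).
    best = 1
    for k in range(2, len(nodes) + 1):
        for cycle in itertools.permutations(nodes, k):
            closed = list(zip(cycle, cycle[1:] + (cycle[0],)))
            if all(e in edges for e in closed):
                best = max(best, sum(nodes[v] for v in cycle))
    return best

print("ResMII =", res_mii(nodes, num_pes))   # ceil(5 / 9) = 1 for this toy DFG
print("RecMII =", rec_mii(nodes, edges))     # 3: the Sel -> Add -> Cmp cycle
```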
C. Tools

Many different CGRA architectures have been proposed over the years, and some related toolchains are also publicly available. This paper selects four representative toolchains for analysis and comparison with TCPA approaches, including CGRA-Flow [13], Morpher [14], Pillars [15], and CGRA-ME [16], which are briefly introduced in the following.

1) CGRA-Flow: CGRA-Flow [13], also known as OpenCGRA, is a toolchain for the compilation, exploration, synthesis, and development of CGRA architectures [13]. It is open-source and available on GitHub¹. CGRA-Flow has a GUI for visualizing input, output, and intermediate results. As input, users describe or select a loop program written in C/C++. CGRA-Flow supports an operation-centric mapping of up to two innermost loop nests with control flow in the loop body, or up to three innermost loop nests without any control flow in the loop body, onto the user-specified CGRA. Within the GUI, users can configure a CGRA architecture instance by selecting the number of PEs, the number of operations mapped to one PE, the size of the memory buffer, the operation types that each PE can execute, the connections to neighboring PEs, and the disablement of entire PEs. Note that each PE can only perform single-cycle operations. Compiling the user-given loop and generating the corresponding DFG is performed using LLVM's [17] intermediate representation to extract the operations and the dependencies between operations. The generated DFG can be visualized in the GUI. Before starting the mapping phase, the user can select between two mapping algorithms, called exhaustive and heuristic. While the exhaustive algorithm checks all possible mappings for one given initiation interval II, the heuristic approach starts with a minimal initiation interval and iteratively increments it until a mapping with the lowest cost (based on a heuristic function) for the current initiation interval II has been found. The resulting mapping is then visualized in the GUI. After mapping, the user can generate Verilog via PyMTL to run various tests on the architecture [13] and to estimate the area and power of PEs and on-chip memory.

¹ https://github.com/tancheng/CGRA-Flow
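The heuristic strategy just described boils down to a small search driver around the actual placer/router. The sketch below is our own simplification; try_to_map is a hypothetical stub that only models the resource bound, not CGRA-Flow's real mapper.

```python
# Sketch of the II-search driver used around operation-centric mappers such as
# CGRA-Flow's heuristic mode: start at the smallest feasible II and increase it
# until the placer/router succeeds. `try_to_map` is a purely illustrative stub.
import math

def try_to_map(dfg, num_pes, ii):
    # Stand-in for a real placement/routing heuristic. Here we only model the
    # resource bound: succeed once the DFG fits into num_pes * ii issue slots.
    return "some-mapping" if len(dfg) <= num_pes * ii else None

def map_loop_body(dfg, num_pes, rec_mii):
    ii = max(math.ceil(len(dfg) / num_pes), rec_mii)   # ResMII / RecMII lower bound
    while True:
        mapping = try_to_map(dfg, num_pes, ii)
        if mapping is not None:
            return ii, mapping          # the exhaustive variant would instead try
        ii += 1                         # all mappings for one fixed II

print(map_loop_body(dfg=["op%d" % k for k in range(22)], num_pes=9, rec_mii=3))
# -> (3, 'some-mapping'): 22 operations on 9 PEs need at least II = 3
```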

2) Morpher: Morpher is an integrated compilation and simulation toolchain [14] available on GitHub². As input, the user provides a description of the target CGRA architecture and a loop program written in C/C++ that should be mapped onto the target architecture. The DFG generator begins by extracting the innermost loop of the program and generating the corresponding DFG using LLVM [17]. It offers three schemes for handling loop control flow on a CGRA, influencing DFG generation: partial predication, full predication, and dual-issue [14], described in detail in [18]. Partial predication maps the if-part and else-part operations to different PEs, adding a select node if both parts update the same variable. Full predication schedules both parts using the same variable to the same PE, with one operation executed per cycle, avoiding the need for a select node. Dual-issue merges both operations into one DFG node, scheduling them simultaneously but only executing one at run-time. In this work, we only consider partial predication because it was the most reliably supported mapping. The resulting DFG and the data layout for input/output variables in the SPM are then used by the CGRA mapper to find a valid mapping using three algorithms: PathFinder [19], Simulated Annealing [20], and LISA [21]. The mapping can then be verified against automatically created test data [22] by simulating the execution using a simulator that models a CGRA with FUs, registers, multiplexers, and memory banks [14], supporting variations of HyCUBE.

² https://github.com/ecolab-nus/morpher

3) Pillars: Pillars is an open-source CGRA design toolchain based on Scala and Chisel [15]. The toolchain is publicly available on GitHub³. It has been designed as a tool for conducting design space explorations and further hardware optimizations of CGRAs. The user must provide a Scala-based architecture description of the CGRA, which is then systematically converted by Chisel into a synthesizable Verilog description of the CGRA. The design can be synthesized for FPGAs to determine the performance, area, and power consumption of the CGRA. In contrast to the other toolchains, the Pillars toolchain does not support any automated DFG generation from source code. Thus, the user must already provide the application as a DFG. Pillars offers two mapping algorithms, an ILP mapper and a Heuristic Search mapper. The ILP mapper is slow but succeeds more frequently than the Heuristic Search mapper. Therefore, the ILP mapper is used for all loop kernels in this work. The resulting mapping can then either be simulated on the RTL by Verilator, or used to determine the performance, area, and power estimate for the execution on the actual hardware.

³ https://github.com/pku-dasys/pillars

4) CGRA-ME: CGRA-ME is an open-source toolchain for modeling and exploration of CGRAs [16]. It is currently available in its second version (first release [23]) and supports end-to-end CGRA compilation and simulation with RTL code generation. The toolchain is open-source and can be downloaded from the project's website⁴. It uses LLVM to extract a DFG from a given C/C++ source code. However, it does not support nested loops as inputs, and although it supports partial predication, no support for conditional code was available. CGRA-ME maps the extracted DFG onto a target CGRA, which is specified by a provided architecture description. It offers a choice between three different operation-centric mapping approaches. First, CGRA-ME supports an ILP-based mapping that finds an optimal mapping but is very slow. Additionally, a heuristic approach reduces the search space of the ILP. Finally, CGRA-ME also includes a so-called clustered mapper that incorporates a simulated-annealing approach [20] utilizing both QuickRoute [24] and PathFinder [19]. After the mapping, CGRA-ME produces a Verilog description of the given architecture and a bitstream containing the configuration of the CGRA. However, simulation without additional external toolchains is not available.

⁴ https://cgra-me.ece.utoronto.ca/download/

III. TIGHTLY-COUPLED PROCESSOR ARRAYS

In contrast to CGRAs, so-called Tightly-Coupled Processor Arrays (TCPAs) [6, 7, 8] are designed to support an iteration-centric mapping approach. TCPAs are constructed to enable the parallel execution of multidimensional loop nests that are expressed using a strict mathematical model related to polyhedral recurrence equations called Piecewise Regular Algorithms (PRAs) [25, 26], see Section III-B. This specification allows for the most natural description and mapping of loops [27, 28]. For example, contrary to a loop nest in an iterative language such as C/C++, there is no implied order of iteration or operation execution at all. In the following, we first give a short overview of the hardware architecture in Section III-A, followed by further details on the iteration-centric mapping approach.

A. Hardware Architecture

Both CGRA and TCPA architectures feature a two-dimensional array of small programmable PEs with a configurable circuit-switched interconnect, as shown in Figure 2. However, TCPAs differ significantly in their design and capabilities. Unlike CGRAs, each PE of a TCPA may be configured to include not only one but multiple parallel functional units (FUs). Different from a VLIW organization, which can suffer from low code densities, each PE adheres to the principle of orthogonal instruction processing (OIP), see [29]. In principle, each FU has its own instruction memory, branch unit, and program counter, allowing it to run independent micro-programs while sharing flags and registers with all other FUs. Such micro-programs consist of FU-specific instructions and branch instructions. While each FU instruction refers to a certain operation Fi, branch instructions handle conditional jumps within the program. Operand dependencies are managed through a local data register file, with two source operands and one destination operand specified in each instruction, while the control signals used for the branch instructions originate from a control register file. The latter is connected to a Global Controller (GC) that coordinates and synchronizes all PEs, generating the compiler-generated schedule once for the entire array of PEs. The LION I/O controller [31] manages DMA transfers, fetching input and pushing output data between external memory and the four I/O buffers surrounding the array.

Figure 2. Architecture of an 8 × 8 TCPA (left) and the OIP-based [29] PE architecture (right) from [30]. The array is surrounded by 4 I/O buffers with address generators and has peripheral controllers shown left to the array. Each PE has a data and control register file and may have multiple functional units.

Address calculations are all performed by programmable address generators within each memory bank. This approach fully relieves the PEs from handling loop index computations and address calculations.

B. Piecewise Regular Algorithms (PRA)

In a PRA [25, 26], a multidimensional for-loop with n nests is formed by an n-dimensional polyhedral iteration space I ⊆ Zn. Each iteration index i = (i0, ..., in−1)⊺ ∈ I denotes a single iteration of the loop nest. Input data, output data, and computations are defined on a set of variables x ∈ X. Each instance of a variable is indexed by xi[i], i.e., xi at iteration index i. The loop nest itself is described by a set of quantified equations S = {S0, S1, ...} that define data dependencies between variables within the iteration space I ⊆ Zn. These equations are defined as follows, see, e.g., [25, 26]:

Si : xi[Pi i + fi] = Fi(..., yi,j[Qi,j i − di,j], ...)   if i ∈ Ii

For all i ∈ Ii ⊆ I, the variable xi is defined as the result of applying the function Fi to a list of variables yi,j with 1 ≤ j ≤ arity(Fi). Affine indexing functions Pi i + fi and Qi,j i − di,j provide the indices with which the variables are accessed. Furthermore, we name those variables that are only read but never defined the set of input variables Xin, those variables that are only defined but never used the set of output variables Xout, and all others the set of internal variables Xvar. In a PRA, the indexing functions of the internal variables are further restricted to simple translations, i.e., Pi and Qi,j denote identity matrices. Each equation Si is only defined within its domain, i.e., a subspace of the iteration space. The domain is given by a condition space Ii that can usually be described by a system of linear inequalities, thus Ii = {i ∈ I | Ai i ≥ bi} with Ai ∈ Zm×n, bi ∈ Zm. As an example, Figure 3 lists the equations of a PRA implementing a matrix multiplication computing C = A · B with A, B, C ∈ ZN×N. A is read in at i1 = 0 (Eq. S1,a) and propagated along i1 > 0 (Eq. S1,b). Similarly, B is read in at i0 = 0 (Eq. S2,a) and propagated along i0 > 0 (Eq. S2,b). Thus, each iteration has access to a single element of A and a single element of B, which are multiplied (Eq. S3). The result is propagated along i2 (Eq. S4,a), where it is accumulated in every step (Eq. S4,b). The final matrix elements are computed at i2 = N − 1 and written to the output matrix C (Eq. S5,C).

S1,a : a[i] = A[i0, i2]                  if i1 = 0
S1,b : a[i] = a[i0, i1 − 1, i2]          if i1 > 0
S2,a : b[i] = B[i2, i1]                  if i0 = 0
S2,b : b[i] = b[i0 − 1, i1, i2]          if i0 > 0
S3   : p[i] = a[i] · b[i]
S4,a : c[i] = p[i]                       if i2 = 0
S4,b : c[i] = c[i0, i1, i2 − 1] + p[i]   if i2 > 0
S5,C : C[i0, i1] = c[i]                  if i2 = N − 1

Figure 3. A matrix multiplication C = A · B with A, B, C ∈ ZN×N expressed as a PRA with iteration space I = {(i0, i1, i2)⊺ ∈ Z3 | 0 ≤ i0, i1, i2 < N}. The notations x[i] and x[i0, i1, i2] are equivalent.
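Because the equations of Figure 3 carry no implied execution order, they can be evaluated as a plain system of recurrences; the following sketch (our illustration, independent of the PAULA/TURTLE tooling introduced later) sweeps the iteration space in one arbitrary valid order and reproduces C = A · B.

```python
# Reference-style evaluation of the PRA from Figure 3 (C = A * B). No loop order
# is implied by the equations; any order respecting the data dependencies gives
# the same result. Here we simply sweep i0, i1, i2 in ascending order.
import random

N = 4
A = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
B = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
a, b, p, c = ({} for _ in range(4))     # variable instances indexed by (i0, i1, i2)
C = [[0] * N for _ in range(N)]

for i0 in range(N):
    for i1 in range(N):
        for i2 in range(N):
            i = (i0, i1, i2)
            a[i] = A[i0][i2] if i1 == 0 else a[(i0, i1 - 1, i2)]    # S1a / S1b
            b[i] = B[i2][i1] if i0 == 0 else b[(i0 - 1, i1, i2)]    # S2a / S2b
            p[i] = a[i] * b[i]                                      # S3
            c[i] = p[i] if i2 == 0 else c[(i0, i1, i2 - 1)] + p[i]  # S4a / S4b
            if i2 == N - 1:
                C[i0][i1] = c[i]                                    # S5C

expected = [[sum(A[r][k] * B[k][col] for k in range(N)) for col in range(N)]
            for r in range(N)]
assert C == expected
print("PRA evaluation matches A * B")
```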
Now, instead of performing an operation-centric mapping, TCPAs support an iteration-centric mapping that starts with a partitioning (or tiling) step and a subsequent scheduling step, which will be explained in the following together with a number of additional steps like register binding, code generation, I/O buffer allocation, and, finally, the generation of the final TCPA configuration.

C. Partitioning

The iteration space I is partitioned into t0 × · · · × ti × · · · × tn−1 rectangular tiles of size p0 × · · · × pi × · · · × pn−1. Each tile, a subset of the iteration space, is then mapped to a single PE. This approach follows a local sequential, global parallel (LSGP) strategy [25, 26, 32], where each PE implements in its instruction memory a sequential schedule of all iterations within its assigned tile, while all PEs run concurrently. Formally, the partitioned iteration space I∗ is decomposed into an intra-tile space J and an inter-tile space K, where j ∈ J denotes the index vector of an iteration within a tile, and k ∈ K indicates the PE index responsible for executing that iteration. As an example, consider again the PRA of the matrix multiplication shown in Figure 3.

Figure 4 illustrates the corresponding iteration space after partitioning in case of an input matrix size N = 4. Formally, the 4 × 4 × 4 iteration space is divided into t0 × t1 × t2 = 2 × 2 × 1 tiles, each of size p0 × p1 × p2 = 2 × 2 × 4. As a result, each PE is assigned 16 iterations for execution. Each iteration (represented by gray circles) contains four operations: two MOVs for loading and propagating matrix elements a ∈ A ∈ Z4×4 and b ∈ B ∈ Z4×4, and two arithmetic operations to compute c = c + a · b. The arrows between the operations indicate data dependencies that can be further distinguished. Consider the definition of an equation Si. The result of operation Fi defines the variable xi, while Fi itself uses a list of variables yi,j as inputs. This creates data dependencies among equations that can be classified as either intra-iteration, inter-iteration, or input/output if xi is in the output variable space (Xout) or input variable space (Xin). Intra-iteration dependencies, shown by white arrows in Figure 4, occur within the same iteration when both fi and di,j are zero. Inter-iteration dependencies occur between neighboring iterations and are categorized as intra-tile (yellow) or inter-tile (green). Intra-tile dependencies happen within the same tile, while inter-tile dependencies occur across tiles, with each tile executed by a different PE.

Figure 4. Simplified 4 × 4 × 4 iteration space of a matrix multiplication that is tiled into 2 × 2 × 1 tiles and mapped onto a 2 × 2 PE array shown behind. Each gray circle denotes one iteration consisting of 4 operations as specified by the loop body. The edges denote data dependencies, whereas their color denotes their type, i.e., input (red), intra-iteration (white), inter-iteration intra-tile (yellow), or inter-iteration inter-tile (green). Although each iteration contains the same operations, the type of the data dependencies of the contained operations is different, which is reflected by both the position and color of the operation.
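A minimal sketch of this LSGP decomposition for the example (illustrative Python, not the TCPA compiler): with tile sizes p = (2, 2, 4), every iteration index i splits into an intra-tile index j ∈ J and an inter-tile (PE) index k ∈ K.

```python
# LSGP partitioning sketch for the 4 x 4 x 4 example: tile sizes p = (2, 2, 4)
# give 2 x 2 x 1 tiles, i.e. a 2 x 2 PE array with 16 iterations per PE.
from collections import defaultdict
from itertools import product

N = 4
p = (2, 2, 4)                                    # intra-tile extents p0, p1, p2

def decompose(i, p):
    j = tuple(ii % pp for ii, pp in zip(i, p))   # intra-tile index j in J
    k = tuple(ii // pp for ii, pp in zip(i, p))  # inter-tile / PE index k in K
    return j, k

per_pe = defaultdict(list)
for i in product(range(N), repeat=3):
    j, k = decompose(i, p)
    per_pe[k].append((j, i))

for k in sorted(per_pe):
    print(f"PE k={k} executes {len(per_pe[k])} iterations")   # 16 each
```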
D. Scheduling

The execution of a tile of iterations on a PE requires that each operation Fi as defined by an equation Si is assigned a start time τi that satisfies all intra-iteration data dependencies. In the example, each iteration includes four operations that must be mapped and scheduled on functional units within a PE, with each operation Fi characterized by an execution time δi in clock cycles on the respective functional unit⁵. The execution of successive iterations within a tile then starts every initiation interval II cycles. Finally, the scheduler must determine the order in which the iterations are started such that all inter-iteration data dependencies are satisfied. This so-called loop schedule is described by a linear schedule vector λ∗ = (λj, λk), where the start time of each intra-tile iteration j ∈ J is given by λj j, and the start time of each inter-tile iteration (and PE) k ∈ K is defined by λk k. Further details on the scheduling algorithms are available in [33, 34]. Note that this approach even supports the scheduling of loop nests with loop bounds unknown at compile time [35, 36].

⁵ Note that TCPAs naturally support multicycle FU operations.

E. Register Binding

Finally, with each iteration and operation scheduled, registers for data dependencies can be allocated and bound accordingly. To handle different data dependencies, each PE has a register file with specialized register types for each dependency type, detailed in the following. The mapping of the input and output dependencies is explained later in Section III-G.

1) General Purpose Registers (RDs): Single-word registers for intra-iteration or inter-iteration intra-tile dependencies. In both cases, the lifetime, i.e., the number of cycles between write and read, must be shorter than the initiation interval.

2) Feedback Registers (FDs): Internal FIFOs for dependencies where the lifetime exceeds the initiation interval. These are typically inter-iteration intra-tile dependencies that are written multiple times before the first read, as visualized, for example, in Figure 4. FDs can also be used to map intra-iteration dependencies when needed.

3) Input/Output Registers (IDs/ODs): For inter-iteration dependencies across tiles (inter-tile dependencies), communication between PEs is enabled via a configurable interconnect. Output registers act as sending ports, pushing data through the interconnect to a target PE, where it is received into an input FIFO. This FIFO can then be accessed by the receiving PE by reading from a corresponding input register. The interconnect is dynamically configured to create communication channels for managing inter-tile dependencies.

4) Virtual Registers (VDs): As shown in Figure 4, an operation may generate intermediate results needed within the same iteration, the next iteration, and even in neighboring tiles, for which different register types must be used. Therefore, TCPAs contain virtual registers that allow a single instruction to broadcast a write to multiple target registers simultaneously.
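The scheduling and register-binding rules can be tied together in a few lines; the following sketch uses invented values for II, λj, and λk (they are not a schedule computed by the TCPA tools) to show how an iteration's start time follows from λ∗ = (λj, λk) and how a dependency's lifetime and tile-crossing property select the register type.

```python
# Sketch with invented values: start time of an iteration under a linear loop
# schedule, and the register type implied by one data dependency.
II = 2                      # assumed initiation interval
lambda_j = (16, 8, 1)       # hypothetical intra-tile schedule vector
lambda_k = (2, 4, 0)        # hypothetical inter-tile schedule vector

def start_time(j, k):
    # lambda_j * j + lambda_k * k, cf. the loop schedule lambda* = (lambda_j, lambda_k)
    return sum(a * b for a, b in zip(lambda_j, j)) + sum(a * b for a, b in zip(lambda_k, k))

def register_type(crosses_tile, lifetime):
    """Pick RD / FD / OD-ID for one dependency, following Section III-E."""
    if crosses_tile:
        return "OD/ID"                      # inter-tile: sent over the interconnect
    return "RD" if lifetime < II else "FD"  # intra-tile: FIFO once lifetime >= II

print(start_time(j=(0, 1, 2), k=(1, 0, 0)))              # 8*1 + 1*2 + 2*1 = 12
print(register_type(crosses_tile=False, lifetime=1))     # RD
print(register_type(crosses_tile=False, lifetime=4))     # FD
print(register_type(crosses_tile=True,  lifetime=1))     # OD/ID
```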

F. Code Generation

Due to the different condition spaces Ii, the operations that are executed in an iteration may not always be the same. Refer, again, to Figure 4. While each tile is similar, minor differences necessitate different programs for each PE, though in larger arrays, multiple PEs may share the same program. Thus, the compiler generates, for all possible combinations of operations within an iteration, a sequence of instructions that reflects the computed start times τi. Afterward, it uses branch instructions to implement an iteration sequencing that matches the loop schedule λj. It is even possible that the execution of iterations overlaps in time, for which the instruction sequences must also be folded. Importantly, PEs do not compute the control flow directly; instead, the branch instructions respond to control signals generated by the Global Controller (GC) as shown in Figure 2. These signals, calculated once and distributed via a control interconnect, are shared across all PEs.

G. I/O Buffer Allocation

While registers handle internal variables, input and output dependencies, represented by red arrows in Figure 4, need a different mapping. The address of each element of a multidimensional input/output variable x, stored as linearized data in external memory, is given by a storage layout sx and offset αx. This can be combined with the indexing of the variable to create an address translation mx with offset µx:

mx i + µx = sx (Qix,jx i − dix,jx) + αx   if x ∈ Xin
mx i + µx = sx (Pix i + fix) + αx         if x ∈ Xout

Note that, on purpose, these computations are not handled by the PEs but by dedicated address generators (AGs) configured to compute any given affine address pattern. Consider again Figure 2. Since the PEs lack direct memory access, only the border PEs have access to I/O buffers containing small memory banks holding input and output data. The AGs address these memory banks, after which data is forwarded to the PEs through input registers. LION [31] manages the data transfer between the TCPA and external memory, filling and clearing the I/O buffers in time as required by a given schedule vector λ∗.
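For illustration only (row-major layout, strides, and offsets are assumptions; the real AGs are configured hardware units, not Python code), an address generator evaluates exactly such an affine map per access:

```python
# Sketch of an affine address generator: address = m_x * i + mu_x, here shown
# for an input variable x with indexing Q*i - d, storage layout s_x (row-major
# strides) and base offset alpha_x. All concrete numbers are illustrative.
N = 4
Q = [[1, 0, 0],          # A[i0, i2] -> rows select (i0, i2) from i = (i0, i1, i2)
     [0, 0, 1]]
d = [0, 0]
s_x = [N, 1]             # row-major layout of an N x N matrix
alpha_x = 0              # base address of A in the I/O buffer

def address(i):
    idx = [sum(Q[r][c] * i[c] for c in range(3)) - d[r] for r in range(2)]
    return sum(s * v for s, v in zip(s_x, idx)) + alpha_x

print(address((1, 3, 2)))   # A[1, 2] -> 1 * N + 2 = 6
```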

H. Configuration Generation

The mapping of the PRA onto a TCPA is, finally, represented by a configuration—a binary file that programs the TCPA by loading the micro-programs of each FU and setting various configuration registers for, e.g., the interconnect, the AGs, the GC, or the LION [31]. This configuration file, loaded at runtime, enables the TCPA to independently execute multidimensional loop nests without external control.

I. Tools

Unlike the range of toolchains available for CGRAs, we are only aware of one toolchain for TCPAs. Figure 5 illustrates the TCPA Utilities for Running and Testing Loop Executions (TURTLE) toolchain. First, the user provides a project. This contains the loop specified as a PAULA program [37], the target architecture, a data generator, parameters such as loop bounds, and various mapping constraints. PAULA is a domain-specific programming language specifically designed to model PRAs (Section III-B). For example, Listing 1 shows the PAULA code implementing the PRA of the matrix multiplication from Figure 4. The target architecture is specified in an XML file that details, e.g., the FUs within each PE, including FU instruction memory, PE registers, port connections, and I/O buffer sizes.

Figure 5. Overview of the TURTLE toolchain for TCPAs.

program matrixmultiplication {
  // In-/Output variables
  variable A 2 in float; variable B 2 in float; variable C 2 out float;
  // Internal variables
  variable a 2 float; variable b 2 float; variable p 2 float; variable c 2 float;
  // Parameters with values assigned at instantiation
  parameter N;
  // Loop with iteration space in brackets
  par (i0 >= 0 and i0 < N and i1 >= 0 and i1 < N and i2 >= 0 and i2 < N) {
    // Read matrix A and propagate along i1
    a[i0, i1, i2] = A[i0, i2] if (i1 == 0);
    a[i0, i1, i2] = a[i0, i1 - 1, i2] if (i1 > 0);
    // Read matrix B and propagate along i0
    b[i0, i1, i2] = B[i2, i1] if (i0 == 0);
    b[i0, i1, i2] = b[i0 - 1, i1, i2] if (i0 > 0);
    // Compute partial product
    p[i0, i1, i2] = a[i0, i1, i2] * b[i0, i1, i2];
    // Accumulate along i2
    c[i0, i1, i2] = p[i0, i1, i2] if (i2 == 0);
    c[i0, i1, i2] = p[i0, i1, i2] + c[i0, i1, i2 - 1] if (i2 > 0);
    // Write matrix C
    C[i0, i1] = c[i0, i1, i2] if (i2 == N - 1);
  }
}

Listing 1. PAULA code [37] specifying the computation of a matrix multiplication.

Before compilation, the PAULA program can be transformed into equivalent Python code and, after execution, its results can be verified against reference values from the data generator. Next, the PAULA program is compiled into a so-called symbolic configuration [27, 36], primarily by generating a polyhedral syntax tree—a symbolic representation that specifies which operands are used in which operations within each FU across iterations. This requires computing a valid schedule and register binding, with the schedule achieved through symbolic modulo scheduling and regular RD registers bound using the left-edge algorithm. Compilation also involves partitioning the iteration space and identifying all data dependencies and required FD, ID, OD, and VD registers. This symbolic configuration is concretized during instantiation, where parameters such as problem size and PE count are set. Furthermore, configurations are generated for all AGs and the LION. Groups of PEs sharing the same FU programs are identified, and for each such so-called processor class, the instantiator folds the polyhedral syntax tree, generating specific programs for all FUs in each class. Lastly, the control flow is extracted from the generated programs and converted into a GC configuration, completing the mapping phase. TURTLE supports various targets given a concrete configuration.

For example, a cycle-accurate simulator can load the XML file, execute the loop, and verify results with the test data generator. For other targets, the XML file is converted into a binary file that can be loaded by either RTL testbenches or a TCPA driver for FPGA or ASIC targets. TURTLE also provides a highly generic VHDL codebase, from which concrete TCPA architectures can be generated and packaged into a dedicated IP core.

IV. QUALITATIVE EVALUATION

In this section, we conduct a qualitative evaluation of different features of the described compiler toolchains for CGRAs and TCPAs. This includes intuitiveness, robustness, correctness, scalability, flexibility, and limitations. An overview of the results is given in Table I.

Table I
QUALITATIVE FEATURES OF CGRA AND TCPA TOOLCHAINS.

Feature                         CGRA-Flow [13]  Morpher [14]  Pillars [15]  CGRA-ME [16]  TURTLE [27, 36]
Intuitiveness
• Graphical interface           ✓               ✗             ✗             ✗             ✗
• Commandline interface         ✓               ✓             ✓             ✓             ✓
• Commonly used language        ✓               ✓             ✗             ✓             ✗
Robustness
• No manual optimization        ✗               ✗             ✗             ✗             ✗
• Reliable mapping success      ✓               ✓             ✗             ✓             ✓
Correctness
• Simulation of mapping         ✓               ✓             ✓             ✗             ✓
• Simulation statistics         ✓               ✗             ✓             ✗             ✓
• Auto. test data generation    ✗               ✓             ✗             ✗             ✗
Scalability
• Independent of #Operations    ✗               ✗             ✗             ✗             ✗
• Independent of #Iterations    ✓               ✓             ✓             ✓             ✓
• Independent of #PEs           ✓               ✗             ✗             ✗             ✓
• Independent of problem size   ✓               ✓             ✓             ✓             ✓
Flexibility
• Generic #PE                   ✓               ✓             ✓             ✓             ✓
• Generic #FU per PE            ✗               ✓             ✓             ✓             ✓
• Generic interconnect          ✓               ✓             ✓             ✓             ✓
• Generic operation latency     ✗               ✓             ✓             ✓             ✓
• Generic hop length            ✗               ✓             ✓             ✓             ✓
• Generic memory size           ✓               ✓             ✓             ✓             ✓
Limitations
• Feature complete              ✓               ✓             ✗             ✓             ✓
• Register-aware                ✗               ✓             ✓             ✓             ✓

1) Intuitiveness: CGRA-Flow provides a GUI visualizing input, output, and intermediate results, while the other toolchains only offer a rather simple commandline interface. Furthermore, most CGRA toolchains accept loops written in C/C++, which are commonly used programming languages. On the other hand, developers of applications for TCPAs need to specify their loop nests in a domain-specific polyhedral programming language, PAULA, which a user has to learn first. Thus, TCPAs are less intuitive to use compared to CGRAs.

2) Robustness: One crucial aspect of the mapping approach is its robustness, i.e., the ability to always find a valid mapping, if possible. Here, we observed that all toolchains require manual optimizations of the code; otherwise, the mapping often fails. Still, most toolchains were able to find valid mappings reliably. Only Pillars fails consistently.

3) Correctness: We found no obvious errors in the mapping process; however, the DFG generator of CGRA-ME tends to produce erroneous DFGs at times. Lastly, all toolchains offer a functional or cycle-accurate validation through simulation.

4) Scalability: The mapping complexity of loop nests to CGRAs scales with the number of PEs and the number of nodes in the DFG. This is due to the fact that each leads to an increase of the search space of possible mappings, which is explored by the CGRA toolchains. Only for CGRA-Flow, an increase in the number of PEs and DFG nodes does not noticeably affect the compilation time, because the mapper only checks a single mapping per initiation interval II. Thus, CGRA-Flow often finds a mapping within seconds if the heuristic algorithm is used. Nevertheless, CGRA-Flow may still take minutes to find a mapping if a generated DFG has many nodes (hundreds) or a CGRA has many PEs (64 or more), which is similar to the mapping times observed for Pillars and CGRA-ME. Only Morpher takes significantly longer—somewhere from minutes to hours, depending on the node count in a DFG and the PE count in a CGRA. For larger CGRAs, i.e., 8 × 8 PEs, and DFGs with more than 100 nodes, practically no CGRA toolchain provided a mapping in an affordable time of less than 1 h. The obvious non-scalability of CGRA mapping approaches is a known challenge, also mentioned in [38, 11, 39]. In contrast, the runtime of the mapping of a multidimensional loop nest onto TCPAs increases neither with the number of PEs nor with the problem size, i.e., the loop bounds, of the given loop nest, as the complex mapping and scheduling steps are performed symbolically, i.e., with parameterized loop bounds and array size, during compilation [27, 36]. It only increases with the small number of equations (typically fewer than 10) within the PRA. Common to all toolchains is that the problem size of the loop does not generally increase the mapping time.

5) Flexibility: Both CGRA and TCPA architectures are flexibly parameterized architectures. Hence, all compiler toolchains allow the user to configure the target architecture in terms of, e.g., the number of PEs or the supported operations, but there are subtle differences. In CGRA-Flow, the target architecture can be configured intuitively but is rather limited compared to the other toolchains, as neither multiple FUs per PE nor multicycle operations nor multi-hop connections are supported. From an application perspective, all toolchains but Pillars support the automated extraction of loops. However, only TURTLE and CGRA-Flow are able to map entire multidimensional loop nests. The latter, however, is restricted to at most two nested loops. On the contrary, Morpher and CGRA-ME are limited to mapping only the innermost loop to the target CGRA. Note that any multidimensional loop can be flattened into a single inner loop; this does, however, require predication, which was not yet available in CGRA-ME.

6) Limitations: With the exception of Pillars, which does not include a DFG generator, all tested toolchains are feature complete, i.e., they are able to generate a complete mapping to the target architecture. Note, however, that CGRA-Flow's mapping does not consider any PE register mapping. As a result, CGRA-Flow assumes an infinite number of registers within each PE. Additionally, CGRA-Flow does not consider the data layout within the larger memory buffer, or rather, it always assumes that the memory address starts at zero.

A last limitation on mappability may be related to physical architecture limitations or constraints. Consider, e.g., the different register types of a TCPA that were introduced in Section III. Inter-iteration intra-tile dependencies within a loop are mapped to PE-local FIFO memories within the register file in each PE to be able to exploit data locality between multiple loop iterations, which CGRAs cannot do. This reuse of data may, however, introduce architectural constraints due to the required length of these FIFOs, which typically correlates with the tile size after partitioning. As a result, the problem size of a loop can become limited by the available FIFO memory in a given TCPA architecture. Note that many real-world problems can be decomposed into smaller subproblems that can be computed separately by individual calls to the accelerator [40]. In contrast, CGRAs cannot support any data locality at all, as they do not have local memory within the PEs. The only architectural constraint is the size of the peripheral memory around the array, as it must be large enough to store the entire input and output data. However, note that TCPAs do not share this restriction, because they may refill the I/O buffers during runtime [28, 31], which is not supported by any considered CGRA.

V. QUANTITATIVE EVALUATION

In this section, we evaluate different CGRA and TCPA processor array architectures quantitatively. Specifically, we compare the achievable performance, power, and area (PPA) of different architecture instances and mapping toolchains.

A. Performance

For performance evaluation, we picked five common loop benchmarks from the Polybench suite [41]. Each benchmark is a multidimensional loop nest that represents a typical workload in the domain of linear algebra. In the following, we briefly describe each benchmark mathematically, assuming that A, B, C, D are N × N matrices and x, x1, x2, y, y1, y2 are vectors of size N with xi ∈ x, yi ∈ y, and ai,j ∈ A.

• GEMM: D = A · B + C
• ATAX: y = A⊺ · (A · x)
• GESUMMV: y = A · x + B · x
• MVT: z1 = x1 + A · y1; z2 = x2 + A⊺ · y2
• TRISOLV: xi = (yi − Σ_{j=0}^{i−1} xj · aj,i + Σ_{j=i+1}^{N−1} xj · aj,i) / ai,i
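For reference, the first four kernels can be written as one-line NumPy expressions (our own sketch; the evaluated toolchains consume C/C++ or PAULA loop nests, not Python, and TRISOLV is omitted here because it is an element-wise recurrence rather than a single matrix expression):

```python
# NumPy reference versions of the benchmark kernels listed above (illustrative
# only; the evaluated toolchains take C/C++ or PAULA loop nests as input).
import numpy as np

N = 32
rng = np.random.default_rng(0)
A, B, C = (rng.integers(0, 10, (N, N)) for _ in range(3))
x, x1, x2, y1, y2 = (rng.integers(0, 10, N) for _ in range(5))

D  = A @ B + C                 # GEMM
y  = A.T @ (A @ x)             # ATAX
ys = A @ x + B @ x             # GESUMMV
z1 = x1 + A @ y1               # MVT (first equation)
z2 = x2 + A.T @ y2             # MVT (second equation)
print(D.shape, y.shape, ys.shape, z1.shape, z2.shape)
```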
First, we observed that several CGRA toolchains do not accept a multidimensional loop nest directly as input, but require flattening, i.e., the multidimensional loop nest is reduced into a single loop by unfolding the iterations of the outer loops. Furthermore, no considered CGRA toolchain unrolls a given loop automatically; thus, this transformation was done manually. Then, we mapped each benchmark kernel onto a 4 × 4 processor array using the four selected CGRA toolchains, CGRA-Flow [13], Morpher [14], CGRA-ME [16], and Pillars [15], together with TURTLE for TCPAs [27].

Table II
MAPPING RESULTS OF BENCHMARKS ONTO CGRAS AND TCPAS.

Toolchain   Optimization  Architecture    #Loops  #op.  II  #unused PE  max(#op. per PE)
GEMM
CGRA-Flow   -             classical CGRA  3       23    10  6           4
CGRA-Flow   flat          classical CGRA  3       28    6   5           4
CGRA-Flow   flat+unroll   classical CGRA  3       52    6   0           6
Morpher     flat          classical CGRA  3       47    9   4           9
Morpher     flat          HyCUBE          3       47    9   5           9
Morpher     flat+unroll   classical CGRA  3       80    8   2           8
Morpher     flat+unroll   HyCUBE          3       80    8   1           8
CGRA-ME     -             HyCUBE          1       23    1   5           1
Pillars     -             ADRES           1       23    1   5           1
TURTLE      -             TCPA            3       11    1   0           11
ATAX
CGRA-Flow   -             classical CGRA  2       30    13  3           6
CGRA-Flow   flat          classical CGRA  2       36    10  0           4
CGRA-Flow   flat+unroll   classical CGRA  2       87    25  0           8
Morpher     flat          classical CGRA  2       55    14  3           10
Morpher     flat          HyCUBE          2       55    10  1           8
Morpher     flat+unroll   classical CGRA  2       118   -   -           -
Morpher     flat+unroll   HyCUBE          2       118   14  0           14
CGRA-ME     -             HyCUBE          1       21    2   11          2
Pillars     -             ADRES           1       21    -   -           -
TURTLE      -             TCPA            2       12    3   0           12
GESUMMV
CGRA-Flow   -             classical CGRA  2       22    8   6           4
CGRA-Flow   flat          classical CGRA  2       25    5   3           4
CGRA-Flow   flat+unroll   classical CGRA  2       58    7   0           6
Morpher     flat          classical CGRA  2       41    6   4           6
Morpher     flat          HyCUBE          2       41    6   3           6
Morpher     flat+unroll   classical CGRA  2       86    -   -           -
Morpher     flat+unroll   HyCUBE          2       86    9   2           6
CGRA-ME     -             HyCUBE          1       28    7   10          3
Pillars     -             ADRES           1       28    -   -           -
TURTLE      -             TCPA            2       12    3   0           12
MVT
CGRA-Flow   -             classical CGRA  2       29    8   3           5
CGRA-Flow   flat          classical CGRA  2       31    5   2           4
CGRA-Flow   flat+unroll   classical CGRA  2       73    7   0           6
Morpher     flat          classical CGRA  2       49    7   3           7
Morpher     flat          HyCUBE          2       49    7   3           7
Morpher     flat+unroll   classical CGRA  2       106   9   0           9
Morpher     flat+unroll   HyCUBE          2       106   8   0           8
CGRA-ME     -             HyCUBE          1       21    2   11          2
Pillars     -             ADRES           1       21    -   -           -
TURTLE      -             TCPA            2       13    3   0           12
TRISOLV
CGRA-Flow   -             classical CGRA  3       27    10  3           4
CGRA-Flow   flat          classical CGRA  3       44    -   -           -
CGRA-Flow   flat+unroll   classical CGRA  3       138   -   -           -
Morpher     flat          classical CGRA  3       57    8   3           7
Morpher     flat          HyCUBE          3       57    7   3           4
Morpher     flat+unroll   classical CGRA  3       180   -   -           -
Morpher     flat+unroll   HyCUBE          3       180   -   -           -
CGRA-ME     -             HyCUBE          1       21    2   11          2
Pillars     -             ADRES           1       21    -   -           -
TURTLE      -             TCPA            3       11    6   0           11

Table II summarizes the mapping results. Whenever no mapping could be found, the corresponding entries are marked '-', while rows with #Loops = 1 indicate a successful mapping of only the innermost loop. CGRA-Flow is the only CGRA tool that directly supports multidimensional loops. However, we can observe that flattening should still be applied in all but one benchmark, as it favorably reduces the achievable initiation interval. Only in the case of TRISOLV, the flattened benchmark could not be mapped by CGRA-Flow. In this evaluation, we used two different target architectures for Morpher: a classical CGRA without multi-hop connections and HyCUBE, for which Morpher finds consistently better mappings. We targeted an ADRES-like [42] architecture for Pillars, and since Pillars does not come with its own DFG generator, we utilized the DFG from CGRA-ME. However, it could only find a valid mapping for the GEMM kernel, failing for all others. Without any direct support for multidimensional loops, the loop must be flattened into a single loop. This requires explicitly inserting conditional statements inside the loop body that update the outer loop indices. Moreover, CGRA-ME currently does not support any predication; hence, it only maps the innermost loop. This simplification, however, allows this tool to achieve the lowest initiation interval II among the CGRA toolchains. Yet, as discussed in Section II-B, the generation of the loop indices should introduce a RecMII of 3. This does not apply to CGRA-ME because it only maps the innermost loop and omits any loop-bound checks.
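Flattening, which had to be applied manually here, turns the outer loops into explicit index bookkeeping inside a single loop. The sketch below shows the transformation for a two-dimensional nest in Python for illustration only; the actual inputs are C/C++ kernels, and on a CGRA the index updates become the predicated Sel/Add/Cmp operations discussed in Section II-B.

```python
# Sketch of manual loop flattening: a 2D loop nest over (i, j) expressed as a
# single loop with explicit index updates, as required by several CGRA mappers.
N, M = 4, 5
acc_nested = sum(i * M + j for i in range(N) for j in range(M))

i = j = 0
acc_flat = 0
for _ in range(N * M):          # single flat loop
    acc_flat += i * M + j       # original loop body
    j += 1                      # conditional index bookkeeping that ends up
    if j == M:                  # as predicated operations on the CGRA
        j, i = 0, i + 1

assert acc_flat == acc_nested
print("flattened loop matches the nested version:", acc_flat)
```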

Figure 6. Latencies achieved for different benchmarks (GEMM, ATAX, GESUMMV, MVT, TRISOLV, TRSM) with varying input size, i.e., the matrix size, achieved on CGRAs and TCPAs with 4 × 4 PEs. Shown is each time the best result of the respective mapping tool, while for the TCPA, the latencies of the first and last PE to complete are shown separately.

[Figure 7: bar chart of normalized latency per benchmark (GEMM, ATAX, GESUMMV, MVT, TRISOLV, TRSM); series: CGRA-Flow [13], Morpher [14], CGRA-ME [16], Pillars [15], TURTLE [27, 36].]

Figure 7. Speedup of TURTLE-compiled loop nests compared to other CGRA frameworks for an input matrix size of 20 × 20 for GEMM and 32 × 32 for all other benchmarks and assuming an array size of 4 × 4 PEs.
Besides the achieved initiation interval, the table also shows the number of PEs that were not utilized and the maximum number of operations mapped to a PE. It was rarely possible to employ all the 16 available PEs. Moreover, the number of operations per PE indicates that even the used PEs are underutilized. For example, with II = 10 and a maximum of 4 operations per PE, the most active PE will only execute 4 operations within a window of 10 cycles. TURTLE can map multidimensional loop nests directly on all available PEs and consistently achieves the best II among all toolchains, excluding CGRA-ME and Pillars due to the reasons described above.

The achieved II reflects the mapping quality, but to assess the actual performance, we also investigate for each benchmark the latencies for different input matrix sizes for the best CGRA mappings achieved by Morpher and CGRA-Flow together with the mapping found for the TCPA. The achievable latencies are shown in Figure 6, whereas the latencies of the first and last PE in the TCPA to complete are shown separately, because a TCPA could allow a next call of the accelerator to start already after the latency of the first PE and not wait until the last PE is completed. Additionally, Figure 7 shows the latency of CGRA schedules for each benchmark normalized to a TURTLE-generated loop nest schedule for a TCPA target of equal size (4 × 4 PEs). For the GEMM kernel, we observe that CGRA-Flow beats Morpher but is outclassed by the TCPA, which is 19 × faster. This is the result of each of the 16 PEs starting a new iteration every single cycle, while in the CGRA every 6 cycles only 4 unrolled iterations are started. Similarly, the TCPA outperforms the CGRAs for ATAX and GESUMMV by huge factors of 4 × and 3 ×, respectively, and by 2 × for both MVT and TRISOLV. Here, we see a gap between the latency of the first and last PE. This is due to the fact that both MVT and TRISOLV are only two-dimensional algorithms that, mapped on a two-dimensional array of PEs, cannot be executed entirely in parallel. Thus, all PEs are used, but the first PEs already terminate much earlier. To investigate this further, an additional experiment with a TRSM kernel, a three-dimensional loop that executes TRISOLV in the two innermost loops, was performed. This utilizes the PEs better and is 8 × faster than the next best CGRA mapping found by Morpher, as indicated by the nearly identical latencies of the first and last PE. However, considering the fact that an application might invoke the same kernel execution multiple times in a row, as shown in [40], the latency to complete one invocation is not as important as the earliest time at which the next invocation can be started. This time is the latency of the first PE shown in Figure 6; thus, whenever multiple independent invocations of the same loop program are considered, the TCPA outperforms the CGRAs even more. Such overlapped execution is not unique to TCPAs, but was not available on the considered CGRAs.

In summary, the performance evaluation shows that a TCPA outperforms a CGRA with the same number of PEs. However, the PE architectures of CGRAs and TCPAs differ significantly. This necessitates an analysis of the hardware costs to put the achieved performance into perspective.
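The benefit of overlapped invocations can be estimated with a simple back-of-the-envelope model (our own illustration with placeholder numbers, not measured values from the evaluation), assuming that a new invocation may start as soon as the first PE has finished and that consecutive invocations do not conflict:

#include <stdio.h>

/* Estimated makespan of k back-to-back invocations of the same kernel:
 * the first invocation costs the latency of the last PE, every further
 * one only adds the latency of the first PE (assumption, see text). */
static unsigned long makespan(unsigned long lat_first, unsigned long lat_last,
                              unsigned long k) {
  if (k == 0) return 0;
  return (k - 1) * lat_first + lat_last;
}

int main(void) {
  unsigned long lat_first = 1000, lat_last = 1800;  /* placeholder cycle counts */
  printf("1 invocation:  %lu cycles\n", makespan(lat_first, lat_last, 1));
  printf("8 invocations: %lu cycles\n", makespan(lat_first, lat_last, 8));
  return 0;
}

Under this model, the effective cost per invocation approaches the first-PE latency for large batch sizes, which is why the gap between the first and last PE matters less for repeated kernel execution.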

Table III
RESOURCE UTILIZATION OF A GENERIC 4 × 4 CGRA AND A 4 × 4 TCPA.

                                         Insts.    LUTs      FFs    BRAMs  DSPs
  CGRA
  4 × 4 CGRA                                1     35 250   32 552     20    48
  Avg. Processing element (PE)             16       2202     2034      1     3
  Avg. ALU (without division)              16        505      102      0     3
  Avg. Divider                             16       1293     1629      0     0
  Avg. Instruction memory and decoder      16        400       16      1     0
  Scratchpad memory (multi bank)            1         37        2      4     0
  TCPA
  4 × 4 TCPA                                1    220 524  205 774    656    48
  Avg. Processing element (PE)             16     11 091     8563     39     3
  Avg. Functional units                    16       2967     3380      7     3
  Avg. Data register file                  16       6000     2947      2     0
  Avg. Control register file               16        645      711     30     0
  Avg. Interconnect                        16        712      683      0     0
  Avg. I/O buffer incl. AGs                 4       6523   11 197      8     0
  Avg. Address Generator                   32        483      740      0     0
  Global controller                         1       9741   17 861      0     0
  Loop I/O controller                       1       5738     4277      4     0
B. Area

In the following, we also determine and compare the area cost of both architectures for an equal size of the processor array (number of PEs). For this purpose, we synthesize a generic 4 × 4 PE CGRA as well as a 4 × 4 TCPA to an FPGA. Later, we also compare area margins of actual chip designs.

1) FPGA Resource Requirements: We obtain the resource utilization directly by synthesizing an RTL description for an AMD/Xilinx Ultrascale+ FPGA target using Vivado. Most toolchains already provide synthesizable RTL, but for a fair comparison we used in this work for the TCPA the RTL provided by TURTLE and developed for the CGRAs a generic architecture in VHDL as shown in Figure 1 (right). This generic CGRA implements the bare minimum architecture required to execute the loop mappings of the previously evaluated benchmarks in Section V-A. Moreover, it resembles the HyCube architecture [10], i. e., it features a single-cycle ALU per PE, a single input and output channel to each neighbor PE, no local register file but 10 multiplexed registers along the data path, and an instruction memory containing up to 16 cycle-by-cycle configurations. This configuration is necessary to map all of the previously introduced benchmarks. As shown in Figure 1, only the leftmost PEs have access to a scratchpad memory that uses a multi-bank approach where each left border PE has its own distinct 4 kB memory bank. The single ALU in each PE supports addition, multiplication, and division on 32 bit integers. Besides these arithmetic operations, the CGRA also requires each PE to perform basic logic (and, or, etc.), comparison, and load/store operations. All operations are implemented as single-cycle operations except the division, which takes 16 cycles. Similarly, we chose the TCPA parameters such that it is able to execute the previously discussed benchmarks accordingly. As a result, each PE contains the following functional units: two adders, one multiplier, one divider, and three copy units for moving data between registers. Each unit possesses its own instruction pipeline and runs a separate program. These programs are stored in FU-local instruction memories of the following sizes: 78×47 bit and 25×43 bit for the adders, 51×45 bit for the multiplier, 29×43 bit for the divider, and 20×43 bit for each copy unit. The actual arithmetic units, i. e., adder, multiplier, and divider, are the same as in the generic CGRA. Furthermore, the data register file in each PE is chosen to contain in total 32 addressable registers (8 general-purpose, 8 feedback, 8 input, and 8 output registers) that can be written by all FUs simultaneously. Note that the feedback and input registers are FIFOs, with a combined total capacity of 280×32 bit per PE. The array is surrounded by the I/O buffers containing a total of 32 512 B banks and 32 AGs that are configured, similar to the other peripherals (GC, LION), to control the schedule of up to 4 loop dimensions. Finally, each PE is configured to have 8 channels to each of its neighbors. With these assumptions, we synthesized and compared CGRA and TCPA architectures achieving similar clock frequencies of 200 MHz to 250 MHz.

Table III shows the resource requirements of the previously discussed generic CGRA architecture implementing a 4×4 array of PEs. As can be observed, the costs of the scratchpad memory next to the array are negligible. Furthermore, the average costs of each PE and its main components are listed. 58 % of LUTs and 80 % of FFs are used for the divider, while the remaining resources are equally distributed among the ALU and the instruction pipeline including memory and decoder. Table III also shows the costs of a 4 × 4 TCPA and its components. We observe that 80.47 % of LUTs and 66.58 % of FFs are used within the PE array, making each TCPA PE approximately 5 times more costly than that of the generic CGRA. Most resources within a PE are needed to implement the data register file (6000 LUTs, 2947 FFs) to keep intermediate local data created in iterations and reused in each PE at later iterations due to the tile-based assignment of iteration spaces to PEs (iteration-centric mapping). The other dominant area requirements come from the FUs (2967 LUTs, 3380 FFs). These costs arise from the implementation of virtual registers that enable all 7 FUs to simultaneously write to any register within the data register file. For the control generation, each PE also requires a control register file whose internal registers could be mapped to BRAM. Additionally, the GC that feeds these control register files with control signals comes at a cost of 9741 LUTs and 17 861 FFs, but is required only once for the entire processor array. Finally, the I/O buffers at each border are rather cheap (6523 LUTs, 11 197 FFs), containing 8 AGs each requiring 483 LUTs and 740 FFs. Lastly, the LION used to fill and drain the I/O buffers costs about half of a PE, i. e., 5738 LUTs and 4277 FFs. In summary, this 4 × 4 TCPA architecture requires 6.26 × the resources of the simple generic 4 × 4 CGRA architecture.

2) ASIC Area: For TCPAs, [30] presents an actual 8 × 8, 10 mm2 chip manufactured in 22 nm. Meanwhile, there exists also work on CGRAs, with [12] presenting a CGRA with 16 PEs on 4.7 mm2 in 40 nm and [43] showing a 16 nm chip containing 384 PEs within an area of 20.1 mm2. We may compare these chips by normalizing the chip area to the PE count and technology size. We use a scaling factor of 1.89 and 6.25 for 22 nm and 40 nm, respectively. This results in a normalized area per PE of 0.083 mm2 for [30] (TCPA), 0.047 mm2 for [12] (CGRA), and 0.052 mm2 for [43] (CGRA). Note that the supported number format of the FUs inside each PE is different in each of those chips. In particular, the FUs of the TCPA architecture in [30] support 32 bit floating point, while the CGRAs in [12, 43] only support 32 bit fixed point and both 16 bit bfloat and 16 bit integer, respectively.
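The normalization above can be reproduced with a few lines of C (a sketch of the arithmetic only; the scaling factors 1.89 and 6.25 are those stated in the text, and we assume the 16 nm chip [43] needs no scaling, i. e., a factor of 1.0):

#include <stdio.h>

/* Normalized area per PE: chip area divided by the PE count and by a
 * technology scaling factor relative to 16 nm. */
static double area_per_pe(double area_mm2, int num_pes, double scale) {
  return area_mm2 / (double)num_pes / scale;
}

int main(void) {
  printf("TCPA [30]: %.3f mm2/PE\n", area_per_pe(10.0, 64, 1.89));  /* ~0.083 */
  printf("CGRA [12]: %.3f mm2/PE\n", area_per_pe(4.7, 16, 6.25));   /* ~0.047 */
  printf("CGRA [43]: %.3f mm2/PE\n", area_per_pe(20.1, 384, 1.0));  /* ~0.052 */
  return 0;
}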

[Figure 8: bar chart; y-axis: Normalized Latency; benchmarks: GEMM, ATAX, GESUMMV, MVT; series: CGRA-Flow [13] and Morpher [14] with 8 × and 16 × unroll on 4 × 4 and 8 × 8 CGRAs, TURTLE [27, 36] on 4 × 4 and 8 × 8 TCPAs.]

Figure 8. Speedup of TURTLE-compiled loop nests compared to CGRA-Flow [13] and Morpher [14] for an input matrix size of 20 × 20 for GEMM and 32 × 32 for all other benchmarks assuming different unroll levels and number of PEs. Note that settings, where the respective tool was not able to provide any feasible schedule, are marked in stripe color showing only a theoretical lower bound.
C. Power

Analogously to the area analysis in Section V-B, we investigate for both CGRAs and TCPAs the power consumption on FPGAs and summarize results from the literature for ASICs.

1) FPGA Power: The vectorless power analyzer of Vivado reported for the two 4 × 4 architectures presented above a power consumption of 3.313 W for the TCPA and 1.957 W for the CGRA. Surprisingly, the TCPA design requiring 6.26 × the resources only consumes 1.69 × the power.

2) ASIC Power: The authors of the previously discussed published chip designs for CGRAs and TCPAs also report power and energy efficiency. The 8 × 8 floating-point TCPA in [30] consumes 7.5 W at peak (117 mW per PE), while the CGRA in [12] has a peak power consumption of 102 mW in total and 6.375 mW per PE. [43] only reports a peak energy efficiency of 538 GOPS/W, but not the actual power. [12] (CGRA) and [30] (TCPA) show a peak energy efficiency of 26.4 GOPS/W and 270 GFLOPS/W, respectively.
VI. DISCUSSION

The evaluation in Section V assumed the same number of PEs for both CGRA and TCPA. This raises the question of scaling up a CGRA in terms of the number of PEs to achieve the same area cost as a TCPA and compare the achievable performance. A major advantage of processor arrays is that their area, power, and theoretical performance scale linearly with the number of PEs, since the architecture of a PE is the same for larger arrays and the cost of additional peripheral controllers is typically very small, as shown in Section V-B. We conducted an additional experiment to assess the performance gain from larger arrays, shown in Figure 8. Similar to Figure 7, it shows the speedup between TCPA and CGRA toolchains for different benchmarks. In this case, however, the number of PEs and the unroll factor are varied. Yet, since no CGRA toolchain could find an actual mapping for the configurations shown, the figure only shows a theoretical lower-bound latency computed according to the ResMII and the RecMII after DFG generation. Therefore, it only shows the best latency the mapping tool could have achieved, but no actual mapping could be determined. In contrast, we observe that TURTLE found a mapping for both 4 × 4 and 8 × 8 arrays. A larger array only results in smaller tiles during partitioning, while the mapping complexity itself does not increase. Note also that larger TCPAs naturally support larger problems, since the previously discussed problem size limit depends only on the tile size, i. e., the number of iterations per PE. However, the performance gain from 16 to 64 PEs is not 4 ×. This is due to the fact that the wave-like starting and stopping of a larger array combined with smaller tiles leads to a worse overall PE utilization, i. e., the difference between the latency of the first and last PE to complete increases. Note, however, that the minimum time at which the next problem can be started on the same array is reduced by 4 × on the larger array. Hence, TCPAs benefit from larger arrays but require either an increase in problem size and, thus, tile size, or a batch processing use case where the same kernel is repeatedly executed many times.

For CGRAs, however, increasing the number of PEs is less effective. We observed that without increasing the unroll factor, more PEs only mitigate the ResMII, but do not reduce the RecMII. Thus, in many settings, the latency difference between a 4 × 4 array and an 8 × 8 array, i. e., 4 × more PEs, is often zero. Even in settings where the difference is significant, only a speedup of 2 × to 3 × should be theoretically possible, while 4 × the hardware resources are required. A significant performance gain is only possible by increasing the number of nodes within the DFG by further unrolling the loop. Here, we observe that still no evaluated CGRA approach is able to find lower latency mappings than found for the TCPA for the GEMM and ATAX benchmarks. Only for the GESUMMV and MVT benchmarks can an 8 × 8 CGRA theoretically outperform a 4 × 4 TCPA if the unrolling factor is also chosen sufficiently large. However, the DFGs in these cases become large with up to 330 nodes, which makes the mapping a massively complex challenge that none of the analyzed tools could handle. In theory, it should be possible to map a larger DFG onto a larger array, but this assumes that the nodes can be evenly distributed across the available resources. However, this may not be possible due to increased routing congestion around the border PEs. In the classic CGRA architecture, only the left border PEs have access to memory. Thus, in a 4 × 4 array, only 4/16 PEs can issue load and store operations. In larger arrays, this factor decreases further, e. g., 8/64 in an 8 × 8 array. Since unrolling a loop causes the number of load and store operations to increase linearly, routing contention around the border PEs will worsen significantly. Additional memory banks around all four sides of the array mitigate this problem to some extent, but when the array becomes too large, this will not suffice.
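This pressure on the border PEs can be quantified with a small, hypothetical model (ours, not produced by any of the evaluated toolchains; the three memory operations per iteration are a placeholder value): if the unroll factor is chosen proportional to the number of PEs so that every PE receives work, the memory traffic that each left-border PE must route grows linearly with the array dimension n.

#include <stdio.h>

/* Hypothetical estimate of memory operations per left-border PE of an
 * n x n CGRA when n*n iterations are unrolled (one per PE) and each
 * iteration issues mem_ops_per_iter loads/stores. */
static double mem_ops_per_border_pe(int n, int mem_ops_per_iter) {
  int unroll = n * n;      /* one unrolled iteration per PE */
  int border_pes = n;      /* only the left column owns memory ports */
  return (double)(unroll * mem_ops_per_iter) / (double)border_pes;
}

int main(void) {
  printf("4x4 CGRA: %.0f mem ops per border PE\n", mem_ops_per_border_pe(4, 3)); /* 12 */
  printf("8x8 CGRA: %.0f mem ops per border PE\n", mem_ops_per_border_pe(8, 3)); /* 24 */
  return 0;
}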
A better solution is the approach used in TCPAs. Instead of reading and writing the inputs and outputs of all unrolled iterations within the DFG in every execution, the data must be kept locally within the array and reused within different unrolled iterations of the same DFG. In summary, larger CGRAs may only simplify the mapping without a significant performance gain to justify the increase in hardware cost. Moreover, even with the poor scaling properties of CGRAs, multiple parallel CGRAs could have the same area cost as a TCPA, assuming there is enough parallelism at the kernel level. Note, however, that in this case the TCPA could also exploit its ability to overlap multiple kernel executions, further outperforming CGRAs and their operation-centric mapping approaches, as shown in Section V-A.

VII. CONCLUSION

In this paper, we analyzed two prominent classes of architectures for accelerating nested loops on processor arrays: Coarse-Grained Reconfigurable Arrays (CGRAs) with an operation-centric mapping approach, and Tightly-Coupled Processor Arrays (TCPAs) with an iteration-centric mapping approach. While the toolchains for CGRAs map operations from a DFG to PEs, a TCPA mapper tiles a given n-dimensional iteration space into as many congruent tiles as available PEs and schedules these iterations both globally and locally to best exploit multiple levels of parallelism and data locality. This study provides a comprehensive qualitative and quantitative comparison of four CGRA toolchains and one TCPA toolchain.

The qualitative evaluation shows that CGRAs may be more intuitive to use, especially for developers familiar with C/C++, but their mapping process often struggles with scalability, requiring significant time for large arrays or complex problems. In contrast, compiling loop nests to TCPAs is independent of the size of the iteration space or the size of the target processor array. Optimal schedules can even be determined independent of the size of the loop nest bounds, as shown in [35, 36].

The quantitative evaluation showed significant differences in performance, power, and area (PPA) due to the different mapping methods of these architectures. In TCPAs, the PEs must be able to execute a full tile of iterations. This requires a more complex PE architecture with local memory in the form of a register file, multiple FUs to locally execute all operations within the loop body, and multiple interconnect channels to neighboring PEs. Special feedback registers also distinguish a TCPA PE from other PE architectures. These support the efficient reuse of data computed in earlier iterations within a tile, without the need to communicate intermediate data to external memories. Obviously, this makes the PE architecture used in CGRAs (basically an ALU, a crossbar, some registers, and an instruction memory with a decoder) smaller because it only has to execute single operations, not entire loop bodies, and not multiple iterations. In addition, TCPAs are designed to offload all control overhead, such as loop counter incrementing, memory access address generation, loop bound tests, etc., to additional control units outside the core array. But despite an approximately 5 × higher resource count of a TCPA when synthesized on an FPGA, the resulting power consumption is only 1.69 × higher than that of a generic CGRA. More importantly, this additional cost translates into significant performance gains. In our experiments with five common benchmarks, TCPAs consistently outperformed CGRAs, achieving up to a 19 × speedup on the GEMM benchmark. This is due to several factors: In many CGRA mappings, several PEs remain completely inactive, i. e., no operations are mapped to them, while the maximum number of operations mapped to a PE further indicates that these PEs are not well utilized, one reason being a lack of routing opportunities. For example, we observe that the HyCube architecture, which increases routing capability by introducing multi-hop connections, consistently outperforms the classic CGRAs without multi-hop connections. In addition, the PEs of a CGRA must perform all control flow and address computation, often contributing to more than 70 % of the operations in common loop programs, as illustrated in Figure 1. Also, the generated DFGs often contain fatal throughput-limiting cyclic dependencies. We believe that in the future, the pure operation-centric approach used in CGRAs will be combined with some iteration-centric methods, e. g., extensions similar to [44] that separate control flow from data flow. Moreover, for TCPAs, automatic single-assignment generation from imperative for loop descriptions, e. g., [45], may make these architectures more attractive from the standpoint of programmability and usability.

Acknowledgment: This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project number 146371743 – TRR 89: Invasive Computing.

REFERENCES

[1] N. Anantharajaiah et al. Invasive Computing. Ed. by J. Teich, J. Henkel, and A. Herkersdorf. 2022. DOI: 10.25593/978-3-96147-571-1.
[2] J. Howard et al. "A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling". In: JSSC 46.1 (2011), pp. 173–183. DOI: 10.1109/JSSC.2010.2079450.
[3] B. D. de Dinechin et al. "A Clustered Manycore Processor Architecture for Embedded and Accelerated Applications". In: HPEC. 2013, pp. 1–6. DOI: 10.1109/HPEC.2013.6670342.
[4] M. Wijtvliet, L. Waeijen, and H. Corporaal. "Coarse grained reconfigurable architectures in the past 25 years: Overview and classification". In: SAMOS. 2016, pp. 235–244. DOI: 10.1109/SAMOS.2016.7818353.
[5] L. Liu et al. "A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications". In: Comput. Surv. 52.6 (2019), 118:1–118:39. DOI: 10.1145/3357375.
[6] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich. "A Highly Parameterizable Parallel Processor Array Architecture". In: FPT. 2006, pp. 105–112. DOI: 10.1109/FPT.2006.270293.
[7] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche. "Invasive Tightly-Coupled Processor Arrays: A Domain-Specific Architecture/Compiler Co-Design Approach". In: TECS 13.4s (2014), 133:1–133:29. DOI: 10.1145/2584660.
[8] J. Teich, M. Brand, F. Hannig, C. Heidorn, D. Walter, and M. Witterauf. "Invasive Tightly-Coupled Processor Arrays". In: Invasive Computing. FAU University Press, 2022. DOI: 10.25593/978-3-96147-571-1.
[9] S. Kim, Y.-H. Park, J. Kim, M. Kim, W. Lee, and S. Lee. "Flexible video processing platform for 8K UHD TV". In: HCS. 2015, pp. 1–1. DOI: 10.1109/HOTCHIPS.2015.7477475.
[10] M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh. "HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect". In: DAC. 2017, pp. 1–6. DOI: 10.1145/3061639.3062262.
[11] Z. Li, D. Wijerathne, and T. Mitra. "Coarse-Grained Reconfigurable Array (CGRA)". In: Handbook of Computer Architecture. 2022. DOI: 10.1007/978-981-15-6401-7_50-1.
[12] B. Wang, M. Karunarathne, A. K. Mohite, T. Mitra, and L. Peh. "HyCUBE: A 0.9V 26.4 MOPS/mW, 290 pJ/op, Power Efficient Accelerator for IoT Applications". In: A-SSCC. 2019, pp. 133–136. DOI: 10.1109/A-SSCC47793.2019.9056954.
[13] C. Tan, C. Xie, A. Li, K. J. Barker, and A. Tumeo. "OpenCGRA: An open-source unified framework for modeling, testing, and evaluating CGRAs". In: ICCD. 2020, pp. 381–388. DOI: 10.1109/ICCD50377.2020.00070.
[14] D. Wijerathne, Z. Li, M. Karunaratne, L.-S. Peh, and T. Mitra. "Morpher: An Open-Source Integrated Compilation and Simulation Framework for CGRA". In: WOSET (2022).
[15] Y. Guo and G. Luo. "Pillars: An Integrated CGRA Design Framework". In: WOSET (2020).
[16] O. Ragheb et al. "CGRA-ME 2.0: A Research Framework for Next-Generation CGRA Architectures and CAD". In: IPDPS. 2024, pp. 642–649. DOI: 10.1109/IPDPSW63119.2024.00124.
[17] C. Lattner and V. Adve. "LLVM: a compilation framework for lifelong program analysis & transformation". In: CGO. 2004, pp. 75–86. DOI: 10.1109/CGO.2004.1281665.
[18] M. Hamzeh, A. Shrivastava, and S. Vrudhula. "Branch-aware loop mapping on CGRAs". In: DAC. 2014, pp. 1–6. DOI: 10.1145/2593069.2593100.
[19] L. McMurchie and C. Ebeling. "PathFinder: A Negotiation-based Performance-driven Router for FPGAs". In: ISFPGA. 1995, pp. 111–117. DOI: 10.1145/201310.201328.
[20] S. Kirkpatrick, C. Gelatt, and M. Vecchi. "Optimization by Simulated Annealing". In: Science 220 (1983), pp. 671–80. DOI: 10.1126/science.220.4598.671.
[21] Z. Li, D. Wu, D. Wijerathne, and T. Mitra. "LISA: Graph Neural Network Based Portable Mapping on Spatial Accelerators". In: HPCA. 2022, pp. 444–459. DOI: 10.1109/HPCA53966.2022.00040.
[22] D. Wijerathne, Z. Li, and T. Mitra. Accelerating Edge AI with Morpher: An Integrated Design, Compilation and Simulation Framework for CGRAs. Sept. 2023. DOI: 10.48550/arXiv.2309.06127.
[23] S. A. Chin et al. "CGRA-ME: A unified framework for CGRA modelling and exploration". In: ASAP. 2017, pp. 184–189. DOI: 10.1109/ASAP.2017.7995277.
[24] S. Li and C. Ebeling. "QuickRoute: a fast routing algorithm for pipelined architectures". In: FPT. 2004, pp. 73–80. DOI: 10.1109/FPT.2004.1393253.
[25] J. Teich. "A Compiler for Application Specific Processor Arrays". Dissertation. Saarland University, Germany, 1993. 230 pp. Shaker Verlag.
[26] J. Teich and L. Thiele. "Partitioning of processor arrays: a piecewise regular approach". In: Integr. 14.3 (1993), pp. 297–332. DOI: 10.1016/0167-9260(93)90013-3.
[27] M. Witterauf, D. Walter, F. Hannig, and J. Teich. "Symbolic Loop Compilation for Tightly Coupled Processor Arrays". In: TECS 20.5 (2021), 49:1–49:31. DOI: 10.1145/3466897.
[28] D. Walter, M. Witterauf, and J. Teich. "Real-time Scheduling of I/O Transfers for Massively Parallel Processor Arrays". In: MEMOCODE. 2020, pp. 1–11. DOI: 10.1109/MEMOCODE51338.2020.9315179.
[29] M. Brand, F. Hannig, A. Tanase, and J. Teich. "Orthogonal Instruction Processing: An Alternative to Lightweight VLIW Processors". In: MCSoC. 2017, pp. 5–12. DOI: 10.1109/MCSoC.2017.17.
[30] D. Walter, M. Brand, C. Heidorn, M. Witterauf, F. Hannig, and J. Teich. "ALPACA: An Accelerator Chip for Nested Loop Programs". In: ISCAS. 2024, pp. 1–5. DOI: 10.1109/ISCAS58744.2024.10558549.
[31] D. Walter and J. Teich. "LION: Real-Time I/O Transfer Control for Massively Parallel Processor Arrays". In: MEMOCODE. 2021, pp. 32–43. DOI: 10.1145/3487212.3487349.
[32] H. Nelis and E. Deprettere. "Automatic design and partitioning of systolic/wavefront arrays for VLSI". In: Circuits, Systems and Signal Processing 7.2 (1988), pp. 235–252.
[33] J. Teich, L. Thiele, and L. Z. Zhang. "Partitioning Processor Arrays under Resource Constraints". In: J. VLSI Signal Process. 17.1 (1997), pp. 5–20. DOI: 10.1023/A:1007935215591.
[34] J. Teich and L. Thiele. "Exact Partitioning of Affine Dependence Algorithms". In: SAMOS. 2002, pp. 135–153. DOI: 10.1007/3-540-45874-3_8.
[35] M. Witterauf, A. Tanase, F. Hannig, and J. Teich. "Modulo Scheduling of Symbolically Tiled Loops for Tightly Coupled Processor Arrays". In: ASAP. 2016, pp. 58–66. DOI: 10.1109/ASAP.2016.7760773.
[36] A. Tanase, M. Witterauf, J. Teich, and F. Hannig. "Symbolic Multi-Level Loop Mapping of Loop Programs for Massively Parallel Processor Arrays". In: TECS 17.2 (2018), pp. 1–27. DOI: 10.1145/3092952.
[37] F. Hannig. "Scheduling Techniques for High-Throughput Loop Accelerators". Dissertation. Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Aug. 2009.
[38] A. Podobas, K. Sano, and S. Matsuoka. "A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective". In: Access 8 (2020), pp. 146719–146743. DOI: 10.1109/ACCESS.2020.3012084.
[39] D. Wijerathne, Z. Li, A. Pathania, T. Mitra, and L. Thiele. "HiMap: Fast and Scalable High-Quality Mapping on CGRA via Hierarchical Abstraction". In: TCAD 41.10 (2022), pp. 3290–3303. DOI: 10.1109/TCAD.2021.3132551.
[40] D. Walter, T. Adamtschuk, F. Hannig, and J. Teich. "Analysis and Optimization of Block LU Decomposition for Execution on Tightly Coupled Processor Arrays". In: ASAP. 2024, pp. 97–106. DOI: 10.1109/ASAP61560.2024.00029.
[41] L.-N. Pouchet. PolyBench/C: The Polyhedral Benchmark suite. https://web.cs.ucla.edu/~pouchet/software/polybench/. 2012.
[42] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins. "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix". In: FPL. 2003, pp. 61–70. DOI: 10.1007/978-3-540-45234-8_7.
[43] K. Feng et al. "Amber: Coarse-Grained Reconfigurable Array-Based SoC for Dense Linear Algebra Acceleration". In: HCS. 2022, pp. 1–30. DOI: 10.1109/HCS55958.2022.9895616.
[44] K. Koul et al. "AHA: An Agile Approach to the Design of Coarse-Grained Reconfigurable Accelerators and Compilers". In: TECS 22.2 (2023), 35:1–35:34. DOI: 10.1145/3534933.
[45] P. Feautrier. "Automatic Parallelization in the Polytope Model". In: The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications. 1996, pp. 79–103. DOI: 10.1007/3-540-61736-1_44.
