On Dataflow Computing With OpenSPL
(Draft v1.0-79-g838c34b)
John Winans
[email protected]
No parts of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior
written permission of the publisher.
The authors and publisher have made every effort in the preparation of this book to ensure the accuracy
of the information. However, the information contained in this book is provided without warranty, either
express or implied. Neither the authors, the publisher nor its distributors will be held liable for any
damages caused or alleged to be caused directly or indirectly by this book.
The authors and publisher have made every effort to provide trademark information about all the
companies and products mentioned in this book by the appropriate use of capitals. However, the
accuracy of this information cannot be guaranteed.
Altera® and Stratix® are either registered trademarks or trademarks of Altera Corporation in the
United States and/or other countries.
ARM® , ARM Powered® , AMBA® , ARMulator® , Cortex® , Jazelle® , Multi–ICE® , StrongARM® ,
Thumb® , and TrustZone® are the registered trademarks of ARM Limited in the EU and other countries.
AHB™, APB™, ARM9T™, ARM9TDMI™, ARM922T™, ARM1022E™, ASB™, ATB™, AXI™, CoreSight™, ETM9™, ETM10™, ModelGen™, MPCore™, NEON™, PrimeCell™, and VFP10™ are the trademarks of ARM Limited in the EU and other countries.
Microsoft® and Windows® are either registered trademarks or trademarks of Microsoft Corporation in
the United States and/or other countries.
Apple®, Macintosh®, Mac OS® and Safari™ are either registered trademarks or trademarks of Apple Computer, Inc. in the United States and/or other countries.
Oracle® , VirtualBox® and Java are registered trademarks of Oracle and/or its affiliates.
Adobe, the Adobe logo, Acrobat, the Acrobat logo, Distiller, PostScript, and the PostScript logo are
trademarks or registered trademarks of Adobe Systems Incorporated in the U.S. and/or other countries.
Intel, Intel Core, and Xeon are trademarks of Intel Corp. in the U.S. and other countries.
OpenGL® is a registered trademark of Silicon Graphics, Inc.
UNIX® is a registered trademark of The Open Group.
OpenSPL is a trademark of Maxeler Technologies Limited.
X Window System is a trademark of the X Consortium, Inc.
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Other company and product names mentioned herein are trademarks of their respective owners. Mention
of third-party products is for informational purposes only and constitutes neither an endorsement nor a
recommendation.
Page ii of 81 ~/NIU/courses/532/2015-fa/book/openspl/./book.tex
[email protected] 2015-10-20 15:01:52 -0500 v1.0-79-g838c34b
Contents
1 Introduction 1
1.1 The FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 HDL Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 The Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Some Words From The Marketing Department . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Dataflow Computing 5
2.1 An Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Problem Solving With a CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Problem Solving With Pipelined Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 The Kernel Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Timing Diagrams for y = x² + z² . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 OpenSPL Basics 13
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 An Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 CPU Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 MaxJ Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.4 Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Maxeler IDE 19
4.1 Accessing the OpenSPL Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.1 Installation of VM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 hermes.niu.edu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Maxeler First Time Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Set up Your Workspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2.1 Documentation Available in the IDE . . . . . . . . . . . . . . . . . . . . 21
4.2.3 Importing an Example Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 MaxIDE Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6 The Kernel 45
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Widening the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 Overloading a Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.2 An N-fold kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Temporal Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3.1 y = x² + z² + z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3.2 y = x² + z² + z − x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.3 Reality Check (There’s a Pipeline in my Pipeline!) . . . . . . . . . . . . . . . . . . 52
D Java Resources 63
D.1 Web Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
D.1.1 MIT OpenCourseware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
D.1.2 Introduction to Programming Using Java . . . . . . . . . . . . . . . . . . . . . . . 63
D.1.3 A Primer on Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
D.2 References From Multiscale Dataflow Programming[1] . . . . . . . . . . . . . . . . . . . . . 64
Bibliography 80
Index 81
1
Introduction
1.1 The FPGA

An FPGA (Field Programmable Gate Array) is a type of integrated circuit (as opposed to a computing system consisting of many parts) that, as its name implies, can be programmed to perform various functions.
The fact that it is field programmable means that it can be programmed after it has left the factory.
Being a gate array, it is programmed by specifying the manner in which its gates are to be interconnected.
1.1.1 Evolution
Over the years FPGA manufacturers have improved upon the operations that the so-called “gates” can
perform to the point where the more advanced devices are far from containing just simple logic gates.
In spite of the continued presence of the word “gate” in their name, an FPGA is an array of CLBs
(Configurable Logic Blocks) that range from simple logic to complex truth tables (called LUTs) and
registers (simple memories), and a plurality of other types of Hard IP¹ such as mathematical units of various types, specialized control units (for accessing large memories), communication units (for Ethernet links, PCIe, and others), and even whole multicore CPUs such as the SPARC or ARM.
A simplified CLB block diagram is shown in Figure 1.1. The LUT contains a truth table with as many rows as its input bits can enumerate. The clock signal and associated D-latch comprise a register that is used to store/remember the last value that was “looked up” in the truth table.² Note that a LUT and a latch each take a nonzero period of time to emit a result after their input bit(s) have changed.
Programming an FPGA consists of specifying 1) the values of the bits in the LUTs and 2) a network map (referred to as a netlist) that describes which signals (the bits) flow out of one block (a CLB or some Hard IP) and into another. Ultimately the signals originate on some of the pins of the FPGA chip and terminate at others which are, in turn, connected to other devices such as an Ethernet link, large memory chip(s) and/or the PCIe signals in a PC so that the FPGA can interact with the rest of the world and perform useful work.
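As a concrete, highly simplified sketch, a 2-input LUT is nothing more than a 4-entry truth table, and “programming” it means choosing the stored bits. The function modeled below (XOR) and the helper names are illustrative assumptions, not taken from any particular FPGA:

```c
/* A 2-input LUT modeled as a 4-entry truth table, indexed by the
   two input bits.  These particular bits implement XOR. */
static const int lut_xor[4] = {0, 1, 1, 0};

/* "Look up" the output for inputs a and b (each 0 or 1). */
static int lut_eval(const int lut[4], int a, int b)
{
    return lut[(a << 1) | b];
}
```

Choosing different bits for the table, rather than different hardware, is what makes the same silicon implement AND, OR, or any other 2-input function.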
Since the late ’80s, languages such as Verilog and VHDL, both known as HDLs (Hardware Description Languages), have been used to program FPGAs. These languages are akin to using an assembly language to program a CPU. (If we continue this analogy downward, then creating a netlist by hand would be akin to typing in CPU machine code in binary.) While assembly code is necessary for some specific functions and can often result in the most efficient execution of a program on some CPUs, the
¹ Hard IP (Intellectual Property) refers to commonly used functions that might have been historically programmed into the FPGA by using multiple CLBs, but are more efficiently built by dedicating part of the silicon of the chip to a specific purpose.
² Actual CLBs include additional components like a Full Adder because implementing them using LUTs would be inefficient.
additional effort required and lack of portability often drive programmers to use simpler, higher-level languages like C or Java. . . at the expense of (possibly) ending up with slower-performing code.
Where FPGA programming is concerned, the next “higher-level” languages are being invented and discovered right now. One such language is called OpenSPL, and it is the subject of the rest of this book. OpenSPL is expressed as a combination of C and Java.
The applications discussed in this book were designed to execute on a Maxeler DFE (Data Flow Engine) board. A DFE is what is generally known as a co-processor or application accelerator because it is connected to a traditional computer³ in the form of a peripheral device, similar to a hard drive or audio interface, as seen in Figure 1.2.
As a PC peripheral device, the DFE board is connected to the PCIe (Peripheral Component Interconnect Express) bus. The PCIe bus is a set of high-speed serial lanes. The DFE upon which this text focuses is the Maxeler ICSA. The ICSA has eight lanes.
As an eight-lane PCIe device, the ICSA DFE can exchange up to eight simultaneous streams of data
with the main memory of a host PC. These streams represent one of the types of I/O that an OpenSPL
program can use. Other types of I/O include various types of memory and serial interfaces such as
Ethernet that can be connected directly to an FPGA.
Note that from the perspective of an FPGA even memory starts to “look” and act like a peripheral
device in that it requires the application to read from and write to it!
Each of the (on the order of) 1,000,000 CLBs operates independently, providing a great deal of fine-
grained parallelism.
³ For the sake of completeness, it is important to point out that stand-alone FPGA applications do not require connection to a computer. Other applications are implemented using FPGAs that include an entire CPU within the FPGA.
By connecting CLBs together to create complex functions and connecting the output of one function to the input of another, one or more chains or pipelines can be created that can receive/read one or more streams of data, process them in some way and then transmit/write one or more resulting streams of data. OpenSPL is well suited for implementing solutions to such “streaming” applications.
1.2 Some Words From The Marketing Department

Stratix® 10 FPGAs and SoCs combine the industry’s highest performance (2X), and highest density (5.5M LEs) with advanced embedded processing capabilities (quad-core ARM® Cortex®-A53), GPU-class floating-point computation performance of up to 10 tera floating-point operations per second (TFLOPS), heterogeneous 3D system-in-package (SiP) integration, and the most advanced security capabilities in a high-performance FPGA.
Searching the Internet for maximum performance numbers on Intel processors is tough since there are so many variations available. As of Q4 2014, it appears that the fastest Intel CPUs are capable of approximately 1 TFLOPS.
Keep in mind that all of these “maximum speeds” are theoretical and are not likely to be achieved unless one is extremely careful about designing and writing code to suit the needs of each of the specific devices.
2
Dataflow Computing
A fairly obvious conclusion which can be drawn at this point is that the effort expended on
achieving high parallel processing rates is wasted unless it is accompanied by achievements
in sequential processing rates of very nearly the same magnitude.[4]
—Gene M. Amdahl, 1967
IBM
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.¹
Slotnick’s Law:
The parallel approach to computing does require that some original thinking be done about
numerical analysis and data management in order to secure efficient use.
In an environment which has represented the absence of the need to think as the highest
virtue this is a decided disadvantage.
—Daniel Slotnick, 1967
Chief Architect
Illiac IV
It is the purpose of this text to discuss some original thinking about numerical analysis and data
management while keeping an eye on the requirements of sequential processing in order to maximize the
performance of an application.
¹ See [5, Section 7.12] for a discussion of the pitfalls of improperly interpreting Amdahl’s Law.
2.1 An Example Problem

As a CPU iterates over the body of the loop, multiple operations take place to compute the right side of the assignment statement; they must all complete before the assignment is made on line 6, and the assignment must complete before proceeding to the next iteration of the loop.
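The loop listing itself is not reproduced in this extraction; a minimal C version consistent with the RTL that follows (and with the assignment falling on line 6) might look like this sketch. The function and array names are assumptions:

```c
/* Sketch of the example problem: y[i] = x[i]^2 + z[i]^2 for each i. */
void square_sum(const float *x, const float *z, float *y, unsigned n)
{
    unsigned i;
    for (i = 0; i < n; i++)
        y[i] = x[i] * x[i] + z[i] * z[i];
}
```

Every discussion in this chapter revisits this same computation, first serially on a CPU and then as a pipeline.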
2.2 Problem Solving With a CPU
Focusing only on the body of the loop, we know that a CPU will perform the following operations one at a time (described in Figure 2.1 using RTL (Register Transfer Language[7]) notation).²
t1 ← x[i]
t1 ← t1 × t1
t2 ← z[i]
t2 ← t2 × t2
t2 ← t1 + t2
y[i] ← t2
i ← i + 1

Figure 2.1: RTL for one iteration of the loop body of y[i] = x[i]² + z[i]².
Using a timing diagram3 we can see how and when the ALU (Arithmetic Logic Unit) and the
memory interface units of the CPU are used over the course of time while the CPU executes the RTL in
Figure 2.1. A timing diagram shows what operations take place in each functional unit over a continuum
of time. When a unit is performing a useful task, such as squaring the number a, that particular operation
² RTL is commonly used as an intermediate language in compilers.
³ For more information on timing diagrams see: http://en.wikipedia.org/wiki/Digital_timing_diagram
is indicated with a × a. Note that, as depicted here, any input(s) to an operation are read/sampled once at the beginning of the time period; the results of the operation are provided at the end and are held stable until the next output value is generated. An idle unit is indicated with a gray time period.⁴ The width of an item in a timing diagram is proportional to the amount of time used to execute the specified operation. The position on the diagram’s horizontal axis represents the span of time over which the operation takes place.
The background colors in the timing diagrams in this book have been chosen to indicate the type of operation being performed. ALU operations are displayed in blue; memory transfers, in amber.
Consider the simplest case, where a CPU can only do one thing at a time, and let us assume that each of the 7 operations in our loop body takes the same amount of time to complete. Each iteration of the loop body results in the execution of the same operations in the same order. Figure 2.2 is a timing diagram showing how the first two iterations of our loop perform a total of 7 × 2 = 14 operations, each consuming one unit of time. That our example CPU can only do one thing at a time is made evident by the fact that only one of the units is not idle at any point in time.
Figure 2.2: Two serial iterations of y[i] = x[i]² + z[i]²; one operation at a time over 14 time units.
If a CPU is capable of exchanging data with memory at the same time that it is performing an operation with its ALU, and it can “look ahead” in the instruction stream, then it is possible for it to optimize the use of its functional units by scheduling more than one at the same time.[9] As long as care is taken to ensure that the data required for any given operation is present when the operation starts, operating the units in parallel will reduce the time that it takes to execute the body of our loop from 7 to 5 units per iteration. Figure 2.3 shows how a total of 5 × 2 = 10 units of time are used to complete two iterations of our loop when the CPU schedules its functional units in parallel. The performance improvement is evident as the same 14 total units of time are allocated to the same 14 operations. The only difference is when they have been scheduled to take place.
Figure 2.3: Two iterations of y[i] = x[i]² + z[i]² with memory transfers overlapped with ALU operations (10 time units).
We can, however, change the RTL to better suit our needs. Using a technique called loop unrolling [10, p. 735], we can rewrite our program showing the iterations of our loop in the form of one long instruction stream. To better illustrate what is happening, we will now consider three iterations of our loop body and introduce an additional variable k that we will use along with i as our index counter, as shown in Figure 2.4.
⁴ The notion of any part of a machine being “idle” is a misnomer. Unless the power is removed, nothing actually stops per se. When used in the context of a timing diagram or pipeline, idle literally indicates that the specified unit’s activities are not consequential because its output will go unused during the indicated idle period. As an optimization, modern processors will dispatch specific instructions that are known to consume the least amount of power during such idle periods. It is easy to demonstrate the results of this by detecting the temperature (and fan speed) changes of a laptop when its activity changes from idle to busy.
Iteration 1     Iteration 2     Iteration 3
k ← i + 1       i ← k + 1       k ← i + 1
t1 ← x[i]       t1 ← x[k]       t1 ← x[i]
t1 ← t1 × t1    t1 ← t1 × t1    t1 ← t1 × t1
t2 ← z[i]       t2 ← z[k]       t2 ← z[i]
t2 ← t2 × t2    t2 ← t2 × t2    t2 ← t2 × t2
t2 ← t1 + t2    t2 ← t1 + t2    t2 ← t1 + t2
y[i] ← t2       y[k] ← t2       y[i] ← t2

Figure 2.4: Three iterations of the RTL unrolled-loop version of y[i] = x[i]² + z[i]².
Using two counter registers (i and k) instead of one (i), and interleaving which counter is used during each of the original loop bodies, we can now see how the operations can be reordered to better suit the capabilities of our two-unit CPU. Advanced CPUs are capable of looking far enough ahead in the instruction stream to implement this type of optimization using what is called out-of-order execution.[9] The CPU can now use the ALU during the first time unit in each loop body to perform the counter increment for the next loop body, and then it can relocate the t1 ← x[i] load. As shown in Figure 2.5, out-of-order execution allows the first iteration of our loop to complete in 5 units of time and the remaining iterations in 4, completing 3 loops in less time than it took the original CPU design to do 2. The cost for this optimization is that the CPU has to allocate an additional register.
Using an extra register as a way to save time is a trade-off between space (silicon to make/use another register) and time (additional cycles required to perform operations that can not be scheduled to occur at other times).
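At the C level, the same space-for-time trade can be sketched as a loop unrolled two iterations at a time, mirroring the alternating i/k index registers of Figure 2.4. This is an illustrative sketch (n is assumed even for brevity):

```c
/* Two iterations per pass: the i-indexed body and the k-indexed body
   have no dependency on each other, so a CPU (or compiler) is free
   to interleave their loads, multiplies, and stores. */
void square_sum_unrolled(const float *x, const float *z,
                         float *y, unsigned n)
{
    for (unsigned i = 0, k = 1; k < n; i += 2, k += 2) {
        y[i] = x[i] * x[i] + z[i] * z[i];
        y[k] = x[k] * x[k] + z[k] * z[k];
    }
}
```

The unrolled form exposes the independence of adjacent iterations that out-of-order hardware exploits automatically.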
Figure 2.5: Three iterations of out-of-order RTL-parallel execution of y[i] = x[i]² + z[i]².
The ALU is now saturated with work. Therefore we have gone as far as we can. . . with a CPU that has only two functional units.
Further optimization would require that we either eliminate some of the operations or add more functional
units to further distribute the work. For example we could add an additional ALU and another path to
access the memory in the system.
The extent to which adding functional units is helpful depends on how many of the operations must be
completed before others can begin as well as the ability of the CPU to schedule the instructions across
all of the units in an efficient manner.
While all of this is possible, some types of problems are more easily optimized by using a pipeline than
a CPU.
2.3 Problem Solving With Pipelined Computing
Amdahl’s law states that gaining efficiency by performing operations in parallel is limited by the amount of time that is required to execute the longest single serial task. Serial, in this context, refers to the dependency path through the sequence of events that must be performed in order.
2.3.1 The Kernel Graph

To better understand the dependency path in our application, let us express it in the form of a network where edges represent dependencies and nodes represent operations. Let us call our diagram a Kernel graph and draw it as shown in Figure 2.6. The multiplication units (the top two blue circles) have their input data flowing in from the x and z (inverted orange house) data-source units and their results flowing out to an adder unit (the bottom blue circle) that, in turn, has its output flowing to the y (orange house) data-sink unit. This notation was adopted from [1, p. 23].
Figure 2.6: The Kernel graph for y = x² + z².
Expressed in this form, we can see that the x and z data items can (theoretically) be fetched simultaneously because they have no direct or indirect dependencies shown in the graph. The two multiplication operations can also happen at the same time, as long as they have data delivered from x and z. The addition can not start until after both multiplications have completed, because the data items that the addition requires flow in from the multipliers. Finally, the delivery of the sum to the y data-sink can not start until the addition has completed.
At first glance it appears that 4 units of time is the best we can do. But, as Slotnick pointed out, “some original thinking” might offer additional opportunity.
2.3.2 Timing Diagrams for y = x² + z²

Allocating and dedicating a functional unit for each node in our kernel graph and parallelizing them might
yield a machine that operates as shown in Figure 2.7 for the case when x = {1, 2, 3} and z = {4, 5, 6}. For now let us continue to assume that i is (somehow) initialized to 0.
t1 ← x[i]:      1, 2, 3
t2 ← z[i]:      4, 5, 6
t1 ← t1 × t1:   1, 4, 9
t2 ← t2 × t2:   16, 25, 36
t2 ← t1 + t2:   17, 29, 45
y[i] ← t2:      17, 29, 45
i ← i + 1:      1, 2, 3
(time units 0–12)
Figure 2.7: A naive pipelined implementation of y[i] = x[i]2 + z[i]2 for x = {1, 2, 3} and z = {4, 5, 6}.
Our timing diagram now has a stair-step characteristic due to the chain of dependencies in the operations that can not be parallelized.
Note that our notation for labeling the timing diagram has changed. Since each “row” now represents a single-purpose dedicated functional unit, we can indicate the one and only operation that each performs along the left edge of our diagram, as opposed to within the time periods as was the case earlier, when each of the units was used for different operations at different times. We can take advantage of this situation by indicating the values of the data that are output by each functional unit at every point in time.
However, by trading space for time and adding some more temporary registers, we can see that the stair-steps can be collapsed as shown in Figure 2.8. This time we assume that i and k are initialized to 0 and that x = {1, 2, 3, 4, 5, 6, 7, 8, 9} and z = {4, 5, 6, 7, 8, 9, 10, 11, 12}.
This time we use a separate counter (k) for the x and z inputs than we do for the y output (i) because, while they count the same things, they now have to do so at different times. We have also added enough temporary registers so that there is now one for every edge in our Kernel graph.
By not reusing any registers for more than one specific purpose, we have eliminated the need for
any functional unit to wait on any other unless the two have a problem-specific data-dependency
(represented by an edge in the Kernel graph) between them.
We can now see that the first “iteration” of our loop takes 4 time units and the rest each take 1. The
first 3 time units in Figure 2.8 represent what is called filling the pipeline. The last 3 time units
represent what is called flushing the pipeline.
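Figure 2.8 can be checked with a small software model of the pipeline. Each simulated “time unit” latches every register from the previous unit’s values, so one result emerges per unit once the 3-unit fill completes. This sketch is for illustration only; it is not how a DFE is actually programmed:

```c
/* Software model of the 4-stage pipeline of Figure 2.8: the write
   stage consumes t5 before every register latches its next value. */
static int pipeline_model(const int *x, const int *z, int *y, int n)
{
    int t1 = 0, t2 = 0, t3 = 0, t4 = 0, t5 = 0;   /* pipeline registers */
    int k = 0, i = 0, tick;

    for (tick = 0; i < n; tick++) {
        if (tick >= 3)                  /* pipeline is full after 3 units */
            y[i++] = t5;
        /* each stage computes from the previous tick's register values */
        int n1 = (k < n) ? x[k] : 0;    /* fetch stage */
        int n2 = (k < n) ? z[k] : 0;
        int n3 = t1 * t1;               /* multiply stage */
        int n4 = t2 * t2;
        int n5 = t3 + t4;               /* add stage */
        t1 = n1; t2 = n2; t3 = n3; t4 = n4; t5 = n5;
        k++;
    }
    return tick;                        /* total time units consumed */
}
```

Run on x = {1, …, 9} and z = {4, …, 12}, this reproduces Figure 2.8’s outputs (17, 29, 45, …, 225) in 9 + 3 = 12 time units.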
t1 ← x[k]:      1, 2, 3, 4, 5, 6, 7, 8, 9
t2 ← z[k]:      4, 5, 6, 7, 8, 9, 10, 11, 12
k ← k + 1:      1, 2, 3, 4, 5, 6, 7, 8, 9
t3 ← t1 × t1:   1, 4, 9, 16, 25, 36, 49, 64, 81
t4 ← t2 × t2:   16, 25, 36, 49, 64, 81, 100, 121, 144
t5 ← t3 + t4:   17, 29, 45, 65, 89, 117, 149, 185, 225
y[i] ← t5:      17, 29, 45, 65, 89, 117, 149, 185, 225
i ← i + 1:      1, 2, 3, 4, 5, 6, 7, 8, 9
(time units 0–12)

Figure 2.8: A properly pipelined version of y[i] = x[i]² + z[i]² for x = {1, 2, 3, 4, 5, 6, 7, 8, 9} and z = {4, 5, 6, 7, 8, 9, 10, 11, 12}.
2.4 Observations
1. The more data elements we process, the greater the advantage gained by our pipelined implementation due to amortization of the pipeline fill and flush costs. Therefore: the complexity as n → ∞ is O(n).
2. The complexity of the function implemented will determine the number of stages required in the
pipeline.
3. The number of stages in the pipeline will define the latency in our design. Latency is the amount
of time between the arrival of data element(s) at the input unit(s) and the corresponding result
leaving the output unit(s).
4. The duration of one time unit is equal to the latency divided by the number of stages in our
pipeline.
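These observations can be quantified with a small worked example (a sketch; s denotes the number of pipeline stages, τ the duration of one time unit, and L the latency):

```latex
T_{\text{pipe}}(n) = (n + s - 1)\,\tau, \qquad \tau = \frac{L}{s}
```

For Figure 2.8, s = 4 and n = 9, so the pipeline finishes in 9 + 3 = 12 time units, versus 7 × 9 = 63 for the one-unit CPU of Figure 2.2; the cost per element approaches one time unit as n grows, hence O(n).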
3
OpenSPL Basics
This chapter provides a brief overview of OpenSPL as a system for creating pipelines for dataflow
computing.
With a basic understanding one can navigate the Maxeler IDE, run example programs, write a “Hello
World!” application, simulate its execution and deploy applications on a DFE (Data Flow Engine).
3.1 Introduction
OpenSPL is an open specification for a Spatial Programming Language. The operations of a spatial
program exists in space rather than as a sequence of operations over time. This means that all of the
operations of program can happen at once and that the notion of what it means to execute a program
is more about getting the data in and out of the system as opposed to the sequence of events that take
place in typical procedural languages.
OpenSPL applications tend to manifest themselves in the form of one or more pipelines that are deployed
using an FPGA (Field Programmable Gate Array) that, when acting in this fashion, is referred to
as a DFE.
An application that uses a DFE to improve its performance does so by using it as an application accelerator. In doing so, the code that executes sequentially and that which executes spatially are written using two different styles and languages.
The code that runs sequentially can be written in a language like C and runs on a CPU in the manner to which any C programmer is accustomed. The code that runs spatially is written in a variation of Java called MaxJ and ultimately runs on an FPGA on a DFE.
The coordination of compiling everything can be performed by the MaxIDE (Eclipse) and consists of
executing the MaxCompiler to compile the MaxJ code into a .max file suitable for configuring an FPGA, and a standard C compiler to create an executable program for the CPU. See Figure 3.1.
When the C application runs it can implicitly or explicitly configure and use the DFE with one or more
.max files to process data streams.
CPU code can call functions that are generated by the MaxCompiler. Supported languages include C,
Python, Matlab and R. This text will focus on the use of C code on the CPU.
The CPU code is no different than any other application you might write. The CPU portion of an OpenSPL application requires adding as little as one #include and one function-call statement to exchange data streams with the DFE.
To the C code, the data streams exchanged with the DFE are regions of memory like an array or a buffer created by calling the malloc(3) library function.
The above-mentioned header file to include and the function(s) to call to use the DFE are generated by
the MaxCompiler when it builds the MaxJ files. The generated function(s) use the SLiC (Simple Live
CPU) library interface that is part of the OpenSPL system. The SLiC interface provides the low-level
services needed for the CPU to configure and exchange data with one or more DFEs.
3.2. AN ACCELERATOR ARCHITECTURE
• One or more Kernels (pipelines) that are responsible for processing data streams.
• A single Manager that tends to the movement of data between the host system memory,
Kernels, State Machines and memories on the DFE.
The MaxJ code is written in an extended version of Java which adds operator overloading to the base
Java language. MaxJ source files have a .maxj file extension to differentiate them from pure Java.[1,
p. 20] The operator overloading makes mathematical operations used in Kernel pipelines easier to
express.
There is a subtlety hiding in the box labeled Hardware Build or Simulation in Figure 3.1. As
part of the build process, your MaxJ/Java application is actually executed. The output of the
Java application is used to create the netlist that is ultimately deployed on the DFE.
The ultimate output of the MaxCompiler is a .max file that contains the executable DFE code and a
.h file that contains generated function declarations and constants required to compile the CPU code.
Thus the MaxJ code is compiled before the C code because the MaxJ code is the origin of that which
defines the interface to the DFE code.1
The life-cycle of the DFE code is similar to that of any executable program. . . as long as the responsibilities
of the operating system are taken into account: [1, p. 112]
1. Load - A .max file is loaded onto a DFE by the CPU code. The DFE card is now exclusively owned
by the calling CPU process. Loading the .max file takes on the order of 100 ms to 1 s.
2. Run - The CPU code calls SLiC functions to execute actions on the DFE. A loaded .max file should
be utilized for long enough to justify having waited up to a second to load it.
3. Unload - The DFE is released by the CPU process, which returns it to the pool of DFEs managed
by MaxelerOS for use by other applications.
The Basic Static² SLiC interface implicitly loads the .max file onto the DFE when the first SLiC
function is called, and then unloads the DFE and frees the .max file when the CPU code terminates.
This means that your application will stall if/when the DFE card(s) are in use by another application
until that application terminates (or otherwise explicitly releases the DFE).
Note that a single application may serially reuse one DFE card, handling the loading and unloading
of multiple .max files via the SLiC API.
1 This creates a chicken-and-egg problem when it comes to writing the CPU application because the order of the
arguments in the DFE-generated functions is not known until the MaxJ code has been compiled. To address that problem,
stub in a call to the function with its parameters left out and compile the application; it will fail on the call with incorrect
arguments. Then look at the generated header file (or use the IDE ‘insight’ feature to see what the arguments are), add
them and recompile. Empirical evidence suggests that the ordering is reliably reproducible and sorted alphabetically by type.
2 One of three SLiC interfaces discussed in ??
Once the code for Kernels and the Manager are combined they form a complete dataflow program.
The execution of this program results in either the generation of a dataflow engine configuration
file (.max file), or the execution of a DFE simulation. In either case, the MaxCompiler always
generates an include file to go with a .max file.[1, p. 31]
3.2.3 Kernel
An OpenSPL application will contain one or more Kernels. A Kernel that implements the Kernel graph
shown in Figure 2.6 would contain logic like that shown in Listing 3.1.
Listing 3.1: KernelBody.maxj
The body of a simple kernel.
1 DFEVar xs = io.input("x", dfeFloat(8, 24));   // A float stream called x
2 DFEVar zs = io.input("z", dfeFloat(8, 24));   // A float stream called z
3
4 DFEVar ys = xs * xs + zs * zs;                // y = x^2 + z^2
5
6 io.output("y", ys, dfeFloat(8, 24));          // A float stream called y
As can be seen the Kernel defines the name and type of data for each stream that it will process along
with the operations it will perform on the data stream. In this case it will sum the squares of the
elements in the x and z input streams and write the result to an output stream named y.
The data type of each of the three streams is identical and set to dfeFloat(8, 24). This is the
OpenSPL way of defining what would appear in a C program as a float.
3.2.4 Manager
An OpenSPL application will contain one and only one Manager. The Manager coordinates the data
flow between the CPU, Kernels, the DFE’s memory and other devices depending on the particular type
of DFE card(s) in the system.[1, p. 20] Each of these dataflows is called a stream.
The simplest of all Managers is one that connects all of the I/O defined in a single Kernel to the CPU and
is shown in Listing 3.2.[11, p. 41]
Listing 3.2: SimpleManager.maxj
The simplest of Managers.
1 public static void main(String[] args)
2 {
3     EngineParameters params = new EngineParameters(args);
4     Manager manager = new Manager(params);
5     Kernel kernel = new SimpleKernel(manager.makeKernelParameters());
6     manager.setKernel(kernel);
7     manager.setIO(IOType.ALL_CPU);
8     manager.createSLiCinterface();
9     manager.build();
10 }
This Manager makes boiler-plate calls to initialize the OpenSPL environment in lines 3 and 4.
The Kernel is created in line 5 and the default parameters are passed to the Kernel object’s constructor.
All of the scalar and stream I/O variables are routed to the CPU application in line 7. This means that
they will appear in the generated C-callable function in the generated .h file and will be named based
on how the Kernel named them in the io.input() and io.output() calls such as those in Listing 3.1
on lines 1, 2 and 6.
4
Maxeler IDE
This chapter will present the Maxeler IDE to orient the reader before writing a first program.
There are two ways to access the Maxeler IDE at NIU. Note that some documentation resources mention
the availability of a web-based IDE. This is not available at NIU.
4.1.1 Installation of VM
Download and install the Maxeler VM from the University program web site and execute it using a lab
PC or your own. In this configuration you will be limited to developing and testing applications
using a simulation environment.
As you will see, the simulation environment is where you will do the majority of your work. You will
want to use this.
See Appendix B for details on installing and using VMware on Linux, Mac and Windows systems.
4.1.2 hermes.niu.edu
Using the software on hermes.niu.edu will allow for deployment of applications on real DFE hardware
for final release testing and timing analysis.
CHAPTER 4. MAXELER IDE
Alternatively, the MaxIDE icon on the desktop (on the VM) may be clicked.
Note that the Maxeler IDE is based on Eclipse. See http://www.eclipse.org/ for general information
about the Eclipse IDE.
When started the first time, the IDE will present a Workspace Launcher window (see Figure 4.2) asking
you where to put all of your files. Accepting the default of ~/workspace should be suitable.
4.2.2 Documentation
When the IDE is started and there are no projects to display (as is the case when running it for the first
time) a Welcome window is displayed (Figure 4.3).
4.2. MAXELER FIRST TIME USE
Note: When running the IDE on hermes.niu.edu you may not be able to view any of the help documents.
The Welcome window presents a number of tutorials on how to use OpenSPL. These documents are
very useful. It is recommended that they be skimmed early on in order to familiarize yourself with what
is there so that help can be located down the road when it is needed.
Also appearing on the Welcome Window is a link to a set of example projects that are discussed in the
tutorial documents.
Select the Auto-import MaxCompiler tutorial projects link in the Welcome window (Figure 4.3).
Select MaxCompiler Dataflow Programming Tutorial from the menu box, check the examples
box, and then click Finish in the Import MaxCompiler Projects window (Figure 4.4).
CHAPTER 4. MAXELER IDE
This will import every project discussed in the Open MaxCompiler Dataflow Programming Tutorial.
Once completed the IDE will replace the Welcome window with the Project Explorer panel and list all
of the imported projects.
Click on the Select Run/simulation box over the Project Explorer and choose simulation.
Then click the play button icon to run it.
It will build and run your application.
If you receive a pop-up/warning about the simulator being started outside the IDE, select the force
reset option.
Output from the run shows up in the terminal window below the build messages.
Once things have run, you can open the Run Rules > Simulation > Final Kernel Graph to see a
diagram of the dataflow for the kernel and/or manager as seen in Figure 4.9.
4.3. MAXIDE PROBLEMS
Note that generated C headers appear in the workspace area under the CPU Code part of the project
tree.
If your connection to the server fails and/or the MaxIDE crashes, you may find that your workspace
has become corrupted. If starting the MaxIDE results in displaying a broken workspace, terminate it
immediately and start it again. Additionally, you will want to back up copies of your source code files
early and often.
See Appendix E for instructions on how to use SVN to backup and hand in your work.
5
Your First OpenSPL Program
5.1 Introduction
This chapter presents the details of creating a new project that will implement an application that
calculates si = xi² + 30a.
The goals are to learn how to create new projects using the IDE, and alter the CPU, Manager and
Kernel code to add/remove items from the SLiC interface.
To create a new application from scratch we will use the IDE to create an application from a template
based on the type of application we wish to develop and then change it to suit our needs.
In this example we will create an application with a Standard Manager using CPU Streams that will be
compiled for the Icsa DFE hardware we will be using.
The stub application template created by the IDE will implement si = xi + yi + a. The template code
will be altered in order to implement si = xi² + 30a.
CHAPTER 5. YOUR FIRST OPENSPL PROGRAM
Begin by clicking the New icon in the upper-left of the IDE and select MaxCompiler Project from
the menu as shown in Figure 5.1.
This will open a window that will prompt you for the details needed to create your project.
5.1. INTRODUCTION
In the window that opens enter a suitable name for the new project, set the Manager Templates to CPU
Stream (Vector Addition) and click Next (Figure 5.2).
The Project name field is used as the name that will appear in the Project Explorer tab on the left of
the IDE.
Set the DFE Model to the type of hardware you are targeting with your application (hermes.niu.edu
has an Icsa MAX4AB24B), choose a Standard Manager, provide a Stem Name for your manager
and kernel and then click Next (Figure 5.3).
Provide a suitable name for the file that will contain your C source code (the default is suitable), set the
SLiC Interface type to Basic Static and click Finish (Figure 5.4).
Have a look at the template files by opening the project in the Project Explorer tab and navigating to
your CPU Code and Engine Code Manager and Kernel files. Double-click on the TestStreamCpuCode.c,
TestStreamKernel.maxj, and TestStreamManager.maxj files to see them in an editor tab (Figure 5.5).
Note: At the moment we are not interested in the TestStreamEngineParameters file.
To test your application you can quickly build it for simulation and run it without using the DFE. In
simulation, your code will compile quickly and run slowly. But only a simulation can be debugged using
watches in the IDE. (Compiling for the DFE takes at least 20 minutes and cannot be easily debugged
while running!)
Building a project for simulation can be done by right-clicking on its name in the Project Explorer tab.
Right-click on FirstProject, then navigate the menu to Run As and click on Simulation (Figure 5.6).
While a project is building or running, messages will appear in the Console tab in the IDE (Figure 5.7).
Recall that the C application prints “Running on DFE.” and “Done.” from lines 22 and 31, respectively,
in Listing 5.1. We see those lines appearing timestamped at “Thu 19:17” in the Console tab thus
verifying that the application has executed.
The IDE generates a graphic version of the pipeline created by the kernel. You view it by clicking on
Original Kernel Graph in the Project Explorer tab.
The original kernel graph represents the kernel as described by the MaxJ code.
5.2. CONVERT TEMPLATE CODE TO DESIRED APPLICATION
The IDE also generates a graphic version of the optimized pipeline. You view it by clicking on
Final Kernel Graph in the Project Explorer tab.
The final kernel graph shows how the DFE will actually process the dataflow. It shows the optimizations
that are applied and indicates when and how buffering is used for temporal alignment.
In this trivial kernel, we can see that a three-input adder has been created to perform the kernel operation
rather than a cascade of two-input adders. We do not see any temporal alignment because the kernel is
trivial enough not to require any.
si = xi + yi + a (5.2.1)
. . . and it would also be nice to see the input and output data streams so that we can hand-verify our
code.
To accomplish these changes, we will add some printing logic, remove the y input stream from the SLiC
interface, change the computation performed by the kernel and change the verification logic to match
the new kernel computation.
To see what is going on we add printing logic to dump the input and output data streams. We need
not use large data sets to test this application, so we will also reduce the number of elements in the I/O
streams to 96 to minimize the noise-level.
Note the addition of printInt32Vector() on line 8 and the calls to it on lines 34–41 in Listing 5.5.
Listing 5.5: workspace/FirstProject2/CPUCode/TestStreamCpuCode.c
Add print logic to the CPU code.
1 # include < math .h >
2 # include < stdio .h >
3 # include < stdlib .h >
4
Running the new version of the application renders the output shown in Listing 5.6.
Open the TestStreamManager.maxj file and replace line 22 through the end of the file with lines xx–yy
shown in Listing 5.7 (which can be copied from the MovingAverageSimpleManager.maxj tutorial source),
thus replacing 38 lines of code with 5.
Remove the extra input stream by deleting the DFEVar y definition on line 16 in TestStreamKernel.maxj
and change the assignment on line 20 as shown in Listing 5.8.
Remove the y stream and change the verification logic in the CPU code to match the new kernel. See
changes on lines 23, 29, 37–38 and 45 in Listing 5.5.
Run your program one more time and see the final output in Listing 5.10.
15 70 13 26 91 80 56 73 62 70 96 81 5 25 84 27 36 5 46 29 13 57 24 95 82 45
14 67 34 64 43 50 87 8 76 78 88 84 3 51 54 99 32 60 76 68 39 12
11 Tue 15:22: Output x
12 Tue 15:22: 6979 7486 6019 315 8739 1315 7486 8554 2491 531 3934 819 8190 3571
4059 766 1690 766 5274 1386 211 4714 4579 931 6814 990 3934 619 4579 1315
931 94 574 3454 4851 4579 8739 3226 211 1854 931 5419 531 451 7146 1459 9694
666 315 4990 259 766 8371 6490 3226 5419 3934 4990 9306 6651 115 715 7146
819 1386 115 2206 931 259 3339 666 9115 6814 2115 286 4579 1246 4186 1939
2590 7659 154 5866 6174 7834 7146 99 2691 3006 9891 1114 3690 5866 4714 1611
234
13 Tue 15:22: Done .
14 Tue 15:22: Process terminated with exit code 0.
6
The Kernel
The Kernels present in an OpenSPL application are responsible for the “data processing.”
6.1 Introduction
Maximum performance in a Maxeler solution is achieved through a combination of deep-
pipelining and exploiting both inter- and intra-Kernel parallelism. The high I/O-bandwidth
required by such parallelism is supported by flexible high-performance memory controllers
and a highly parallel memory system.[1, p. 20]
The computation-to-data ratio, which describes how many mathematical operations are per-
formed per item of data moved, is a key metric for estimating the performance of the final
dataflow implementation. Code that requires large amounts of data to be moved and then
performs only a few arithmetic operations poses higher balancing challenges than code with
significant localized arithmetic activity.[1, p. 22]
A simple and straightforward method of improving the performance of an OpenSPL application is to
do as much as can be done with the data that is present in one kernel.
Multiple streams (related or not) can flow through a kernel pipeline at the same time.
In some situations the same data streams are needed for multiple different calculations. For example, an
application might require the following calculations:
CHAPTER 6. THE KERNEL
a = x² + y²
b = x² + z²
c = x + y + z
d = x · y · z
e = x² + y² + z²
Putting all five of these functions into a single kernel will allow the x, y and z streams to be transferred
into the DFE once and only once as opposed to the five times that it would otherwise take if each
function were implemented in a separate kernel and invoked serially. See the graph for this kernel in
Figure 6.1.
This kernel will generate all five output streams in the same amount of time that would be required to
execute only one of the above equations, as seen in Figure 6.2.
Another variation on the theme of doing more data processing at the same time is to observe that
sometimes it is not that different operations are performed on the same data but, rather, that the same
operations are performed on different data.
a = o + p + q + r
b = s + t + u + v
6.2. WIDENING THE PIPELINE
. . . that would produce the kernel graph in Figure 6.3. To implement such an application, one kernel
with eight input streams can be created, or a more elaborate Manager could be used to create two
copies of a four-input kernel and connect them to the eight input streams.
So far we have only considered kernels that can be implemented without any internal delays. Some
functions require the same data to be applied to different parts of the same pipeline. . . at different
times.
6.3.1 y = x² + z² + z
Solving y = x² + z² was straightforward due to the clean symmetry of its kernel graph (see Figure 2.6).
If we add z to the sum of the squares, something interesting happens.
Note the addition of the red edge in Figure 6.4 indicating the need to add z to x² and z².
The problem with this graph is that when one considers the pipeline expressed in Figure 2.8, the z values
have to be made available to the adder at a different time than when they have to be delivered to the
multiplier.
In order to address this issue, the z can be delayed using a FIFO buffer as shown in Figure 6.5.
Notice that the FIFO is drawn as a green table with a number in the center that represents the number
of elements (and therefore units of time) that the FIFO contains. When writing an element into a FIFO
with length 1, it will come out one unit of time later.
A timing diagram for Figure 6.5 is shown in Figure 6.6. Note that this diagram is identical to Figure 2.8
except for those rows highlighted on the left in yellow.
The pipeline latency has not changed because we can implement it using an adder with three inputs.
The extent to which we can add additional inputs to any type of operational unit depends on the type
of FPGA and versions of the compilers we use.
6.3. TEMPORAL ALIGNMENT
t1 ← x[k]         :   1   2   3   4   5   6
t2 ← z[k]         :   4   5   6   7   8   9
k ← k + 1         :   0   1   2   3   4   5
t3 ← t1 × t1      :   1   4   9  16  25  36
t4 ← t2 × t2      :  16  25  36  49  64  81
t6 ← t2           :   4   5   6   7   8   9
t5 ← t3 + t4 + t6 :  21  34  51  72  97 126
y[i] ← t5         :  21  34  51  72  97 126
i ← i + 1         :   0   1   2   3   4   5
tick              :   0 1 2 3 4 5 6 7 8 9 10
Figure 6.6: Six pipelined iterations of y[i] = x[i]² + z[i]² + z[i] for x = {1, 2, 3, 4, 5, 6} and z = {4, 5, 6, 7, 8, 9}
6.3.2 y = x² + z² + z − x
The intention of this example is to illustrate a situation where a FIFO is required to provide a delay
of more than one element as shown in Figure 6.7 with the corresponding timing diagram in Figure 6.8.
The items added or changed from Figure 6.6 are highlighted on the left in yellow.
Note that along with adding a 2-element FIFO, the increased complexity of the mathematical function
in this example has also changed the pipeline latency from 3 to 4.¹
The end-result still consumes one set of inputs per-tick and provides one output per-tick (once the
pipeline has been filled). But we can start to see the impact of what it means to trade space for time
as the six pipeline-stages that perform actual computation (shown in blue) are starting to be rivaled by
the number of stages that just hold or move data (shown in yellow and green.)
1 Note that the problem presented here is to illustrate the use of FIFOs. Because we can create a 3-input adder, it
would have been more efficient (lower latency) if the solution negated x and then fed it into a 3-input adder rather than
through a FIFO and into a subtractor.
t1 ← x[k]         :   1   2   3   4   5   6
t2 ← z[k]         :   4   5   6   7   8   9
k ← k + 1         :   1   2   3   4   5   6
t3 ← t1 × t1      :   1   4   9  16  25  36
t4 ← t2 × t2      :  16  25  36  49  64  81
t6 ← t2           :   4   5   6   7   8   9
t5 ← t3 + t4 + t6 :  21  34  51  72  97 126
t7 ← t1           :   1   2   3   4   5   6
t8 ← t7           :   1   2   3   4   5   6
t9 ← t5 − t8      :  20  32  48  68  92 120
y[i] ← t9         :  20  32  48  68  92 120
i ← i + 1         :   1   2   3   4   5   6
tick              :   0 1 2 3 4 5 6 7 8 9 10
Figure 6.8: Six pipelined iterations of y[i] = x[i]² + z[i]² + z[i] − x[i] for x = {1, 2, 3, 4, 5, 6} and z =
{4, 5, 6, 7, 8, 9}
There is more to temporal alignment than we have let on so far. Looking at an optimized kernel graph
for x² + x we see that a FIFO is created with a delay of two (rather than one!)
When calculating x² + x, the time it takes for the value of x² to appear at the output of the multiplier
depends on the type of FPGA and data representation used. In this example, x is a 32-bit integer on
an Altera Stratix V FPGA (that is on a Maxeler Icsa board).
Note the number ‘2’ in the FIFO in Figure 6.9. This implies that the multiplier operation is itself
implemented as a two-stage pipeline that is abstracted out of the kernel graph but is shown in the
timing diagram in Figure 6.10.
Note that the actual times for any FIFOs required for temporal alignments only appear in optimized
kernel graphs as their presence and size are dependent on the optimizations performed by the compiler.
x              :  1   2   3   4   5   6   7   8   9
x[−1]          :  1   2   3   4   5   6   7   8   9
x[−2]          :  1   2   3   4   5   6   7   8   9
x²             :  1   4   9  16  25  36  49  64  81
s ← x² + x[−2] :  2   6  12  20  30  42  56  72  90
tick           :  0 1 2 3 4 5 6 7 8 9 10 11 12 13
Figure 6.10: Nine pipelined iterations of s = x² + x for x = {1, 2, . . . , 9}
A
Installing and Using NX
A.1.1 Linux
APPENDIX A. INSTALLING AND USING NX
A.1.2 Windows
A.1.3 Mac
A.2 Setting up NX
5. Leave Unix in the first box and in the second box change KDE to custom.
6. Click on the settings box, change “Run the console” to “Run the following command” and
enter “MaxIDE” into the box.
7. Leave the options as a “floating window”.
8. Create a desktop shortcut if you would like one.
16. This is where you can change the size of the MaxIDE font to a more readable size.
17. Highlight “Text Font” in the box displayed and click on the box that says “Edit...”.
18. Select the font size that is most comfortable and click OK, then apply the results and click
B
Running MaxIDE on VMware
You can download a free VMware Player application for your operating system.
B.1.1 Linux
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0
B.1.2 Windows
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0
B.1.3 Mac
http://www.vmware.com/products/fusion/fusion-evaluation.html
The .vmx file is an image of an entire virtual machine that has been configured to run the Maxeler IDE
in simulation mode on CentOS 6.3.
To run the VM, click on the .vmx file. A Linux system will run, contained in a window. Once started,
you can launch the Maxeler IDE as discussed in Chapter 4.
C
Running MaxIDE on VirtualBox
You can download the free VirtualBox application for your operating system.1
Installations and configuration instructions for Windows, OS X, Linux and Solaris are all available from
Oracle here:
https://www.virtualbox.org/wiki/Downloads
The .vmx file is an image of an entire virtual machine that has been configured to run the Maxeler IDE
in simulation mode on CentOS 6.3.
To run the VM, click on the .vmx file. A Linux system will run contained in a window. Once it has
started you can launch the Maxeler IDE as discussed in chapter 4.
1 At the time of writing, the Maxeler VM version 2015 is known to run on VirtualBox version 5.0.6.
D
Java Resources
MIT offers the lecture notes for a number of courses for free on the web. MIT Course Number 6.092,
Introduction to Programming in Java, may be of interest to the new Java programmer who already has
some programming experience:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-092-introduction-to-programming-in-lecture-notes/
The Seventh Edition of Introduction to Programming Using Java is a free on-line textbook on introductory
programming, which uses Java as the language of instruction. This book is directed mainly towards
beginning programmers, although it might also be useful for experienced programmers who want to learn
something about Java.
http://math.hws.edu/javanotes/
A few Google searches turned up A Primer on Java, which appears to be an easy read and may be of
interest to anyone who hasn't written programs in a while.
https://leanpub.com/aprimeronjava
For further information on the Java language we recommend the following resources:
http://docs.oracle.com/javase/tutorial/java/index.html
http://docs.oracle.com/javase/tutorial/collections/index.html
An overview of the Java "Collections" API, which is used often in MaxCompiler interfaces.
http://docs.oracle.com/javase/6/docs/api/
http://www.java-tips.org/java-se-tips/java.lang/using-the-varargs-language-feature.html
An introduction to using variable-argument methods in Java, which are also common in MaxCompiler
interfaces.
E
Managing Projects With Subversion
Subversion is a version control system: a database that records and tracks changes to files over time.
When configured on a server machine, one can use it to share files between multiple users and machines.
There are many resources on the Internet that discuss how to use Subversion.1 Here we will discuss
accessing and using an existing repository with the MaxIDE.
E.1 Introduction
Given the vintage of the Eclipse version used in the Maxeler IDE, the only version control system
supported out-of-the-box is "SVN".2
We can use Subversion to back up files, to copy them between a virtual machine on a personal laptop
(for development, simulation, and debugging) and hermes.niu.edu (for building and running on the DFE
card), and to hand in homework.
In the simplest of terms, Subversion manages a database called a repository (or repo) that provides
mechanisms for inserting and retrieving files. When one or more files are inserted into the repo, the
action is called a commit or a check-in. When a file is retrieved from the repo, the action is called a
checkout or an update.
The Subversion database contains a copy of every version of every file that was ever committed over the
course of the file's evolution. This means that the database has the ability to retrieve what was committed
to the repo one hour ago, four days ago, 14 months ago, and so on.
Because the repo database has every version of every file, it can also be asked to show what
has changed between two or more versions of the same file. This is very useful when you break
some code and forget what you did since the last working version!
Of course, you have to remember to commit your files every now and then, or else the repo will
not have copies of the evolving versions of your files to make this possible.
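For readers working outside the IDE, the same history queries can be made with the command-line svn client. A sketch follows; the revision numbers are hypothetical, and the file name is borrowed from the example later in this appendix:

```shell
# Show the commit history of a file (revisions, authors, comments).
svn log a1CpuCode.c

# Show what changed in a file between two committed revisions.
svn diff -r 12:14 a1CpuCode.c

# Roll the working copy back to the state it had at revision 12.
svn update -r 12
```

These commands must be run inside a checked-out working copy of the repository.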
To use Subversion, a repository must be present. You may create one named MyRepo in the current
directory by running the following command:
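A repository is created with the standard svnadmin administration tool; a minimal sketch:

```shell
# Create an empty repository named MyRepo in the current directory.
svnadmin create MyRepo
```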
Once created, the repository directory and any files within it should never be touched directly. They
are only accessed using a client application like svn or an IDE such as Eclipse or MaxIDE.
For CSCI 532 a repo has already been created and assigned for your use on hermes.niu.edu. In order
to reference your repo a URL is used that looks like this:
svn+ssh://hermes.niu.edu/home/repos/z1234567
An svn client capable of checking projects into and out of a Subversion repo is built into MaxIDE.
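From a shell, the same URL works with the command-line svn client. A sketch, reusing the placeholder account z1234567 from the URL above (the working-copy directory name myproject is an assumption):

```shell
# Check out a working copy of the repository over ssh.
svn checkout svn+ssh://hermes.niu.edu/home/repos/z1234567 myproject

# Later, send local changes back with a descriptive comment.
cd myproject
svn commit -m "describe the change here"
```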
E.3 Checking a Project Into a Repository
Figure E.4: Enter the URL for the repository and click Next.
At this point you will have created a place for your project in the specified repository. You now have to
proceed to commit the current version of your files.
At this point you should note that each of the files in your repository is displayed with a revision
number indicating when it was last committed to the repository. This number is simply incremented
each time that anything is committed into the repository. As seen in Figure E.9, the revision is 12.
Figure E.9: Eclipse indicates the SVN version next to each file in the Project Explorer.
At this point the project files are all in the repository. Any editing of the files will (as one would expect)
make them out of date with respect to the version in the repository. Eclipse will indicate that a file is
out of date by placing a small brown indicator on the icon that represents the edited file(s), as well as
on the parent directory icons up to the root project level. See Figure E.10.
Figure E.10: Eclipse indicates out of date files with a brown decoration on related Project Explorer
icons.
After making changes to any files, repeat the commit process as described beginning in Figure E.7.
Again, enter a suitable comment before clicking OK. The comments entered when committing files
should accurately describe the changes that were made, so that it is possible to identify and understand
the various versions of the files as they are edited over time. Should a project stop working, an older
version of the file(s) can be recovered from the repository. At times like this, the easiest way to locate
the desired version of a file is to have planned ahead and properly annotated changes as they occurred.
Figure E.11: Committing the project files changes the version number of a1CpuCode.c as seen in the
Project Explorer.
After committing a new version of a file, the revision number will change in the Project Explorer view.
In the example shown in Figure E.11, committing changes to the file named a1CpuCode.c has changed
its revision from 12 to 14, committed at 6:31 PM on July 16th by 'winans'.
Once files are committed/checked into a repository they can then be checked out elsewhere. This feature
can be used to keep copies of a project's files on multiple machines in sync and to hand in your homework.
F
IEEE-754 Floating Point Number
Representation
• Note that the place values for integer binary numbers are:
... 128 64 32 16 8 4 2 1
• We can extend this to the right in binary similar to the way we do in decimal:
... 128 64 32 16 8 4 2 1 . 1/2 1/4 1/8 1/16 1/32 1/64 1/128 ...
Note that the ‘.’ in a binary number is a binary point, not a decimal point.
• Scientific notation, as in 27 × 10^−47, is used for either small fractions or large numbers when we
are concerned with fewer digits than are necessary to represent the entire number.
• The format of a number in scientific notation is mantissa × base^exponent.
• In binary we have mantissa × 2^exponent.
• For simplicity's sake, the IEEE-754 format requires binary numbers to be normalized to
1.significand × 2^exponent, where the significand is the part of the mantissa that is to the right of
the binary point.
• We need not store the leading '1.' because all normalized floating point numbers start that way.
Thus we can save memory by simply remembering that the first bit is always there and is supposed
to be a 1.
bit 31    bits 30..23    bits 22..0
sign      exponent       significand
1         1000 0000      0101 0000 0000 0000 0000 000
• −((1 + 1/4 + 1/16) × 2^(128−127)) = −(1 5/16 × 2^1) = −(1.3125 × 2^1) = −2.625
• −((1 + 1/4 + 1/16) × 2^(128−127)) = −((1 + 1/4 + 1/16) × 2^1) = −(2 + 1/2 + 1/8) = −(2 + .5 + .125) = −2.625
• IEEE-754 formats:
single precision: 1 sign bit, 8 exponent bits (excess-127), 23 significand bits
double precision: 1 sign bit, 11 exponent bits (excess-1023), 52 significand bits
• When the exponent is all ones, the mantissa is all zeros, and the sign is zero, the number represents
positive infinity.
• When the exponent is all ones, the mantissa is all zeros, and the sign is one, the number represents
negative infinity.
• Note that the binary representation of an IEEE-754 number in memory can be compared for
magnitude with another one using the same logic as for comparing sign–magnitude numbers,
because the magnitude of an IEEE number grows upward and downward in the same fashion
as a sign–magnitude integer. This is why we use excess notation for the exponent: numbers with
larger exponents look larger than numbers with smaller exponents even when incorrectly inter-
preted as sign–magnitude integers.
• Note that zero is a special case number. This is why the exponent of all–zeros is not used to
represent the smallest possible exponent value. Zero is represented by an exponent of all–zeros
and a mantissa of all–zeros. This allows for a positive and a negative zero if we observe that the
sign can be either 1 or 0.
• On the number line, numbers between zero and the smallest representable fraction in either
direction are in the underflow areas.
• On the number line, numbers whose magnitude exceeds that of a mantissa of all ones with the
largest allowed exponent are in the overflow areas.
• Note that numbers have a higher resolution on the number–line when the exponent is smaller.
F.1 Floating Point Number Accuracy
Due to the finite number of bits used to store the value of a floating point number, it is not possible to
represent every one of the infinite values on the real number line. The following C programs illustrate
this point.
Just like the integer numbers, the powers of two that have bits to represent them can be represented
perfectly. . . as can their sums:
#include <stdio.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = 1.0;
    while (x.f > 1.0/1024.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f/2.0;
    }
}
When dealing with decimal values, you will find that they don’t map simply into binary floating point
values (the same holds true for binary integer numbers).
Note how the decimal numbers are not accurately represented as they get larger. The decimal number
10 can be perfectly represented in IEEE format. The problem that arises after the 11th loop iteration
is not because the prior number was not multiplied by 10; it is due to the fact that the prior number
cannot be represented accurately in IEEE format. Therefore its least significant bits were truncated in
a best-effort attempt at rounding the value off. Once this happens, the value of x.f may not be what a
programmer expects.
#include <stdio.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = 10;
    while (x.f <= 10000000000000.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f*10.0;
    }
}
This effect of rounding errors can be exaggerated if the number we combine with the x.f value is itself
something that cannot be accurately represented in IEEE form. If we add 1/10 to our x.f value each
time, we can never be accurate and we start accumulating errors immediately.
#include <stdio.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = .1;
    while (x.f <= 2.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f += .1;
    }
}
In order to use floating point numbers in a program without causing undesirable results, consider
redesigning your algorithm so that an accumulation of errors is eliminated. This example is similar to
the previous one, but this time we recalculate the desired value from known-accurate integer values.
Thus we might see some rounding errors, but they cannot accumulate.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;
    int i;

    i = 1;
    while (i <= 20)
    {
        x.f = i/10.0;
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        i++;
    }
    return 0;
}
Bibliography
[1] Maxeler Technologies, Multiscale Dataflow Programming, Feb 2014. Version 2013.3b.
[2] T. Starnes and J. Handy, “Intel really set to buy Altera FPGAs,” Electronic Design, Jun 2015.
http://electronicdesign.com/fpgas/intel-really-set-buy-altera-fpgas.
[3] M. Parker, “Understanding peak floating-point performance claims,” tech. rep., Altera Corporation,
Jun 2014. http://design.altera.com/HFP_White_Paper.
[4] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing ca-
pabilities,” in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67
(Spring), (New York, NY, USA), pp. 483–485, ACM, 1967.
[5] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The
Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design).
San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 4th ed., 2008.
[7] J. W. Davidson and C. W. Fraser, “The design and application of a retargetable peephole optimizer,”
ACM Trans. Program. Lang. Syst., vol. 2, pp. 191–202, Apr. 1980.
[8] A. S. Tanenbaum and J. R. Goodman, Structured Computer Organization. Upper Saddle River,
NJ, USA: Prentice Hall PTR, 4th ed., 1998.
[9] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM J. Res.
Dev., vol. 11, pp. 25–33, Jan. 1967.
[10] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools
(2nd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2006.
[11] Maxeler Technologies, MaxCompiler Manager Compiler Tutorial, 2013. Version 2013.3f.
[12] Maxeler Technologies, MaxCompiler Kernel Numerics Tutorial, 2013. Version 2013.3f.
[13] Maxeler Technologies, Acceleration Tutorial Loops and Pipelining, 2013. Version 2013.3f.
[14] Maxeler Technologies, Dataflow Programming for Networking, 2013. Version 2013.3f.
[15] Maxeler Technologies, MaxCompiler State Machine Tutorial, 2013. Version 2013.3f.
[16] A. Einstein, “The foundation of the general theory of relativity,” Annalen Phys., vol. 49, pp. 769–
822, 1916.
[17] C. G. Bell and A. C. Newell, Computer Structures: Readings and Examples (McGraw-Hill Computer
Science Series). McGraw-Hill Pub. Co., 1971.
[18] J. A. Darringer, The Description, Simulation, and Automatic Implementation of Digital Computer
Processors. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1969. AAI6919088.
Index

A
ALU, 6
Application Accelerator, 13
Arithmetic Logic Unit, see ALU

C
CPUCode, 32

D
Data Flow Engine, see DFE
DFE, 13

E
EngineCode, 33

F
Field Programmable Gate Array, see FPGA
FPGA, 1, 13

K
Kernel, 33, 45
Kernel Graph, 9

L
loop unrolling, 7

M
Manager, 33
MaxCompiler, 14
MaxJ, 13

O
OpenSPL, 13
Out-of-order execution, 8

P
Pipeline, 9
  Fill, 10
  Flush, 10
pipeline, 4

R
register, 2
Register Transfer Language, see RTL
RTL, 6

S
Simple Live CPU, see SLiC
SLiC, 14
Stream, 16
Subversion, 65
SVN, see Subversion

T
Timing Diagram, 6

V
Virtual Machine, see VM
VirtualBox, 61
VM, 59, 61
VMware, 59

W
Waveform Diagram, see Timing Diagram