On Dataflow Computing With OpenSPL
(Draft v1.0-79-g838c34b)
John Winans
[email protected]
No parts of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior
written permission of the publisher.
The authors and publisher have made every effort in the preparation of this book to ensure the accuracy
of the information. However, the information contained in this book is provided without warranty, either
express or implied. Neither the authors, the publisher nor its distributors will be held liable for any
damages caused or alleged to be caused directly or indirectly by this book.
The authors and publisher have made every effort to provide trademark information about all the
companies and products mentioned in this book by the appropriate use of capitals. However, the
accuracy of this information cannot be guaranteed.
Altera® and Stratix® are either registered trademarks or trademarks of Altera Corporation in the
United States and/or other countries.
ARM® , ARM Powered® , AMBA® , ARMulator® , Cortex® , Jazelle® , Multi–ICE® , StrongARM® ,
Thumb® , and TrustZone® are the registered trademarks of ARM Limited in the EU and other countries.
AHB™, APB™, ARM9T™, ARM9TDMI™, ARM922T™, ARM1022E™, ASB™, ATB™, AXI™, CoreSight™, ETM9™, ETM10™, ModelGen™, MPCore™, NEON™, PrimeCell™, and VFP10™ are the trademarks of ARM Limited in the EU and other countries.
Microsoft® and Windows® are either registered trademarks or trademarks of Microsoft Corporation in
the United States and/or other countries.
Apple®, Macintosh®, Mac OS® and Safari™ are either registered trademarks or trademarks of Apple Computer, Inc. in the United States and/or other countries.
Oracle® , VirtualBox® and Java are registered trademarks of Oracle and/or its affiliates.
Adobe, the Adobe logo, Acrobat, the Acrobat logo, Distiller, PostScript, and the PostScript logo are
trademarks or registered trademarks of Adobe Systems Incorporated in the U.S. and/or other countries.
Intel, Intel Core, and Xeon are trademarks of Intel Corp. in the U.S. and other countries.
OpenGL® is a registered trademark of Silicon Graphics, Inc.
UNIX® is a registered trademark of The Open Group.
OpenSPL is a trademark of Maxeler Technologies Limited.
X Window System is a trademark of the X Consortium, Inc.
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Other company and product names mentioned herein are trademarks of their respective owners. Mention
of third-party products is for informational purposes only and constitutes neither an endorsement nor a
recommendation.
Page ii of 81 ~/NIU/courses/532/2015-fa/book/openspl/./book.tex
[email protected] 2015-10-20 15:01:52 -0500 v1.0-79-g838c34b
Contents
1 Introduction 1
1.1 The FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 HDL Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 The Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Some Words From The Marketing Department . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Dataflow Computing 5
2.1 An Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Problem Solving With a CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Problem Solving With Pipelined Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 The Kernel Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Timing Diagrams for y = x² + z² . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 OpenSPL Basics 13
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 An Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 CPU Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 MaxJ Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.4 Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Maxeler IDE 19
4.1 Accessing the OpenSPL Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.1 Installation of VM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 hermes.niu.edu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Maxeler First Time Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Set up Your Workspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2.1 Documentation Available in the IDE . . . . . . . . . . . . . . . . . . . . 21
4.2.3 Importing an Example Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 MaxIDE Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6 The Kernel 45
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Widening the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 Overloading a Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.2 An N-fold kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Temporal Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3.1 y = x² + z² + z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3.2 y = x² + z² + z − x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.3 Reality Check (There’s a Pipeline in my Pipeline!) . . . . . . . . . . . . . . . . . . 52
D Java Resources 63
D.1 Web Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
D.1.1 MIT OpenCourseware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
D.1.2 Introduction to Programming Using Java . . . . . . . . . . . . . . . . . . . . . . . 63
D.1.3 A Primer on Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
D.2 References From Multiscale Dataflow Programming[1] . . . . . . . . . . . . . . . . . . . . . 64
Bibliography 80
Index 81
1
Introduction
1.1 The FPGA

An FPGA (Field Programmable Gate Array) is a type of integrated circuit (as opposed to a computing system consisting of many parts) that, as its name implies, can be programmed to perform various functions.
The fact that it is field programmable means that it can be programmed after it has left the factory.
Being a gate array, it is programmed by specifying the manner in which its gates are to be interconnected.
1.1.1 Evolution
Over the years FPGA manufacturers have improved upon the operations that the so-called “gates” can
perform to the point where the more advanced devices are far from containing just simple logic gates.
In spite of the continued presence of the word “gate” in their name, an FPGA is an array of CLBs
(Configurable Logic Blocks) that range from simple logic to complex truth tables (called LUTs) and
registers (simple memories), and a plurality of other types of Hard IP¹ such as mathematical units of various types, specialized control units (for accessing large memories), communication units (for Ethernet links, PCIe, and others), and even whole multicore CPUs such as the SPARC or ARM.
A simplified CLB block diagram is shown in Figure 1.1. The LUT contains a truth table with as many rows as its input bits can enumerate. The clock signal and associated D-latch comprise a register that is used to store/remember the last value that was “looked up” in the truth table.² Note that a LUT and a latch each take a nonzero period of time to emit a result after their input bit(s) have changed.
Programming an FPGA consists of specifying 1) the values of the bits in the LUTs and 2) a network map (referred to as a netlist) that describes which signals (the bits) flow out of one block (a CLB or some Hard IP) and into another. Ultimately the signals originate on some of the pins of the FPGA chip and terminate at others which are, in turn, connected to other devices such as an Ethernet link, large memory chip(s) and/or the PCIe signals in a PC so that the FPGA can interact with the rest of the world and perform useful work.
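As a concrete, highly simplified sketch, a 2-input LUT is nothing more than a 4-entry truth table, and “programming” it means choosing the stored bits. The function modeled below (XOR) and the helper names are illustrative assumptions, not taken from any particular FPGA:

```c
/* A 2-input LUT modeled as a 4-entry truth table, indexed by the
   two input bits.  These particular bits implement XOR. */
static const int lut_xor[4] = {0, 1, 1, 0};

/* "Look up" the output for inputs a and b (each 0 or 1). */
static int lut_eval(const int lut[4], int a, int b)
{
    return lut[(a << 1) | b];
}
```

Choosing different bits for the table, rather than different hardware, is what makes the same silicon implement AND, OR, or any other 2-input function.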
Since the late ’80s, languages such as Verilog and VHDL, both known as HDLs (Hardware Description Languages), have been used to program FPGAs. These languages are akin to using an assembly language to program a CPU. (If we continue this analogy downward, then creating a netlist by hand would be akin to typing in CPU machine code in binary.) While assembly code is necessary for some specific functions and can often result in the most efficient execution of a program on some CPUs, the
¹ Hard IP (Intellectual Property) refers to commonly used functions that might have been historically programmed into the FPGA by using multiple CLBs, but are more efficiently built by dedicating part of the silicon of the chip to a specific purpose.
² Actual CLBs include additional components like a Full Adder because implementing them using LUTs would be inefficient.
additional effort required and lack of portability often drive programmers to use simpler, higher-level languages like C or Java. . . at the expense of (possibly) ending up with slower-performing code.
Where FPGA programming is concerned, the next “higher-level” languages are being invented and discovered right now. One such language is called OpenSPL, and it is the subject of the rest of this book. OpenSPL is expressed as a combination of C and Java.
The applications discussed in this book were designed to execute on a Maxeler DFE (Data Flow Engine) board. A DFE is what is generally known as a co-processor or application accelerator because it is connected to a traditional computer³ in the form of a peripheral device, similar to a hard drive or audio interface, as seen in Figure 1.2.
As a PC peripheral device, the DFE board is connected to the PCIe (Peripheral Component Interconnect Express) bus. The PCIe bus is a set of high-speed serial lanes. The DFE upon which this text focuses is the Maxeler ICSA. The ICSA has eight lanes.
As an eight-lane PCIe device, the ICSA DFE can exchange up to eight simultaneous streams of data
with the main memory of a host PC. These streams represent one of the types of I/O that an OpenSPL
program can use. Other types of I/O include various types of memory and serial interfaces such as
Ethernet that can be connected directly to an FPGA.
Note that from the perspective of an FPGA even memory starts to “look” and act like a peripheral
device in that it requires the application to read from and write to it!
Each of the (on the order of) 1,000,000 CLBs operates independently, providing a great deal of fine-
grained parallelism.
³ For the sake of completeness, it is important to point out that stand-alone FPGA applications do not require connection to a computer. Other applications are implemented using FPGAs that include an entire CPU within the FPGA.
By connecting CLBs together to create complex functions and connecting the output of one function to the input of another, one or more chains or pipelines can be created that can receive/read one or more streams of data, process them in some way and then transmit/write one or more resulting streams of data. OpenSPL is well suited for implementing solutions to such “streaming” applications.
1.2 Some Words From The Marketing Department

Stratix® 10 FPGAs and SoCs combine the industry’s highest performance (2X), and highest density (5.5M LEs) with advanced embedded processing capabilities (quad-core ARM® Cortex®-A53), GPU-class floating-point computation performance of up to 10 tera floating-point operations per second (TFLOPS), heterogeneous 3D system-in-package (SiP) integration, and the most advanced security capabilities in a high-performance FPGA.
Searching the Internet for maximum performance numbers on Intel processors is tough since there are so many variations available. As of Q4 2014, it appears that the fastest Intel CPUs are capable of approximately 1 TFLOPS.
Keep in mind that all of these “maximum speeds” are theoretical and are not likely to be achieved unless one is extremely careful about designing and writing code to suit the needs of each of the specific devices.
2
Dataflow Computing
A fairly obvious conclusion which can be drawn at this point is that the effort expended on
achieving high parallel processing rates is wasted unless it is accompanied by achievements
in sequential processing rates of very nearly the same magnitude.[4]
—Gene M. Amdahl, 1967
IBM
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.¹
Slotnick’s Law:
The parallel approach to computing does require that some original thinking be done about
numerical analysis and data management in order to secure efficient use.
In an environment which has represented the absence of the need to think as the highest
virtue this is a decided disadvantage.
—Daniel Slotnick, 1967
Chief Architect
Illiac IV
It is the purpose of this text to discuss some original thinking about numerical analysis and data
management while keeping an eye on the requirements of sequential processing in order to maximize the
performance of an application.
¹ See [5, Section 7.12] for a discussion of the pitfalls of improperly interpreting Amdahl’s Law.
2.1 An Example Problem

As a CPU iterates over the body of the loop, multiple operations take place to compute the right side of the assignment statement; they must all complete before the assignment is made on line 6, and the assignment must complete before proceeding to the next iteration of the loop.
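The loop listing itself is not reproduced in this extraction; a minimal C version consistent with the RTL that follows (and with the assignment falling on line 6) might look like this sketch. The function and array names are assumptions:

```c
/* Sketch of the example problem: y[i] = x[i]^2 + z[i]^2 for each i. */
void square_sum(const float *x, const float *z, float *y, unsigned n)
{
    unsigned i;
    for (i = 0; i < n; i++)
        y[i] = x[i] * x[i] + z[i] * z[i];
}
```

Every discussion in this chapter revisits this same computation, first serially on a CPU and then as a pipeline.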
2.2 Problem Solving With a CPU
Focusing only on the body of the loop, we know that a CPU will perform the following operations one at a time (described in Figure 2.1 using RTL (Register Transfer Language[7]) notation).²
t1 ← x[i]
t1 ← t1 × t1
t2 ← z[i]
t2 ← t2 × t2
t2 ← t1 + t2
y[i] ← t2
i ← i + 1

Figure 2.1: RTL for one iteration of the loop body of y[i] = x[i]² + z[i]².
Using a timing diagram3 we can see how and when the ALU (Arithmetic Logic Unit) and the
memory interface units of the CPU are used over the course of time while the CPU executes the RTL in
Figure 2.1. A timing diagram shows what operations take place in each functional unit over a continuum
of time. When a unit is performing a useful task, such as squaring the number a, that particular operation
² RTL is commonly used as an intermediate language in compilers.
³ For more information on timing diagrams see: http://en.wikipedia.org/wiki/Digital_timing_diagram
is indicated with a × a. Note that, as depicted here, any input(s) to an operation are read/sampled once at the beginning of the time period; the results of the operation are provided at the end and are held stable until the next output value is generated. An idle unit is indicated with a gray time period.⁴ The width of an item in a timing diagram is proportional to the amount of time used to execute the specified operation. The position on the diagram’s horizontal axis represents the span of time over which the operation takes place.
The background colors in the timing diagrams in this book have been chosen to indicate the type of operation being performed. ALU operations are displayed in blue; memory transfers, in amber.
Consider the simplest case, where a CPU can only do one thing at a time, and let us assume that each of the 7 operations in our loop body takes the same amount of time to complete. Each iteration of the loop body results in the execution of the same operations in the same order. Figure 2.2 is a timing diagram showing how the first two iterations of our loop perform a total of 7 × 2 = 14 operations, each consuming one unit of time. That our example CPU can only do one thing at a time is made evident by the fact that only one of the units is not idle at any point in time.
Figure 2.2: Two serial iterations of y[i] = x[i]² + z[i]²; one operation at a time over 14 time units.
If a CPU is capable of exchanging data with memory at the same time that it is performing an operation with its ALU, and it can “look ahead” in the instruction stream, then it is possible for it to optimize the use of its functional units by scheduling more than one at the same time.[9] As long as care is taken to ensure that the data required for any given operation is present when the operation starts, operating the units in parallel will reduce the time that it takes to execute the body of our loop from 7 to 5 units per iteration. Figure 2.3 shows how a total of 5 × 2 = 10 units of time are used to complete two iterations of our loop when the CPU schedules its functional units in parallel. The performance improvement is evident as the same 14 total units of time are allocated to the same 14 operations. The only difference is when they have been scheduled to take place.
Figure 2.3: Two iterations of y[i] = x[i]² + z[i]² with memory transfers overlapped with ALU operations (10 time units).
We can, however, change the RTL to better suit our needs. Using a technique called loop unrolling [10, p. 735], we can rewrite our program showing the iterations of our loop in the form of one long instruction stream. To better illustrate what is happening, we will now consider three iterations of our loop body and introduce an additional variable k that we will use along with i as our index counter, as shown in Figure 2.4.
⁴ The notion of any part of a machine being “idle” is a misnomer. Unless the power is removed, nothing actually stops per se. When used in the context of a timing diagram or pipeline, idle literally indicates that the specified unit’s activities are not consequential because its output will go unused during the indicated idle period. As an optimization, modern processors will dispatch specific instructions that are known to consume the least amount of power during such idle periods. It is easy to demonstrate the results of this by detecting the temperature (and fan speed) changes of a laptop when its activity changes from idle to busy.
Iteration 1     Iteration 2     Iteration 3
k ← i + 1       i ← k + 1       k ← i + 1
t1 ← x[i]       t1 ← x[k]       t1 ← x[i]
t1 ← t1 × t1    t1 ← t1 × t1    t1 ← t1 × t1
t2 ← z[i]       t2 ← z[k]       t2 ← z[i]
t2 ← t2 × t2    t2 ← t2 × t2    t2 ← t2 × t2
t2 ← t1 + t2    t2 ← t1 + t2    t2 ← t1 + t2
y[i] ← t2       y[k] ← t2       y[i] ← t2

Figure 2.4: Three iterations of the RTL unrolled-loop version of y[i] = x[i]² + z[i]².
Using two counter registers (i and k) instead of one (i), and interleaving which counter is used during each of the original loop bodies, we can now see how the operations can be reordered to better suit the capabilities of our two-unit CPU. Advanced CPUs are capable of looking far enough ahead in the instruction stream to implement this type of optimization using what is called out-of-order execution.[9] The CPU can now use the ALU during the first time unit in each loop body to perform the counter increment for the next loop body, and then it can relocate the t1 ← x[i] load. As shown in Figure 2.5, out-of-order execution allows the first iteration of our loop to complete in 5 units of time and the remaining iterations in 4, completing 3 loops in less time than it took the original CPU design to do 2. The cost for this optimization is that the CPU has to allocate an additional register.
Using an extra register as a way to save time is a trade-off between space (silicon to make/use another register) and time (additional cycles required to perform operations that can not be scheduled to occur at other times).
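At the C level, the same space-for-time trade can be sketched as a loop unrolled two iterations at a time, mirroring the alternating i/k index registers of Figure 2.4. This is an illustrative sketch (n is assumed even for brevity):

```c
/* Two iterations per pass: the i-indexed body and the k-indexed body
   have no dependency on each other, so a CPU (or compiler) is free
   to interleave their loads, multiplies, and stores. */
void square_sum_unrolled(const float *x, const float *z,
                         float *y, unsigned n)
{
    for (unsigned i = 0, k = 1; k < n; i += 2, k += 2) {
        y[i] = x[i] * x[i] + z[i] * z[i];
        y[k] = x[k] * x[k] + z[k] * z[k];
    }
}
```

The unrolled form exposes the independence of adjacent iterations that out-of-order hardware exploits automatically.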
Figure 2.5: Three iterations of out-of-order RTL-parallel execution of y[i] = x[i]² + z[i]².
The ALU is now saturated with work. Therefore we have gone as far as we can. . . with a CPU that has only two functional units.
Further optimization would require that we either eliminate some of the operations or add more functional
units to further distribute the work. For example we could add an additional ALU and another path to
access the memory in the system.
The extent to which adding functional units is helpful depends on how many of the operations must be
completed before others can begin as well as the ability of the CPU to schedule the instructions across
all of the units in an efficient manner.
While all of this is possible, some types of problems are more easily optimized by using a pipeline than
a CPU.
2.3 Problem Solving With Pipelined Computing
Amdahl’s law states that gaining efficiency by performing operations in parallel is limited by the amount of time that is required to execute the longest single serial task. Serial, in this context, refers to the dependency path through the sequence of events that must be performed in order.
2.3.1 The Kernel Graph

To better understand the dependency path in our application, let us express it in the form of a network where edges represent dependencies and nodes represent operations. Let us call our diagram a Kernel graph and draw it as shown in Figure 2.6. The multiplication units (the top two blue circles) have their input data flowing in from the x and z (inverted orange house) data-source units and their results flowing out to an adder unit (the bottom blue circle) that, in turn, has its output flowing to the y (orange house) data-sink unit. This notation was adopted from [1, p. 23].
Figure 2.6: The Kernel graph for y = x² + z².
Expressed in this form, we can see that the x and z data items can (theoretically) be fetched simultaneously because they have no direct or indirect dependencies shown in the graph. The two multiplication operations can also happen at the same time, as long as they have data delivered from x and z. The addition can not start until after both multiplications have completed, because the data items that the addition requires flow in from the multipliers. Finally, the delivery of the sum to the y data-sink can not start until the addition has completed.
At first glance it appears that 4 units of time is the best we can do. But, as Slotnick pointed out, “some original thinking” might offer additional opportunity.
2.3.2 Timing Diagrams for y = x² + z²

Allocating and dedicating a functional unit for each node in our kernel graph and parallelizing them might
yield a machine that operates as shown in Figure 2.7 for the case when x = {1, 2, 3} and z = {4, 5, 6}. For now let us continue to assume that i is (somehow) initialized to 0.
t1 ← x[i]:      1, 2, 3
t2 ← z[i]:      4, 5, 6
t1 ← t1 × t1:   1, 4, 9
t2 ← t2 × t2:   16, 25, 36
t2 ← t1 + t2:   17, 29, 45
y[i] ← t2:      17, 29, 45
i ← i + 1:      1, 2, 3
(time units 0–12)
Figure 2.7: A naive pipelined implementation of y[i] = x[i]2 + z[i]2 for x = {1, 2, 3} and z = {4, 5, 6}.
Our timing diagram now has a stair-step characteristic due to the chain of dependencies in the operations that can not be parallelized.
Note that our notation for labeling the timing diagram has changed. Since each “row” now represents a single-purpose dedicated functional unit, we can indicate the one and only operation that each performs along the left edge of our diagram, as opposed to within the time periods as was the case earlier, when each of the units was used for different operations at different times. We can take advantage of this situation by indicating the values of the data that are output by each functional unit at every point in time.
However, by trading space for time and adding some more temporary registers, we can see that the stair-steps can be collapsed as shown in Figure 2.8. This time we assume that i and k are initialized to 0 and that x = {1, 2, 3, 4, 5, 6, 7, 8, 9} and z = {4, 5, 6, 7, 8, 9, 10, 11, 12}.
This time we use a separate counter (k) for the x and z inputs than we do for the y output (i) because, while they count the same things, they now have to do so at different times. We have also added enough temporary registers so that there is now one for every edge in our Kernel graph.
By not reusing any registers for more than one specific purpose, we have eliminated the need for
any functional unit to wait on any other unless the two have a problem-specific data-dependency
(represented by an edge in the Kernel graph) between them.
We can now see that the first “iteration” of our loop takes 4 time units and the rest each take 1. The
first 3 time units in Figure 2.8 represent what is called filling the pipeline. The last 3 time units
represent what is called flushing the pipeline.
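Figure 2.8 can be checked with a small software model of the pipeline. Each simulated “time unit” latches every register from the previous unit’s values, so one result emerges per unit once the 3-unit fill completes. This sketch is for illustration only; it is not how a DFE is actually programmed:

```c
/* Software model of the 4-stage pipeline of Figure 2.8: the write
   stage consumes t5 before every register latches its next value. */
static int pipeline_model(const int *x, const int *z, int *y, int n)
{
    int t1 = 0, t2 = 0, t3 = 0, t4 = 0, t5 = 0;   /* pipeline registers */
    int k = 0, i = 0, tick;

    for (tick = 0; i < n; tick++) {
        if (tick >= 3)                  /* pipeline is full after 3 units */
            y[i++] = t5;
        /* each stage computes from the previous tick's register values */
        int n1 = (k < n) ? x[k] : 0;    /* fetch stage */
        int n2 = (k < n) ? z[k] : 0;
        int n3 = t1 * t1;               /* multiply stage */
        int n4 = t2 * t2;
        int n5 = t3 + t4;               /* add stage */
        t1 = n1; t2 = n2; t3 = n3; t4 = n4; t5 = n5;
        k++;
    }
    return tick;                        /* total time units consumed */
}
```

Run on x = {1, …, 9} and z = {4, …, 12}, this reproduces Figure 2.8’s outputs (17, 29, 45, …, 225) in 9 + 3 = 12 time units.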
t1 ← x[k]:      1, 2, 3, 4, 5, 6, 7, 8, 9
t2 ← z[k]:      4, 5, 6, 7, 8, 9, 10, 11, 12
k ← k + 1:      1, 2, 3, 4, 5, 6, 7, 8, 9
t3 ← t1 × t1:   1, 4, 9, 16, 25, 36, 49, 64, 81
t4 ← t2 × t2:   16, 25, 36, 49, 64, 81, 100, 121, 144
t5 ← t3 + t4:   17, 29, 45, 65, 89, 117, 149, 185, 225
y[i] ← t5:      17, 29, 45, 65, 89, 117, 149, 185, 225
i ← i + 1:      1, 2, 3, 4, 5, 6, 7, 8, 9
(time units 0–12)

Figure 2.8: A properly pipelined version of y[i] = x[i]² + z[i]² for x = {1, 2, 3, 4, 5, 6, 7, 8, 9} and z = {4, 5, 6, 7, 8, 9, 10, 11, 12}.
2.4 Observations
1. The more data elements we process, the greater the advantage gained by our pipelined implementation due to amortization of the pipeline fill and flush costs. Therefore: the complexity as n → ∞ is O(n).
2. The complexity of the function implemented will determine the number of stages required in the
pipeline.
3. The number of stages in the pipeline will define the latency in our design. Latency is the amount
of time between the arrival of data element(s) at the input unit(s) and the corresponding result
leaving the output unit(s).
4. The duration of one time unit is equal to the latency divided by the number of stages in our
pipeline.
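These observations can be quantified with a small worked example (a sketch; s denotes the number of pipeline stages, τ the duration of one time unit, and L the latency):

```latex
T_{\text{pipe}}(n) = (n + s - 1)\,\tau, \qquad \tau = \frac{L}{s}
```

For Figure 2.8, s = 4 and n = 9, so the pipeline finishes in 9 + 3 = 12 time units, versus 7 × 9 = 63 for the one-unit CPU of Figure 2.2; the cost per element approaches one time unit as n grows, hence O(n).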
3
OpenSPL Basics
This chapter provides a brief overview of OpenSPL as a system for creating pipelines for dataflow
computing.
With a basic understanding one can navigate the Maxeler IDE, run example programs, write a “Hello
World!” application, simulate its execution and deploy applications on a DFE (Data Flow Engine).
3.1 Introduction
OpenSPL is an open specification for a Spatial Programming Language. The operations of a spatial
program exists in space rather than as a sequence of operations over time. This means that all of the
operations of program can happen at once and that the notion of what it means to execute a program
is more about getting the data in and out of the system as opposed to the sequence of events that take
place in typical procedural languages.
OpenSPL applications tend to manifest themselves in the form of one or more pipelines that are deployed
using an FPGA (Field Programmable Gate Array) that, when acting in this fashion, is referred to
as a DFE.
An application that uses a DFE to improve its performance does so by using it as an application accelerator. In doing so, the code that executes sequentially and that which executes spatially are written using two different styles and languages.
The code that runs sequentially can be written in a language like C and runs on a CPU in the manner to which any C programmer is accustomed. The code that runs spatially is written in a variation of Java called MaxJ and ultimately runs on an FPGA on a DFE.
The coordination of compiling everything can be performed by the MaxIDE (Eclipse) and consists of
executing the MaxCompiler to compile the MaxJ code into a .max file suitable for configuring an FPGA, and a standard C compiler to create an executable program for the CPU. See Figure 3.1.
When the C application runs it can implicitly or explicitly configure and use the DFE with one or more
.max files to process data streams.
CPU code can call functions that are generated by the MaxCompiler. Supported languages include C,
Python, Matlab and R. This text will focus on the use of C code on the CPU.
The CPU code is no different than any other application you might write. The CPU portion of an OpenSPL application requires adding as little as one #include and one function-call statement to exchange data streams with the DFE.
To the C code, the data streams exchanged with the DFE are regions of memory like an array or a buffer created by calling the malloc(3) library function.
The above-mentioned header file to include and the function(s) to call to use the DFE are generated by
the MaxCompiler when it builds the MaxJ files. The generated function(s) use the SLiC (Simple Live
CPU) library interface that is part of the OpenSPL system. The SLiC interface provides the low-level
services needed for the CPU to configure and exchange data with one or more DFEs.
3.2. AN ACCELERATOR ARCHITECTURE
• One or more Kernels (pipelines) that are responsible for processing data streams.
• A single Manager that tends to the movement of data between the host system memory,
Kernels, State Machines and memories on the DFE.
The MaxJ code is written in an extended version of Java which adds operator overloading to the base
Java language. MaxJ source files have a .maxj file extension to differentiate them from pure Java.[1,
p. 20] The operator overloading makes mathematical operations used in Kernel pipelines easier to
express.
There is a subtlety hiding in the box labeled Hardware Build or Simulation in Figure 3.1. As
part of the build process, your MaxJ/Java application is actually executed. The output of the
Java application is used to create the netlist that is ultimately deployed on the DFE.
The ultimate output of the MaxCompiler is a .max file that contains the executable DFE code and a
.h file that contains generated function declarations and constants required to compile the CPU code.
Thus the MaxJ code is compiled before the C code because the MaxJ code is the origin of that which
defines the interface to the DFE code.1
The life-cycle of the DFE code is similar to that of any executable program. . . as long as the responsibilities
of the operating system are taken into account: [1, p. 112]
1. Load - A .max file is loaded onto a DFE by the CPU code. The DFE card is now exclusively owned
by the calling CPU process. Loading the .max file takes on the order of 100 ms to 1 s.
2. Run - The CPU code calls SLiC functions to execute actions on the DFE. A loaded .max file should
be utilized for long enough to justify having waited up to a second to load it.
3. Unload - The DFE is released by the CPU process, which returns it to the pool of DFEs managed
by MaxelerOS for use by other applications.
The Basic Static² SLiC interface implicitly loads the .max file onto the DFE when the first SLiC
function is called, and then unloads the DFE and frees the .max file when the CPU code terminates.
This means that your application will stall if/when the DFE card(s) are in use by another application
until that application terminates (or otherwise explicitly releases the DFE).
Note that a single application may serially reuse one DFE card, handling the loading and unloading
of multiple .max files via the SLiC API.
1 This creates a chicken-and-egg problem when it comes to writing the CPU application because the order of the
arguments in the DFE-generated functions is not known until the MaxJ code has been compiled. To address that problem,
stub in a call to the function with its parameters left out and compile the application; it will fail on the call with incorrect
arguments. Then look at the generated header file (or use the IDE ‘insight’ feature to see what the arguments are), add
them and recompile. Empirical evidence suggests that the ordering is reliably reproducible and sorted alphabetically by type.
2 One of three SLiC interfaces discussed in ??
Once the code for Kernels and the Manager are combined they form a complete dataflow program.
The execution of this program results in either the generation of a dataflow engine configuration
file (.max file), or the execution of a DFE simulation. In either case, the MaxCompiler always
generates an include file to go with a .max file.[1, p. 31]
3.2.3 Kernel
An OpenSPL application will contain one or more Kernels. A Kernel that implements the Kernel graph
shown in Figure 2.6 would contain logic like that shown in Listing 3.1.
Listing 3.1: KernelBody.maxj
The body of a simple kernel.
1 DFEVar xs = io.input("x", dfeFloat(8, 24));   // A float stream called x
2 DFEVar zs = io.input("z", dfeFloat(8, 24));   // A float stream called z
3
4 DFEVar ys = xs * xs + zs * zs;                // y = x^2 + z^2
5
6 io.output("y", ys, dfeFloat(8, 24));          // A float stream called y
As can be seen the Kernel defines the name and type of data for each stream that it will process along
with the operations it will perform on the data stream. In this case it will sum the squares of the
elements in the x and z input streams and write the result to an output stream named y.
The data type of each of the three streams is identical and set to dfeFloat(8, 24). This is the
OpenSPL way of defining what would appear in a C program as a float.
3.2.4 Manager
An OpenSPL application will contain one and only one Manager. The Manager coordinates the data
flow between the CPU, Kernels, the DFE’s memory and other devices depending on the particular type
of DFE card(s) in the system.[1, p. 20] Each of these dataflows is called a stream.
The simplest of all Managers is one that connects all of the I/O defined in a single Kernel to the CPU and
is shown in Listing 3.2.[11, p. 41]
Listing 3.2: SimpleManager.maxj
The simplest of Managers.
1 public static void main(String[] args)
2 {
3     EngineParameters params = new EngineParameters(args);
4     Manager manager = new Manager(params);
5     Kernel kernel = new SimpleKernel(manager.makeKernelParameters());
6     manager.setKernel(kernel);
7     manager.setIO(IOType.ALL_CPU);
8     manager.createSLiCinterface();
9     manager.build();
10 }
This Manager makes boiler-plate calls to initialize the OpenSPL environment in lines 3 and 4.
The Kernel is created in line 5 and the default parameters are passed to the Kernel object’s constructor.
All of the scalar and stream I/O variables are routed to the CPU application in line 7. This means that
they will appear in the generated C-callable function in the generated .h file and will be named based
on how the Kernel named them in the io.input() and io.output() calls such as those in Listing 3.1
on lines 1, 2 and 6.
4
Maxeler IDE
This chapter will present the Maxeler IDE to orient the reader before writing a first program.
There are two ways to access the Maxeler IDE at NIU. Note that some documentation resources mention
the availability of a web-based IDE. This is not available at NIU.
4.1.1 Installation of VM
Download and install the Maxeler VM from the University program web site and execute it using a lab
PC or your own. In this configuration you will be limited to developing and testing applications
using a simulation environment.
As you will see, the simulation environment is where you will do the majority of your work. You will
want to use this.
See Appendix B for details on installing and using VMware on Linux, Mac and Windows systems.
4.1.2 hermes.niu.edu
Using the software on hermes.niu.edu will allow for deployment of applications on real DFE hardware
for final release testing and timing analysis.
CHAPTER 4. MAXELER IDE
Alternatively, the MaxIDE icon on the desktop (on the VM) may be clicked.
Note that the Maxeler IDE is based on Eclipse. See http://www.eclipse.org/ for general information
about the Eclipse IDE.
When started the first time, the IDE will present a Workspace Launcher window (see Figure 4.2) asking
you where to put all of your files. Accepting the default of ~/workspace should be suitable.
4.2.2 Documentation
When the IDE is started and there are no projects to display (as is the case when running it for the first
time) a Welcome window is displayed (Figure 4.3).
4.2. MAXELER FIRST TIME USE
Note: When running the IDE on hermes.niu.edu you may not be able to view any of the help documents.
The Welcome window presents a number of tutorials on how to use OpenSPL. These documents are
very useful. It is recommended that they be skimmed early on in order to familiarize yourself with what
is there so that help can be located down the road when it is needed.
Also appearing on the Welcome Window is a link to a set of example projects that are discussed in the
tutorial documents.
Select the Auto-import MaxCompiler tutorial projects link in the Welcome window (Figure 4.3).
Select MaxCompiler Dataflow Programming Tutorial from the menu box, check the examples
box, and then click Finish in the Import MaxCompiler Projects window (Figure 4.4).
CHAPTER 4. MAXELER IDE
This will import every project discussed in the Open MaxCompiler Dataflow Programming Tutorial.
Once completed the IDE will replace the Welcome window with the Project Explorer panel and list all
of the imported projects.
Click on the Select Run/simulation box over the Project Explorer and choose simulation.
Then click the play button icon to run it.
It will build and run your application.
If you receive a pop-up/warning about the simulator being started outside the IDE, select the force
reset option.
Output from the run shows up in the terminal window below the build messages.
Once things have run, you can open the Run Rules > Simulation > Final Kernel Graph to see a
diagram of the dataflow for the kernel and/or manager as seen in Figure 4.9.
4.3. MAXIDE PROBLEMS
Note that generated C headers appear in the workspace area under the CPU Code part of the project
tree.
If your connection to the server fails and/or the MaxIDE crashes, you may find that your workspace
has become corrupted. If starting the MaxIDE results in displaying a broken workspace, terminate it
immediately and start it again. Additionally, you will want to back up copies of your source code files
early and often.
See Appendix E for instructions on how to use SVN to backup and hand in your work.
5
Your First OpenSPL Program
5.1 Introduction
This chapter presents the details of creating a new project that will implement an application that
calculates si = xi² + 30a.
The goals are to learn how to create new projects using the IDE, and alter the CPU, Manager and
Kernel code to add/remove items from the SLiC interface.
To create a new application from scratch we will use the IDE to create an application from a template
based on the type of application we wish to develop and then change it to suit our needs.
In this example we will create an application with a Standard Manager using CPU Streams that will be
compiled for the Icsa DFE hardware we will be using.
The stub application template created by the IDE will implement si = xi + yi + a. The template code
will be altered in order to implement si = xi² + 30a.
CHAPTER 5. YOUR FIRST OPENSPL PROGRAM
Begin by clicking the New icon in the upper-left of the IDE and select MaxCompiler Project from
the menu as shown in Figure 5.1.
This will open a window that will prompt you for the details needed to create your project.
5.1. INTRODUCTION
In the window that opens enter a suitable name for the new project, set the Manager Templates to CPU
Stream (Vector Addition) and click Next (Figure 5.2).
The Project name field is used as the name that will appear in the Project Explorer tab on the left of
the IDE.
Set the DFE Model to the type of hardware you are targeting with your application (hermes.niu.edu
has an Icsa MAX4AB24B), choose a Standard Manager, provide a Stem Name for your manager
and kernel and then click Next (Figure 5.3).
Provide a suitable name for the file that will contain your C source code (the default is suitable), set the
SLiC Interface type to Basic Static and click Finish (Figure 5.4).
Have a look at the template files by opening the project in the Project Explorer tab and navigating to
your CPU Code and Engine Code Manager and Kernel files. Double-click on the TestStreamCpuCode.c,
TestStreamKernel.maxj, and TestStreamManager.maxj files to see them in an editor tab (Figure 5.5).
Note: At the moment we are not interested in the TestStreamEngineParameters file.
To test your application you can quickly build it for simulation and run it without using the DFE. In
simulation, your code will compile quickly and run slowly. But only a simulation can be debugged using
watches in the IDE. (Compiling for the DFE takes at least 20 minutes and cannot be easily debugged
while running!)
Building a project for simulation can be done by right-clicking on its name in the Project Explorer tab.
Right-click on FirstProject, then navigate the menu to Run As and click on Simulation (Figure 5.6).
While a project is building or running, messages will appear in the Console tab in the IDE (Figure 5.7).
Recall that the C application prints “Running on DFE.” and “Done.” from lines 22 and 31, respectively,
in Listing 5.1. We see those lines appearing timestamped at “Thu 19:17” in the Console tab thus
verifying that the application has executed.
The IDE generates a graphic version of the pipeline created by the kernel. You view it by clicking on
Original Kernel Graph in the Project Explorer tab.
The original kernel graph represents the kernel as described by the MaxJ code.
5.2. CONVERT TEMPLATE CODE TO DESIRED APPLICATION
The IDE also generates a graphic version of the optimized pipeline. You view it by clicking on
Final Kernel Graph in the Project Explorer tab.
The final kernel graph shows how the DFE will actually process the dataflow. It shows the optimizations
that are applied and indicates when and how buffering is used for temporal alignment.
In this trivial kernel, we can see that a three-input adder has been created to perform the kernel operation
rather than a cascade of two-input adders. We do not see any temporal alignment because the kernel is
trivial enough not to require any.
si = xi + yi + a (5.2.1)
. . . and it would also be nice to see the input and output data streams so that we can hand-verify our
code.
To accomplish these changes, we will add some printing logic, remove the y input stream from the SLiC
interface, change the computation performed by the kernel and change the verification logic to match
the new kernel computation.
To see what is going on we add printing logic to dump the input and output data streams. We need
not use large data sets to test this application, so we will also reduce the number of elements in the I/O
streams to 96 to minimize the noise-level.
Note the addition of printInt32Vector() on line 8 and the calls to it on lines 34–41 in Listing 5.5.
Listing 5.5: workspace/FirstProject2/CPUCode/TestStreamCpuCode.c
Add print logic to the CPU code.
1 # include < math .h >
2 # include < stdio .h >
3 # include < stdlib .h >
4
Running the new version of the application renders the output shown in Listing 5.6.
Open the TestStreamManager.maxj file and replace line 22 through the end of the file with lines xx–yy
shown in Listing 5.7 (which can be copied from the MovingAverageSimpleManager.maxj tutorial source),
thus replacing 38 lines of code with 5.
Remove the extra input stream by deleting the DFEVar y definition on line 16 in TestStreamKernel.maxj
and change the assignment on line 20 as shown in Listing 5.8.
Remove the y stream and change the verification logic in the CPU code to match the new kernel. See
changes on lines 23, 29, 37–38 and 45 in Listing 5.5.
Run your program one more time and see the final output in Listing 5.10.
15 70 13 26 91 80 56 73 62 70 96 81 5 25 84 27 36 5 46 29 13 57 24 95 82 45
14 67 34 64 43 50 87 8 76 78 88 84 3 51 54 99 32 60 76 68 39 12
11 Tue 15:22: Output x
12 Tue 15:22: 6979 7486 6019 315 8739 1315 7486 8554 2491 531 3934 819 8190 3571
4059 766 1690 766 5274 1386 211 4714 4579 931 6814 990 3934 619 4579 1315
931 94 574 3454 4851 4579 8739 3226 211 1854 931 5419 531 451 7146 1459 9694
666 315 4990 259 766 8371 6490 3226 5419 3934 4990 9306 6651 115 715 7146
819 1386 115 2206 931 259 3339 666 9115 6814 2115 286 4579 1246 4186 1939
2590 7659 154 5866 6174 7834 7146 99 2691 3006 9891 1114 3690 5866 4714 1611
234
13 Tue 15:22: Done .
14 Tue 15:22: Process terminated with exit code 0.
6
The Kernel
The Kernels present in an OpenSPL application are responsible for the “data processing.”
6.1 Introduction
Maximum performance in a Maxeler solution is achieved through a combination of deep-
pipelining and exploiting both inter- and intra-Kernel parallelism. The high I/O-bandwidth
required by such parallelism is supported by flexible high-performance memory controllers
and a highly parallel memory system.[1, p. 20]
The computation-to-data ratio, which describes how many mathematical operations are per-
formed per item of data moved, is a key metric for estimating the performance of the final
dataflow implementation. Code that requires large amounts of data to be moved and then
performs only a few arithmetic operations poses higher balancing challenges than code with
significant localized arithmetic activity.[1, p. 22]
A simple and straightforward method of improving the performance of an OpenSPL application is to
do as much as can be done with the data that is present in one kernel.
Multiple streams (related or not) can flow through a kernel pipeline at the same time.
In some situations the same data streams are needed for multiple different calculations. For example, an
application might require the following calculations:
CHAPTER 6. THE KERNEL
a = x² + y²
b = x² + z²
c = x + y + z
d = x · y · z
e = x² + y² + z²
Putting all five of these functions into a single kernel will allow the x, y and z streams to be transferred
into the DFE once and only once as opposed to the five times that it would otherwise take if each
function were implemented in a separate kernel and invoked serially. See the graph for this kernel in
Figure 6.1.
This kernel will generate all five output streams in the same amount of time that would be required to
execute only one of the above equations, as seen in Figure 6.2.
Another variation on the theme of doing more data processing at the same time is to observe that
sometimes it is not that different operations are performed on the same data but, rather, that the same
operations are performed on different data.
a = o + p + q + r
b = s + t + u + v
6.2. WIDENING THE PIPELINE
. . . that would produce the kernel graph in Figure 6.3. To implement such an application, one kernel
with eight input streams can be created, or a more elaborate Manager could be used to create two
copies of a four-input kernel and connect them to the eight input streams.
So far we have only considered kernels that can be implemented without any internal delays. Some
functions require the same data to be applied to different parts of the same pipeline. . . at different
times.
6.3.1 y = x² + z² + z
Solving y = x² + z² was straightforward due to the clean symmetry of its kernel graph (see Figure 2.6).
If we add z to the sum of the squares, something interesting happens.
Note the addition of the red edge in Figure 6.4 indicating the need to add z to x² and z².
The problem with this graph is that when one considers the pipeline expressed in Figure 2.8, the z values
have to be made available to the adder at a different time than when they have to be delivered to the
multiplier.
In order to address this issue, the z can be delayed using a FIFO buffer as shown in Figure 6.5.
Notice that the FIFO is drawn as a green table with a number in the center that represents the number
of elements (and therefore units of time) that the FIFO contains. When writing an element into a FIFO
with length 1, it will come out one unit of time later.
A timing diagram for Figure 6.5 is shown in Figure 6.6. Note that this diagram is identical to Figure 2.8
except for those rows highlighted on the left in yellow.
The pipeline latency has not changed because we can implement it using an adder with three inputs.
The extent to which we can add additional inputs to any type of operational unit depends on the type
of FPGA and versions of the compilers we use.
6.3. TEMPORAL ALIGNMENT
t1 ← x[k]         :   1   2   3   4   5   6
t2 ← z[k]         :   4   5   6   7   8   9
k ← k + 1         :   0   1   2   3   4   5
t3 ← t1 × t1      :   1   4   9  16  25  36
t4 ← t2 × t2      :  16  25  36  49  64  81
t6 ← t2           :   4   5   6   7   8   9
t5 ← t3 + t4 + t6 :  21  34  51  72  97 126
y[i] ← t5         :  21  34  51  72  97 126
i ← i + 1         :   0   1   2   3   4   5
tick              :   0 1 2 3 4 5 6 7 8 9 10
Figure 6.6: Six pipelined iterations of y[i] = x[i]² + z[i]² + z[i] for x = {1, 2, 3, 4, 5, 6} and z = {4, 5, 6, 7, 8, 9}
6.3.2 y = x² + z² + z − x
The intention of this example is to illustrate a situation where a FIFO is required to provide a delay
of more than one element as shown in Figure 6.7 with the corresponding timing diagram in Figure 6.8.
The items added or changed from Figure 6.6 are highlighted on the left in yellow.
Note that along with adding a 2-element FIFO, the increased complexity of the mathematical function
in this example has also changed the pipeline latency from 3 to 4.¹
The end-result still consumes one set of inputs per-tick and provides one output per-tick (once the
pipeline has been filled). But we can start to see the impact of what it means to trade space for time
as the six pipeline-stages that perform actual computation (shown in blue) are starting to be rivaled by
the number of stages that just hold or move data (shown in yellow and green.)
1 Note that the problem presented here is to illustrate the use of FIFOs. Because we can create a 3-input adder, it
would have been more efficient (lower latency) if the solution negated x and then fed it into a 3-input adder rather than
through a FIFO and into a subtractor.
t1 ← x[k]         :   1   2   3   4   5   6
t2 ← z[k]         :   4   5   6   7   8   9
k ← k + 1         :   1   2   3   4   5   6
t3 ← t1 × t1      :   1   4   9  16  25  36
t4 ← t2 × t2      :  16  25  36  49  64  81
t6 ← t2           :   4   5   6   7   8   9
t5 ← t3 + t4 + t6 :  21  34  51  72  97 126
t7 ← t1           :   1   2   3   4   5   6
t8 ← t7           :   1   2   3   4   5   6
t9 ← t5 − t8      :  20  32  48  68  92 120
y[i] ← t9         :  20  32  48  68  92 120
i ← i + 1         :   1   2   3   4   5   6
tick              :   0 1 2 3 4 5 6 7 8 9 10
Figure 6.8: Six pipelined iterations of y[i] = x[i]² + z[i]² + z[i] − x[i] for x = {1, 2, 3, 4, 5, 6} and z =
{4, 5, 6, 7, 8, 9}
There is more to temporal alignment than we have let on so far. Looking at an optimized kernel graph
for x² + x we see that a FIFO is created with a delay of two (rather than one!)
When calculating x² + x, the time it takes for the value of x² to appear at the output of the multiplier
depends on the type of FPGA and data representation used. In this example, x is a 32-bit integer on
an Altera Stratix V FPGA (that is on a Maxeler Icsa board).
Note the number ‘2’ in the FIFO in Figure 6.9. This implies that the multiplier operation is itself
implemented as a two-stage pipeline that is abstracted out of the kernel graph but is shown in the
timing diagram in Figure 6.10.
Note that the actual times for any FIFOs required for temporal alignments only appear in optimized
kernel graphs as their presence and size are dependent on the optimizations performed by the compiler.
x              :  1   2   3   4   5   6   7   8   9
x[−1]          :  1   2   3   4   5   6   7   8   9
x[−2]          :  1   2   3   4   5   6   7   8   9
x²             :  1   4   9  16  25  36  49  64  81
s ← x² + x[−2] :  2   6  12  20  30  42  56  72  90
tick           :  0 1 2 3 4 5 6 7 8 9 10 11 12 13
Figure 6.10: Nine pipelined iterations of s = x² + x for x = {1, 2, . . . , 9}
A
Installing and Using NX
A.1.1 Linux
APPENDIX A. INSTALLING AND USING NX
A.1.2 Windows
A.1.3 Mac
A.2 Setting up NX
5. Leave Unix in the first box and in the second box change KDE to custom.
6. Click on the settings box, change “Run the console” to “Run the following command” and
enter “MaxIDE” into the box.
7. Leave the options as a “floating window”.
8. Create a desktop shortcut if you would like one.
16. This is where you can change the size of the MaxIDE font to a more readable size.
17. Highlight “Text Font” in the box displayed and click on the box that says “Edit...”.
18. Select the font size that is most comfortable and click OK, then apply the results and click
B
Running MaxIDE on VMware
You can download a free VMware Player application for your operating system.
B.1.1 Linux
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0
B.1.2 Windows
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0
B.1.3 Mac
http://www.vmware.com/products/fusion/fusion-evaluation.html
The .vmx file is an image of an entire virtual machine that has been configured to run the Maxeler IDE
in simulation mode on CentOS 6.3.
To run the VM, click on the .vmx file. A Linux system will run, contained in a window. Once started,
you can launch the Maxeler IDE as discussed in Chapter 4.
C
Running MaxIDE on VirtualBox
You can download the free VirtualBox application for your operating system.1
Installations and configuration instructions for Windows, OS X, Linux and Solaris are all available from
Oracle here:
https://www.virtualbox.org/wiki/Downloads
The .vmx file is an image of an entire virtual machine that has been configured to run the Maxeler IDE
in simulation mode on CentOS 6.3.
To run the VM, click on the .vmx file. A Linux system will run contained in a window. Once it has
started you can launch the Maxeler IDE as discussed in chapter 4.
1 At the time of writing, the Maxeler VM version 2015 is known to run on VirtualBox version 5.0.6.
D
Java Resources
MIT offers the lecture notes for a number of courses for free on the web. MIT Course Number 6.092,
Introduction to Programming in Java, may be of interest to the new Java programmer who already has
some programming experience:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-092-introduction-to-programming-in-lecture-notes/
The Seventh Edition of Introduction to Programming Using Java is a free on-line textbook on introductory
programming, which uses Java as the language of instruction. This book is directed mainly towards
beginning programmers, although it might also be useful for experienced programmers who want to learn
something about Java.
http://math.hws.edu/javanotes/
A few Google searches turned up A Primer on Java, which appears to be an easy read and may be of
interest to anyone who hasn't written programs in a while.
https://leanpub.com/aprimeronjava
For further information on the Java language we recommend the following resources:
http://docs.oracle.com/javase/tutorial/java/index.html
http://docs.oracle.com/javase/tutorial/collections/index.html
An overview of the Java "Collections" API, which is used often in MaxCompiler interfaces.
http://docs.oracle.com/javase/6/docs/api/
http://www.java-tips.org/java-se-tips/java.lang/using-the-varargs-language-feature.html
An introduction to using variable-argument methods in Java, which are also common in MaxCompiler
interfaces.
E
Managing Projects With Subversion
Subversion is a version control system: a database that records and tracks changes to files over time.
When configured on a server machine, one can use it to share files between multiple users and machines.
There are many resources on the Internet that discuss how to use Subversion.1 Here we will discuss
accessing and using an existing repository with the MaxIDE.
E.1 Introduction
Given the vintage of the Eclipse version used in the Maxeler IDE, the only version control system
supported out-of-the-box is "SVN".2
We can use Subversion to back up files, to copy them between a virtual machine on a personal laptop
(for development, simulation, and debugging) and hermes.niu.edu (for building and running on the DFE
card), and to hand in homework.
In the simplest of terms, Subversion manages a database called a repository (or repo) that provides
mechanisms for inserting and retrieving files. When one or more files are inserted into the repo, the
action is called a commit or a check-in. When a file is retrieved from the repo, the action is called a
checkout or an update.
The Subversion database contains a copy of every version of every file that was ever committed over the
course of the file's evolution. This means that the database has the ability to retrieve what was committed
to the repo one hour ago, four days ago, 14 months ago, and so on.
Because the repo database has every version of every file, it can also be asked to show what
has changed between two or more versions of the same file. This is very useful when you break
some code and forget what you did since the last working version!
Of course, you have to remember to commit your files every now and then, or else the repo will
not have copies of the evolving versions of your files to make this possible.
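For readers working outside the IDE, the same history queries can be made with the command-line svn client. A sketch follows; the revision numbers are hypothetical, and the file name is borrowed from the example later in this appendix:

```shell
# Show the commit history of a file (revisions, authors, comments).
svn log a1CpuCode.c

# Show what changed in a file between two committed revisions.
svn diff -r 12:14 a1CpuCode.c

# Roll the working copy back to the state it had at revision 12.
svn update -r 12
```

These commands must be run inside a checked-out working copy of the repository.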
To use Subversion, a repository must be present. You may create one named MyRepo in the current
directory by running the following command:
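A repository is created with the standard svnadmin administration tool; a minimal sketch:

```shell
# Create an empty repository named MyRepo in the current directory.
svnadmin create MyRepo
```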
Once created, the repository directory and any files within it should never be touched directly. They
are only accessed using a client application like svn or an IDE such as Eclipse or MaxIDE.
For CSCI 532 a repo has already been created and assigned for your use on hermes.niu.edu. In order
to reference your repo a URL is used that looks like this:
svn+ssh://hermes.niu.edu/home/repos/z1234567
An svn client capable of checking projects into and out of a Subversion repo is built into MaxIDE.
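From a shell, the same URL works with the command-line svn client. A sketch, reusing the placeholder account z1234567 from the URL above (the working-copy directory name myproject is an assumption):

```shell
# Check out a working copy of the repository over ssh.
svn checkout svn+ssh://hermes.niu.edu/home/repos/z1234567 myproject

# Later, send local changes back with a descriptive comment.
cd myproject
svn commit -m "describe the change here"
```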
E.3 Checking a Project Into a Repository
Figure E.4: Enter the URL for the repository and click Next.
At this point you will have created a place for your project in the specified repository. You now have to
proceed to commit the current version of your files.
At this point you should note that each of the files in your repository is displayed with a revision
number indicating when it was last committed to the repository. This number is simply incremented
each time that anything is committed into the repository. As seen in Figure E.9, the revision is 12.
Figure E.9: Eclipse indicates the SVN version next to each file in the Project Explorer.
At this point the project files are all in the repository. Any editing of the files will (as one would expect)
make them out of date with respect to the version in the repository. Eclipse will indicate that a file is
out of date by placing a small brown indicator on the icon that represents the edited file(s), as well as
on the parent directory icons up to the root project level. See Figure E.10.
Figure E.10: Eclipse indicates out of date files with a brown decoration on related Project Explorer
icons.
After making changes to any files, repeat the commit process as described beginning in Figure E.7.
Again, enter a suitable comment before clicking OK. The comments entered when committing files
should accurately describe the changes that were made, so that it is possible to identify and understand
the various versions of the files as they are edited over time. Should a project stop working, an older
version of the file(s) can be recovered from the repository. At times like this, the easiest way to locate
the desired version of a file is to have planned ahead and properly annotated changes as they occurred.
Figure E.11: Committing the project files changes the version number of a1CpuCode.c as seen in the
Project Explorer.
After committing a new version of a file, the revision number will change in the Project Explorer view.
In the example shown in Figure E.11, committing changes to the file named a1CpuCode.c has changed
its revision from 12 to 14, committed at 6:31 PM on July 16th by 'winans'.
Once files are committed/checked into a repository they can then be checked out elsewhere. This feature
can be used to keep copies of a project's files on multiple machines in sync and to hand in your homework.
F
IEEE-754 Floating Point Number
Representation
• Note that the place values for integer binary numbers are:
... 128 64 32 16 8 4 2 1
• We can extend this to the right in binary similar to the way we do in decimal:
... 128 64 32 16 8 4 2 1 . 1/2 1/4 1/8 1/16 1/32 1/64 1/128 ...
Note that the ‘.’ in a binary number is a binary point, not a decimal point.
• Scientific notation, as in 27 × 10^−47, is used for either small fractions or large numbers when we
are concerned with fewer digits than are necessary to represent the entire number.
• The format of a number in scientific notation is mantissa × base^exponent.
• In binary we have mantissa × 2^exponent.
• For simplicity's sake, the IEEE-754 format requires binary numbers to be normalized to
1.significand × 2^exponent, where the significand is the part of the mantissa that is to the right of
the binary point.
• We need not store the leading '1.' because all normalized floating point numbers start that way.
Thus we can save memory by simply remembering that the first bit is always there and is supposed
to be a 1.
bit 31    bits 30..23    bits 22..0
sign      exponent       significand
1         1000 0000      0101 0000 0000 0000 0000 000
• −((1 + 1/4 + 1/16) × 2^(128−127)) = −(1 5/16 × 2^1) = −(1.3125 × 2^1) = −2.625
• −((1 + 1/4 + 1/16) × 2^(128−127)) = −((1 + 1/4 + 1/16) × 2^1) = −(2 + 1/2 + 1/8) = −(2 + .5 + .125) = −2.625
• IEEE-754 formats:
single precision: 1 sign bit, 8 exponent bits (excess-127), 23 significand bits
double precision: 1 sign bit, 11 exponent bits (excess-1023), 52 significand bits
• When the exponent is all ones, the mantissa is all zeros, and the sign is zero, the number represents
positive infinity.
• When the exponent is all ones, the mantissa is all zeros, and the sign is one, the number represents
negative infinity.
• Note that the binary representation of an IEEE-754 number in memory can be compared for
magnitude with another one using the same logic as for comparing sign–magnitude numbers,
because the magnitude of an IEEE number grows upward and downward in the same fashion
as a sign–magnitude integer. This is why we use excess notation for the exponent: numbers with
larger exponents look larger than numbers with smaller exponents even when incorrectly inter-
preted as sign–magnitude integers.
• Note that zero is a special case number. This is why the exponent of all–zeros is not used to
represent the smallest possible exponent value. Zero is represented by an exponent of all–zeros
and a mantissa of all–zeros. This allows for a positive and a negative zero if we observe that the
sign can be either 1 or 0.
• On the number line, numbers between zero and the smallest representable fraction in either
direction are in the underflow areas.
• On the number line, numbers whose magnitude exceeds that of a mantissa of all ones with the
largest allowed exponent are in the overflow areas.
• Note that numbers have a higher resolution on the number–line when the exponent is smaller.
F.1 Floating Point Number Accuracy
Due to the finite number of bits used to store the value of a floating point number, it is not possible to
represent every one of the infinite values on the real number line. The following C programs illustrate
this point.
Just like the integer numbers, the powers of two that have bits to represent them can be represented
perfectly. . . as can their sums:
#include <stdio.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = 1.0;
    while (x.f > 1.0/1024.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f/2.0;
    }
}
When dealing with decimal values, you will find that they don’t map simply into binary floating point
values (the same holds true for binary integer numbers).
Note how the decimal numbers are not accurately represented as they get larger. The decimal number
10 can be perfectly represented in IEEE format. The problem that arises after the 11th loop iteration
is not because the prior number was not multiplied by 10; it is due to the fact that the prior number
cannot be represented accurately in IEEE format. Therefore its least significant bits were truncated in
a best-effort attempt at rounding the value off. Once this happens, the value of x.f may not be what a
programmer expects.
#include <stdio.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = 10;
    while (x.f <= 10000000000000.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f = x.f*10.0;
    }
}
This effect of rounding errors can be exaggerated if the number we combine with the x.f value is itself
something that cannot be accurately represented in IEEE form. If we add 1/10 to our x.f value each
time, we can never be accurate and we start accumulating errors immediately.
#include <stdio.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;

    x.f = .1;
    while (x.f <= 2.0)
    {
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        x.f += .1;
    }
}
In order to use floating point numbers in a program without causing undesirable results, consider
redesigning your algorithm so that an accumulation of errors is eliminated. This example is similar to
the previous one, but this time we recalculate the desired value from known-accurate integer values.
Thus we might see some rounding errors, but they cannot accumulate.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

union floatbin
{
    unsigned int i;
    float f;
};

int main()
{
    union floatbin x, y;
    int i;

    i = 1;
    while (i <= 20)
    {
        x.f = i/10.0;
        y.f = -x.f;
        printf("%25.10f = %08x %25.10f = %08x\n", x.f, x.i, y.f, y.i);
        i++;
    }
    return 0;
}
Bibliography
[1] Maxeler Technologies, Multiscale Dataflow Programming, Feb 2014. Version 2013.3b.
[2] T. Starnes and J. Handy, “Intel really set to buy Altera FPGAs,” Electronic Design, Jun 2015.
http://electronicdesign.com/fpgas/intel-really-set-buy-altera-fpgas.
[3] M. Parker, “Understanding peak floating-point performance claims,” tech. rep., Altera Corporation,
Jun 2014. http://design.altera.com/HFP_White_Paper.
[4] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing ca-
pabilities,” in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, AFIPS ’67
(Spring), (New York, NY, USA), pp. 483–485, ACM, 1967.
[5] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The
Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design).
San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 4th ed., 2008.
[7] J. W. Davidson and C. W. Fraser, “The design and application of a retargetable peephole optimizer,”
ACM Trans. Program. Lang. Syst., vol. 2, pp. 191–202, Apr. 1980.
[8] A. S. Tanenbaum and J. R. Goodman, Structured Computer Organization. Upper Saddle River,
NJ, USA: Prentice Hall PTR, 4th ed., 1998.
[9] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM J. Res.
Dev., vol. 11, pp. 25–33, Jan. 1967.
[10] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools
(2nd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2006.
[11] Maxeler Technologies, MaxCompiler Manager Compiler Tutorial, 2013. Version 2013.3f.
[12] Maxeler Technologies, MaxCompiler Kernel Numerics Tutorial, 2013. Version 2013.3f.
[13] Maxeler Technologies, Acceleration Tutorial Loops and Pipelining, 2013. Version 2013.3f.
[14] Maxeler Technologies, Dataflow Programming for Networking, 2013. Version 2013.3f.
[15] Maxeler Technologies, MaxCompiler State Machine Tutorial, 2013. Version 2013.3f.
[16] A. Einstein, “The foundation of the general theory of relativity,” Annalen Phys., vol. 49, pp. 769–
822, 1916.
[17] C. G. Bell and A. C. Newell, Computer Structures: Readings and Examples (McGraw-Hill Computer
Science Series). McGraw-Hill Pub. Co., 1971.
[18] J. A. Darringer, The Description, Simulation, and Automatic Implementation of Digital Computer
Processors. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1969. AAI6919088.
Index

A
ALU, 6
Application Accelerator, 13
Arithmetic Logic Unit, see ALU

C
CPUCode, 32

D
Data Flow Engine, see DFE
DFE, 13

E
EngineCode, 33

F
Field Programmable Gate Array, see FPGA
FPGA, 1, 13

K
Kernel, 33, 45
Kernel Graph, 9

L
loop unrolling, 7

M
Manager, 33
MaxCompiler, 14
MaxJ, 13

O
OpenSPL, 13
Out-of-order execution, 8

P
Pipeline, 9
  Fill, 10
  Flush, 10
pipeline, 4

R
register, 2
Register Transfer Language, see RTL
RTL, 6

S
Simple Live CPU, see SLiC
SLiC, 14
Stream, 16
Subversion, 65
SVN, see Subversion

T
Timing Diagram, 6

V
Virtual Machine, see VM
VirtualBox, 61
VM, 59, 61
VMware, 59

W
Waveform Diagram, see Timing Diagram