Field Programmable Gate Arrays Explained
A high-level introduction to FPGAs
Chapter One: An Introduction to FPGAs
This Handbook provides a high-level introduction to FPGAs and is split into four main parts:
Definition and Overview; Historical Evolution; Applications and Use Cases; and Advantages and
Limitations. This section aims to give a general idea of what FPGAs are, where they come from, and
where they are used.
Field-Programmable Gate Arrays (FPGAs) represent a versatile and powerful class of integrated
circuits that offer a unique blend of flexibility and performance. Unlike traditional
Application-Specific Integrated Circuits (ASICs), FPGAs are programmable at the hardware level after
manufacturing. This characteristic allows users to configure the chip's functionality to suit specific
application requirements. FPGAs consist of an array of configurable logic blocks interconnected by
programmable routing resources. These logic blocks can be customized to perform a wide range of
digital functions, making FPGAs well-suited for tasks such as digital signal processing, image and
video processing, networking, and more.
The design process for FPGAs involves creating a hardware description using Hardware Description
Languages (HDLs). Commonly used HDLs include VHDL and Verilog. This hardware description is
then synthesized and implemented using specialized tools, generating a configuration bitstream
that defines the interconnections and functionality of the FPGA. This ability to reconfigure hardware
dynamically makes FPGAs ideal for rapid prototyping, iterative design, and applications where
adaptability is critical.
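To give a flavour of what such a hardware description looks like, below is a minimal VHDL sketch of a 2-to-1 multiplexer; the entity and signal names are illustrative only, not taken from any particular design.

    library ieee;
    use ieee.std_logic_1164.all;

    -- A 2-to-1 multiplexer: the output follows b when sel = '1', otherwise a.
    entity mux2 is
      port (
        a, b : in  std_logic;
        sel  : in  std_logic;
        y    : out std_logic
      );
    end entity mux2;

    architecture rtl of mux2 is
    begin
      y <= b when sel = '1' else a;
    end architecture rtl;

A synthesis tool would map a description like this onto the FPGA's look-up tables, and place-and-route would then wire it into the fabric before the configuration bitstream is generated.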
At the core of an FPGA's versatility is its architecture, typically composed of an array of Configurable
Logic Blocks (CLBs), Input/Output Blocks (IOBs), programmable interconnects, and other essential
components. Configurable Logic Blocks contain Look-Up Tables (LUTs) and flip-flops, allowing users
to define and implement digital logic functions. IOBs manage input and output connections,
enabling seamless interaction with external devices. This is illustrated in Figure 1 below.
Figure 1: An abstract view of an FPGA; Configurable Logic Blocks are embedded in a general routing fabric of programmable interconnects. [1]
FPGAs consist of a versatile internal architecture designed for digital circuit implementation. At the
core are CLBs housing Logic Elements (LEs) capable of both combinational and sequential logic
operations. Interconnects form a grid-like structure, incorporating a Switching Matrix for flexible
signal routing across CLBs. IOBs interface with external signals, supporting various I/O standards. On-chip memory is available both as distributed RAM built from LUTs and as dedicated Block RAM (BRAM). Dedicated Digital Signal Processing (DSP) blocks, equipped with specialized Multiply-Accumulate (MAC) units, optimize the implementation of signal processing algorithms. Clock management features, such as Phase-Locked Loops (PLLs) and clock buffers, complete the architecture by distributing low-skew clocks across the device.
The key distinguishing feature of FPGAs is their programmability. Designers use hardware
description languages such as VHDL or Verilog to create a hardware-level description of the desired
digital circuit. Through a series of design steps, including synthesis and place-and-route processes,
this description is translated into a configuration bitstream. This bitstream, when loaded onto the
FPGA, effectively "programs" the device, defining its internal connections and functionality.
FPGAs find applications across a broad spectrum of industries due to their adaptability and
performance. They are particularly valuable in prototyping and development stages of electronic
systems, where rapid iteration and modification are essential. Additionally, FPGAs play a crucial role
in applications requiring parallel processing, real-time signal processing, and tasks demanding
hardware acceleration. FPGAs were not always as advanced and complex as the ones widely
available today. The next section looks at the historical evolution that led to
the FPGA we know today.
Historical Evolution
This chapter looks deeper into the historical evolution of these dynamic devices. Building upon the
foundations laid in the introductory chapter, we will navigate through pivotal moments and key
milestones that have shaped the trajectory of FPGA development. From the rudimentary origins of
programmable logic arrays to the emergence of Configurable Logic Blocks (CLBs) and the birth of
true FPGA architecture, this chapter unfolds the narrative of innovation and adaptation. By tracing
the footsteps of industry leaders, exploring technological breakthroughs, and understanding the
driving forces behind each evolutionary leap, the aim is to provide a comprehensive narrative that
not only captures the historical nuances but also sheds light on the transformative impact of FPGAs
on the digital landscape. There is a rich history to be unravelled, exploring the threads that have
woven together to create the sophisticated programmable devices we know today.
The historical evolution of FPGAs reflects a journey from basic programmable logic concepts to
sophisticated, highly configurable devices that play a pivotal role in modern digital systems. As
technology advances, FPGAs are poised to remain at the forefront of innovation, adapting to new
challenges and unlocking possibilities in diverse fields.
Automotive Electronics
In the automotive industry, FPGAs contribute to advanced driver assistance systems (ADAS),
in-vehicle infotainment, and control systems. Their adaptability allows for the implementation of
evolving standards, and their parallel processing capabilities enhance real-time processing for
safety-critical applications.
The low-latency nature of FPGAs makes them well-suited for real-time processing in fields like
communications and control systems. Additionally, FPGAs can be power-efficient when optimized
for specific functions, and their integration of IP cores expedites development by incorporating
pre-designed functional blocks. However, FPGAs also come with limitations, including finite
resources that must be carefully managed in complex designs. Cost considerations, programming
complexity, and a potential learning curve for hardware description languages (HDLs) can pose
challenges. While FPGAs are adept at parallel tasks, they may not be as efficient for purely
sequential operations, and security concerns regarding bitstream protection need to be addressed.
Long development cycles and vendor dependence on specific tools and libraries are additional
factors that should be considered when choosing FPGAs for a particular application.
In the next chapter, we're going to take a close look at how FPGAs are put together. After discussing
their history, uses, and pros and cons, we're now going to explore how these flexible devices actually
work. The chapter will explain the different parts of FPGAs, like CLBs, interconnects, IOBs, Block
RAM (BRAM), and more. We'll go through the pathways that can be programmed, understand the
logic elements, and look at special blocks like Digital Signal Processing (DSP) units. By focusing on
the basic structure, this chapter aims to help readers understand how FPGAs turn digital designs
into physical results. Keep reading as we uncover the details of FPGA architecture, revealing the
complexities that make these devices important in the world of programmable logic.
In this chapter, we embark on a detailed exploration of FPGA architecture to get a closer look at the
intricacies that define these reconfigurable marvels.
It is essential to note that this abstract view serves as a simplified representation. FPGA
architectures are significantly more complex, featuring additional elements, specialized resources,
and advanced functionalities. This chapter seeks to take a detailed look at the intricacies of the basic
components – CLBs, IOBs, and interconnects together with other very important parts in modern
FPGAs – providing an understanding of their roles and interactions within the broader FPGA
framework.
In the landscape of FPGAs, Configurable Logic Blocks (CLBs) play a pivotal role in implementing the digital logic required by a specific application.
According to digital logic fundamentals, any computation can be articulated as a Boolean equation,
and in certain instances, as a Boolean equation where inputs rely on prior results—fear not, as FPGAs
can indeed retain state. Consequently, every Boolean equation finds expression in a truth table.
From these foundational principles, structures can be built to perform arithmetic operations like
addition and multiplication as well as decision-making processes that assess conditional
statements, exemplified by the classic if-then-else structure. By amalgamating these elements, we
can articulate complex algorithms succinctly through the utilization of truth tables.
The LUT possesses the capability to compute any function of N inputs by programming the lookup
table with the truth table corresponding to the desired function. As depicted in Figure 3,
implementing a 3-input exclusive-or (XOR) function with a 3-input LUT (often denoted as a 3-LUT)
involves assigning values to the lookup table memory in a manner that aligns the pattern of select
bits with the correct row's "answer." Consequently, each "row" produces a result of 0, except in the
four instances where the XOR of the three select lines yields 1.
Figure 3: A 3-LUT schematic (a) and the corresponding 3-LUT symbol and truth table (b)
for a logical XOR. [1]
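To connect the LUT view back to the HDL view, the hedged VHDL sketch below describes the same 3-input XOR; a synthesis tool would typically implement it in a single LUT whose memory holds exactly the truth table of Figure 3. The entity name is illustrative only.

    library ieee;
    use ieee.std_logic_1164.all;

    -- 3-input XOR: the output is '1' when an odd number of inputs are '1'
    -- (truth-table rows 001, 010, 100 and 111).
    entity xor3 is
      port (
        a, b, c : in  std_logic;
        y       : out std_logic
      );
    end entity xor3;

    architecture rtl of xor3 is
    begin
      y <= a xor b xor c;
    end architecture rtl;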
Of course, more complicated functions – and functions of a larger number of inputs – can be
implemented by aggregating several lookup tables together. For example, one can organize a single
3-LUT into an 8×1 ROM, and if the values of the lookup table are reprogrammable, an 8×1 RAM – but
the basic building block, the lookup table, remains the same.
Our logic block now adopts a configuration resembling that depicted in Figure 4. The output
multiplexer makes a choice between the result derived from the function generated by the lookup
table and the stored bit in the D flip-flop. A multiplexer, often abbreviated as "MUX," is a digital circuit
component that plays a crucial role in data routing and selection within electronic systems. It is
designed to take multiple input data lines and selectively route a particular input to the output based
on control signals. In practice, this logic block closely mirrors those found in certain commercial
FPGAs.
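The behaviour of such a logic block can be sketched in VHDL as below, assuming a 4-input LUT feeding a D flip-flop, with a configuration bit selecting either the registered or the combinational output; all names, widths, and the way the configuration bits are exposed as ports are purely illustrative.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Simplified logic element: 4-LUT -> D flip-flop -> output multiplexer.
    entity logic_element is
      port (
        clk      : in  std_logic;
        lut_init : in  std_logic_vector(15 downto 0); -- LUT contents (the truth table)
        sel_reg  : in  std_logic;                      -- config bit: '1' = registered output
        a        : in  std_logic_vector(3 downto 0);   -- LUT inputs
        y        : out std_logic
      );
    end entity logic_element;

    architecture rtl of logic_element is
      signal lut_out : std_logic;
      signal ff_out  : std_logic := '0';
    begin
      -- The four inputs index into the 16-bit LUT contents.
      lut_out <= lut_init(to_integer(unsigned(a)));

      -- The D flip-flop captures the LUT output on each rising clock edge.
      process (clk) begin
        if rising_edge(clk) then
          ff_out <= lut_out;
        end if;
      end process;

      -- Output multiplexer: choose the registered or the combinational result.
      y <= ff_out when sel_reg = '1' else lut_out;
    end architecture rtl;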
Most modern FPGAs are composed not of a single LUT, but of groups of LUTs and registers
(flip-flops) with some local interconnect between them. Figure 5 illustrates a CLB with multiple LUTs.
There has been research and ongoing debate over logic blocks containing groups of LUTs and their respective shapes and forms, particularly regarding the density and speed achieved by the resulting CLBs. A
particular study has shown that an FPGA containing two-thirds 4-input LUTs and one-third 2-input
LUTs reduced the number of bits within the LUTs by 22% and the number of logic block pins by 10%
when compared to FPGAs with only 4-input LUTs. [3]
Within our logic block, the 4-LUT comprises 16 SRAM bits, one for each of the 2^4 possible input combinations; the output multiplexer utilizes a single SRAM bit, and the initialization value for the D flip-flop can
also be stored in a single SRAM bit. The way these SRAM bits are initialized within the broader
context of the FPGA will be covered in subsequent chapters.
These elements are used by all slices to provide logic, arithmetic, and ROM functions. In addition,
some slices support two additional functions: storing data using distributed RAM and shifting data
with 32-bit registers. Slices that support these additional functions are called SLICEM; others are
called SLICEL [4]. The complete schematic of the slices can be seen in [4].
A CLB element contains a pair of slices, and each slice is composed of four 6-input LUTs
and eight storage elements.
• SLICE(0) – slice at the bottom of the CLB and in the left column
• SLICE(1) – slice at the top of the CLB and in the right column
These two slices do not have direct connections to each other, and each slice is organized as
a column. Each slice in a column has an independent carry chain.
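As a small example of logic that exercises these slice resources, the hedged sketch below describes a registered adder; synthesis tools typically implement the addition on the slices' dedicated carry chains rather than in general-purpose LUT logic. The widths and names are illustrative.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Registered 16-bit adder; the extra output bit carries the final carry-out.
    entity adder16 is
      port (
        clk  : in  std_logic;
        a, b : in  unsigned(15 downto 0);
        sum  : out unsigned(16 downto 0)
      );
    end entity adder16;

    architecture rtl of adder16 is
    begin
      process (clk) begin
        if rising_edge(clk) then
          sum <= resize(a, 17) + resize(b, 17);
        end if;
      end process;
    end architecture rtl;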
IOBs are programmable, meaning you can customize their behaviour based on the needs of your
specific application. They're designed to adapt to various electrical standards and interface with
different external devices, making FPGAs highly flexible and suitable for a wide range of projects.
Today’s FPGAs provide support for dozens of I/O standards thus providing the ideal interface bridge
in your system. I/O in FPGAs is grouped in banks with each bank independently able to support
different I/O standards. Today’s leading FPGAs provide over a dozen I/O banks, thus allowing
flexibility in I/O support.
IOB implementations vary from FPGA to FPGA and vendor to vendor, and thus in the next section a
look at how 7-series FPGAs handle I/Os is given.
SelectIO
The I/O system on 7-series FPGAs is called SelectIO™ and is defined in AMD User Guide UG471. The
7-series FPGAs have different types of I/O banks. There are the High-Performance (HP) ones and the
High-Range (HR) ones. The HP banks are made to work better with fast memory and inter-chip
connections. They handle voltages up to 1.8V. On the other hand, the HR banks support a broader
range of input/output standards and can handle voltages up to 3.3V. So, depending on what you
need, you can choose the type of I/O bank that fits best for your project. It is important to note that
IO bank voltage depends on the application and the circuitry surrounding the FPGA. Some
development boards may support changing the voltage on some banks, while others have a specific
use-case and therefore use fixed bank voltages.
Programmable Interconnect
In the world of FPGAs, the programmable interconnect is like the wiring system that connects
different parts of the FPGA. It's a network of pathways that you can configure and adjust based on
what your electronic project needs.
Imagine it as the roads in a city. You can change the routes to connect different places, and you can
do the same with the programmable interconnect in an FPGA. This flexibility is what makes FPGAs
powerful. Instead of having a fixed layout like traditional circuits, FPGAs let you create your own pathways, allowing you to build custom electronic circuits tailored to your specific requirements. In the city analogy, the FPGA's programmable interconnect is a customizable road system, giving you the freedom to lay out roads in whatever way is most efficient, carries the most traffic, or covers the shortest distance.
Now that we have an idea of how logic computation is achieved in FPGAs, we will go through the
programmable interconnect and its functional description within FPGAs. Figure 9 below illustrates
the current most popular implementation architecture in FPGAs, commonly called island-style
architecture.
In this plan, there are puzzle-like building blocks scattered in a two-dimensional pattern and
connected in a certain way. These building blocks are like islands, and they kind of float in a network
of connections.
This layout lets us do calculations in a partitioned way on the FPGA. Big calculations are split into
smaller pieces the size of a 4-LUT (a basic logic element) and put into these physical building blocks.
The connections are set up to guide signals between the building blocks in the right way. If we have
enough of these building blocks, we can make our FPGAs do any kind of calculation we want. It's like
having many small parts that work together to create a big and powerful system. Looking at Figure
9 we can deduce that this is simply a visual placeholder to give us an idea of the internal structure of
the FPGA. The actual internal architecture of FPGAs is more complex. In this section interconnect
structures present in today’s FPGAs are introduced.
Nearest Neighbour
Nearest-neighbour communication is as straightforward as it sounds. Imagine a 2x2 arrangement of
logic blocks, just like in Figure 10. In this setup, each logic block only needs to connect with its
immediate neighbours in four directions: north, south, east, and west. This means that every logic
block can directly talk to the ones right next to it.
Figure 10 shows one of the simplest routing architectures possible. Even though it might seem
basic, some older commercial FPGAs actually used this approach. However, this simple design has
its drawbacks. It has issues with connectivity delays. Think of it this way: if instead of a small 2x2
setup, you had a huge 1024x1024 array, the delay would increase as you move further away. The
signal must travel through many cells and switches to reach its final destination, causing delays and
connectivity problems.
Here is where the need to bypass logic blocks arises. Without the ability to bypass logic blocks in the
routing structure, all routes that are more than a single hop away require traversing a logic block.
With just one pair of connections that work in both directions, there's a restriction on how many
signals can cross in and out. Signals that are moving through must not interfere with signals that are
actively being used and generated.
Because of these limitations, the nearest-neighbour structure isn't commonly used all by itself.
However, it's almost always included in current FPGAs. Usually, it's combined with other techniques
to overcome the challenges posed by its simplicity. One of the techniques that the
nearest-neighbour structure is combined with is the segmented structure.
Segmented
Most of today's FPGA designs are less like Figure 10 and more like Figure 11. In Figure 11, we bring in
what's called a Connection Block (CB) and a Switch Box (SB). This makes the routing structure more
versatile and mesh-like.
Figure 11: Illustration of a traditional island-style (mesh based) FPGA architecture with CLBs; the CLBs
are “islands in a sea of routing interconnects”. The horizontal and vertical routing tracks are
interconnected through switch boxes (SB) and connection boxes (CB) connect logic blocks in the
programmable routing network, which connects to I/O blocks. [2]
The switch block appears where horizontal and vertical routing tracks converge as shown in Figure
13. In the most general sense, it is simply a matrix of programmable switches that allow a signal on
a track to connect to another track. SBs are placed at the intersection points of vertical and
horizontal routing channels. Routing a net from a CLB source to the target CLB sink necessitates
passing through multiple tracks and SBs, in which an entering signal from a certain side can connect
to any of the other three directions based on the switch matrix (matrix of SBs) topology. The popular
SB topologies in commercial FPGA architectures are Wilton, Disjoint, and Universal, which are shown
in Figure 14.
Figure 15: The structure of unidirectional (left) and bidirectional (right) Universal switch matrices.
The connections between different parts of the FPGA can either be one-way (unidirectional) or
two-way (bidirectional), and you can see examples of both in Figure 15. However, in modern FPGAs,
the main setup is with one-way tracks. These one-way tracks can be either short or long. For
instance, a wire that spans two Configurable Logic Blocks (CLBs) is a two-segment wire. Longer
wires might take a bit more time to get through the multiplexer (SB) but are good for connecting
things globally across the FPGA. On the other hand, shorter tracks have less delay, making them
better for connecting things that are close by. So, depending on whether you need to connect things
far or near, you might choose longer or shorter tracks.
Hierarchical
Here's another way to make long wires faster: a hierarchical approach. Look at the structure in Figure
16. At the lowest level, we group together 2x2 arrays of logic blocks into a single cluster. Inside this
cluster, the routing is limited to local, nearest-neighbour connections. Now, we create a higher level
by forming a 2x2 cluster of these smaller clusters, making a group of 16 logic blocks. At this level,
longer wires at the edges of the smaller 2x2 clusters connect each cluster of four logic blocks to the
other clusters in the higher-level group. We keep repeating this pattern at even higher levels, with
larger clusters and even longer wires.
This interconnect design relies on the idea that a well-designed (and well-placed) circuit mostly has
local connections, and only a few connections need to travel long distances. By offering fewer
resources at the higher levels, this design stays efficient in terms of space while still having some
longer wires to speed up signals that need to cross large distances.
Hard Cores
Many new FPGA devices come with added features like specialized building blocks such as memory
blocks (single or dual-port RAMs), multipliers and other arithmetic operations, and Digital Signal
Processors (DSPs). These DSP and other dedicated blocks are designed and built into the devices to
make it easier to implement specific functions. Without these specialized blocks, you would need a
much larger number of Look-Up Tables (LUTs) to achieve the same functionality. They also provide a
way to handle applications with high memory requirements.
However, certain design choices, like how often certain blocks are repeated in the architecture (as
shown in Figure 17), are crucial. This repetition frequency is a key design parameter that affects the
overall performance and energy efficiency of the FPGA. The way these architectural elements are
configured plays a significant role in determining how well the FPGA can perform specific tasks and
how efficiently it uses energy.
Embedded Resources
Even though CLBs are a very important and powerful tool in FPGAs, they can easily be overused
when trying to implement structures such as memory, shift registers and arithmetic operations.
That is why all modern FPGAs have specific embedded resources that target these challenges.
Data Storage
Data storage is very common and important in digital system design. Apart from the SLICEMs in the
CLBs of 7-series FPGAs, which can be used as memories or shift registers, FPGAs also have
something called Block RAMs (BRAM) embedded in their hardware. These are bigger storage parts.
In the 7-series, all the parts have 36 Kb BRAM, each of which can be divided into two 18 Kb BRAMs.
The table below shows how much BRAM is in the parts on the suggested development boards.
These BRAMs are not exclusively found in 7-series FPGAs but are common to all modern FPGAs. The
following table lists the BRAM resources available on two Digilent boards.
The contents of a BRAM can be set at initialization, either from a file or directly in the HDL source. This comes in handy when creating ROMs or setting up initial conditions.
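As an illustration, the hedged VHDL sketch below describes a small synchronous ROM whose contents are fixed at initialization; for large enough depths (or with vendor-specific attributes), synthesis tools commonly map such arrays onto block RAM, although the exact inference rules vary from tool to tool.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Small synchronous ROM with contents defined at initialization.
    entity small_rom is
      port (
        clk  : in  std_logic;
        addr : in  std_logic_vector(3 downto 0);
        data : out std_logic_vector(7 downto 0)
      );
    end entity small_rom;

    architecture rtl of small_rom is
      type rom_t is array (0 to 15) of std_logic_vector(7 downto 0);
      -- Example contents; in practice these might also be read in from a file.
      constant ROM : rom_t := (
        0 => x"00", 1 => x"11", 2 => x"22", 3 => x"33",
        others => x"FF");
    begin
      -- Registering the read gives the synchronous behaviour expected of a BRAM.
      process (clk) begin
        if rising_edge(clk) then
          data <= ROM(to_integer(unsigned(addr)));
        end if;
      end process;
    end architecture rtl;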
In 7-series devices, the BRAMs also have some built-in logic to create FIFOs (First-In First-Out). This
is useful because it helps save resources in the CLBs, and it makes the design process smoother by
avoiding some technical issues.
All 36 Kb BRAMs come with something called Error Correction Code (ECC) functions. This is more
about ensuring things work reliably, like in medical, automotive, or space applications. However, we
won’t get into the details of that in this handbook.
In addition to the embedded BRAMs, 7-series FPGAs also offer an on-chip high-speed memory interface which runs at up to 1,866 Mb/s on Virtex-7 FPGAs.
FPGAs have special parts called DSP blocks or slices. These DSP blocks help speed up common tasks
like fast Fourier transforms (FFTs) and finite impulse response (FIR) filtering, which are common digital signal processing workloads.
You can also perform operations such as multiplication using regular logic (LUTs and flip-flops), but
it uses up a lot of resources. Using the special DSP blocks for multiplication makes sense because it’s
better for performance and using logic efficiently. That’s why even small FPGAs set aside space for
DSP blocks.
DSP48E1
FPGAs excel in digital signal processing (DSP) tasks because they can use special, fully parallel
methods that are customized for specific needs. DSP operations often involve a lot of binary
multiplication and accumulation, and FPGAs have dedicated parts called DSP slices that are perfect
for these tasks. In the 7-series FPGAs, there are plenty of these custom-designed, low-power DSP
slices that are fast, compact, and still flexible for designing different systems. The figure below [5]
illustrates the basic DSP48E1 Slice functionality in 7-series FPGAs.
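The hedged sketch below shows a multiply-accumulate operation of the kind a DSP48E1 slice is built for; with operand widths within the slice's 25×18 multiplier and 48-bit accumulator, synthesis tools will usually map a registered multiply-add like this onto a DSP slice rather than building it from LUTs, although inference behaviour depends on the tool and its settings. The entity and signal names are illustrative.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Multiply-accumulate: acc <= acc + a * b on every clock cycle.
    entity mac is
      port (
        clk : in  std_logic;
        rst : in  std_logic;
        a   : in  signed(24 downto 0);  -- 25-bit operand
        b   : in  signed(17 downto 0);  -- 18-bit operand
        acc : out signed(47 downto 0)   -- 48-bit accumulator
      );
    end entity mac;

    architecture rtl of mac is
      signal acc_r : signed(47 downto 0) := (others => '0');
    begin
      process (clk) begin
        if rising_edge(clk) then
          if rst = '1' then
            acc_r <= (others => '0');
          else
            -- The 43-bit product is sign-extended before the 48-bit addition.
            acc_r <= acc_r + resize(a * b, 48);
          end if;
        end if;
      end process;
      acc <= acc_r;
    end architecture rtl;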
Peripherals
Peripherals in FPGAs refer to external devices or components that can be connected to the FPGA to
enhance its functionality. These peripherals can include input/output interfaces, communication
ports, memory modules, sensors, and other hardware components that extend the capabilities of
the FPGA. They enable the FPGA to interact with the external world, process data from various
sources, and perform specific tasks based on the application’s requirements. Integrating peripherals
allows FPGAs to be customized for a wide range of applications and makes them adaptable to
different tasks and environments. Modern FPGA chips also incorporate hardened peripherals so that certain functions, such as inter-chip communication, do not have to be implemented using the FPGA's general-purpose logic.
Connectivity
Connectivity in modern digital processing platforms, particularly in 7-series FPGAs, extends beyond
the chip itself to encompass peripheral circuitry connected to FPGA I/Os. In the realm of digital
design, establishing communication with external devices is often facilitated by incorporating
components like Ethernet PHYs and USB controllers. These peripherals offer simpler interfaces to
the FPGA, enabling seamless connectivity with various devices. For instance, utilizing a USB
controller can simplify the implementation of interfaces like USB-UART serial communication,
reducing the burden on FPGA resources. This strategic offloading of functionalities to dedicated
peripheral circuitry not only streamlines the design process but also optimizes the utilization of
valuable FPGA resources for more complex and specialized tasks. In essence, the integration of
these peripheral components enhances the overall connectivity of the FPGA-based system,
fostering efficient communication with the outside world.
One very important aspect to consider when designing with FPGAs is the clock skew. Clock skew
refers to the variation in arrival times of a clock signal at different points within a digital system. In
other words, it's the difference in time it takes for the clock signal to reach different parts of a circuit.
In synchronous digital systems, various components rely on the same clock signal to coordinate
their operations. However, due to factors such as differences in wire lengths, routing paths, and
environmental conditions, the clock signal may not reach all components simultaneously.
Clock skew can lead to timing issues and negatively impact the reliability and performance of a
digital circuit. Excessive clock skew may result in some components latching data at different times,
causing data corruption and errors. Designers use techniques like careful routing, buffer insertion,
and clock tree synthesis to minimize clock skew and ensure that the clock signal reaches different
parts of the circuit as simultaneously as possible. Minimizing clock skew is particularly important in
high-performance digital systems to maintain accurate synchronization.
It is relevant to mention here that in the context of AMD Vivado, "negative slack" refers to a critical timing violation. Slack is the margin between when a signal is required to arrive and when it actually arrives; negative slack occurs when the design fails to meet timing constraints, meaning that certain paths
in the design are not meeting the required timing specifications. This can be problematic because it
may lead to incorrect functionality, reduced performance, or even complete failure of the design.
FPGA clock management resources, when used correctly, enable the designer to meet timing constraints and minimize these effects in FPGA systems. Designers sometimes neglect clocking in their designs, but mismanaged clock signals throughout an FPGA can lead to ineffective designs that exhibit undesirable intermittent errors. In the following sections the clock management tools available in modern FPGAs are discussed.
Clock Sources
Internal oscillators within FPGAs offer on-chip solutions for generating clock signals, eliminating
the need for external clock sources and simplifying the design process. These on-chip oscillators
provide stable and precise clock signals with low jitter and configurable frequencies to meet the
timing requirements of digital circuits within the FPGA. They contribute to power efficiency and play
a pivotal role in defining clock domains, enabling heterogeneous designs with varied clock
frequencies. While internal oscillators provide a convenient option for many applications, it's
important to note that in certain cases, especially those with stringent timing requirements or
specialized needs, most designs still rely on external oscillators for greater precision or specific
frequency characteristics. The choice between internal and external oscillators depends on the
specific design considerations and performance criteria of the FPGA application.
External oscillators for FPGAs serve as standalone clock sources positioned outside the FPGA
device itself. These oscillators are preferred in applications that demand a high degree of precision
and stability in clock signals. Selected based on specific frequency requirements, external oscillators
are characterized by low jitter and accurate frequency control, making them suitable for scenarios
where standard on-chip oscillators may not meet the desired frequencies or specialized clock
characteristics. Often taking the form of crystal oscillators, these external sources utilize crystal
resonators to generate highly stable clock signals. They are frequently integrated into the broader
clock distribution network, providing a common reference for multiple FPGA devices or other
components within a larger system. External oscillators play a crucial role in applications requiring
synchronized clocks across different components or boards, contributing to coherent operation in
systems with distributed timing needs. Configuration options allow designers to tailor external
oscillators to the specific frequency and settings requirements of their FPGA application. While
internal oscillators within FPGAs offer simplicity and integration, the choice between internal and
external oscillators depends on the precision and customization demands of the particular application.
Phase-Locked Loops
A Phase-Locked Loop (PLL) is an electronic feedback system designed to control the phase of an
output signal in relation to a reference signal. It is commonly used in electronics and communication
systems for tasks such as clock synchronization, frequency synthesis, and demodulation. PLLs offer
clock multiplication and division, phase shifting, programmable duty cycle, and external clock
outputs, allowing system-level clock management and skew control.
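As a rough illustration, the sketch below instantiates the 7-series PLLE2_BASE primitive to turn a 100 MHz reference into a 200 MHz clock; the output frequency follows f_out = f_in × CLKFBOUT_MULT / (DIVCLK_DIVIDE × CLKOUT0_DIVIDE), so here 100 MHz × 10 / (1 × 5) = 200 MHz. The generic values are example settings only, and in practice the Vivado Clocking Wizard is normally used to generate a fully validated instantiation.

    library ieee;
    use ieee.std_logic_1164.all;
    library unisim;
    use unisim.vcomponents.all;

    entity pll_example is
      port (
        clk_in  : in  std_logic;  -- 100 MHz reference
        rst     : in  std_logic;
        clk_out : out std_logic;  -- 200 MHz output
        locked  : out std_logic
      );
    end entity pll_example;

    architecture rtl of pll_example is
      signal clkfb : std_logic;
    begin
      pll_i : PLLE2_BASE
        generic map (
          CLKIN1_PERIOD  => 10.0, -- 100 MHz input clock
          CLKFBOUT_MULT  => 10,   -- VCO = 100 MHz * 10 = 1000 MHz
          DIVCLK_DIVIDE  => 1,
          CLKOUT0_DIVIDE => 5     -- CLKOUT0 = 1000 MHz / 5 = 200 MHz
        )
        port map (
          CLKIN1   => clk_in,
          CLKFBIN  => clkfb,
          CLKFBOUT => clkfb,      -- feedback loop closed in the fabric
          CLKOUT0  => clk_out,
          RST      => rst,
          PWRDWN   => '0',
          LOCKED   => locked
        );
    end architecture rtl;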
Clock Buffers
Clock buffers are essential components in digital systems, tasked with the efficient distribution of
clock signals across various components of a circuit. Their primary function is to replicate an input
clock signal and deliver it to multiple output locations within a system. This is particularly critical in
large-scale digital designs where synchronous operation is paramount. Clock buffers are designed
to handle multiple outputs, a property known as fanout, without compromising the integrity of the
clock signal. They come in different types, including non-inverting and differential buffers, each
serving specific purposes. Non-inverting buffers maintain the same logic level as the input, while
differential buffers transmit the clock signal as a complementary signal pair, offering enhanced noise
immunity. Clock buffers play a crucial role in minimizing clock skew, the variation in arrival times of
the clock signal at different points in the circuit, ensuring synchronized operation. Additionally, they
may provide control over edge rates, influencing the speed of transitions between logic levels in the
output signals. In FPGA and ASIC designs, where precise clock control is essential, clock buffers
contribute to meeting timing requirements and facilitating reliable digital system operation.
Clock Regions
Clock regions in FPGAs serve as designated areas within the device where clock resources are
organized and managed. As FPGAs comprise numerous programmable logic cells and dedicated blocks, dividing the device into clock regions keeps clock distribution, skew, and power under control.
Consider the implementation of DDR (Double Data Rate) memory interfaces using Memory
Interface Generator (MIG) in AMD Vivado. DDR memory interfaces, commonly used in many
applications for higher data transfer rates, have stringent timing requirements. These interfaces
demand precise control over clock signals to avoid errors in tooling and maintain correct design
functionality. In DDR interfaces, the data is transferred on both the rising and falling edges of the
clock signal, doubling the effective data transfer rate. This introduces challenges related to clock
skew, where the arrival times of clock signals at different points in the system need to be tightly
controlled to meet the timing constraints imposed by the DDR standard.
In the context of DDR timing, mismanagement of clock resources can lead to errors in the tooling
process, incorrect design functionality, and compromised performance. New users, in particular, are prone to overlooking these requirements.
By effectively utilizing FPGA clock management resources, designers can navigate the intricacies of
DDR timing, optimize performance, and ensure a robust and error-free implementation of memory
interfaces. This example underscores the practical significance of mastering clock management in
FPGA design, especially for applications with stringent timing constraints like DDR interfaces.
Each 7-series FPGA provides six different types of clock lines (BUFG, BUFR, BUFIO, BUFH, BUFMR,
and the high-performance clock) to address the different clocking requirements of high fanout,
short propagation delay, and extremely low skew.
In every 7-series FPGA (except XC7S6 and XC7S15), there are 32 global clock lines with the widest
reach, capable of extending to every flip-flop clock, clock enable, set/reset, and many logic inputs.
Within each clock region, governed by horizontal clock buffers (BUFH), there are 12 global clock
lines. Each BUFH can be independently enabled or disabled, offering the flexibility to turn off clocks
within a specific region, providing precise control over power consumption. These global clock lines
can be connected to global clock buffers, which have the additional capabilities of glitchless clock
multiplexing and managing clock enable functions. The source of global clocks is often the Clocking
Management Tile (CMT), which has the potential to entirely eliminate basic clock distribution delays.
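For completeness, the hedged sketch below shows how a global clock buffer can be instantiated explicitly in VHDL using the BUFG primitive from the UNISIM library; in most designs the tools insert these buffers automatically for signals they recognize as clocks, so explicit instantiation is only occasionally necessary.

    library ieee;
    use ieee.std_logic_1164.all;
    library unisim;
    use unisim.vcomponents.all;

    -- Drive a signal onto the global clock network through a BUFG.
    entity bufg_example is
      port (
        clk_in  : in  std_logic;
        clk_out : out std_logic
      );
    end entity bufg_example;

    architecture rtl of bufg_example is
    begin
      bufg_i : BUFG
        port map (I => clk_in, O => clk_out);
    end architecture rtl;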
Regional clocks can drive all clock destinations in their region. A region is defined as an area that is
50 I/O and 50 CLB high and half the chip wide. 7-series FPGAs have between two and twenty-four
regions. There are four regional clock tracks in every region. Each regional clock buffer can be driven
from any of four clock-capable input pins, and its frequency can optionally be divided by any integer
from 1 to 8. To repeat, it’s rarely advised to use signals that are not driven by clocking-related
resources as if they are clocks!
For more details on the 7-series FPGAs clocking resources take a look at UG472 [6].
In FPGA terminology, configuration refers to the process of initializing the FPGA with a specific set
of instructions. These instructions, stored in a configuration memory, dictate the interconnections
and functionalities of the FPGA's internal logic elements. Configuration occurs during startup and is
crucial for defining the FPGA's operational characteristics. Remember that SRAM-based FPGAs lose their configuration on loss of power, so they must be configured upon every power cycle. That is why some type of non-volatile memory, such as an external flash device, is typically used to store the bitstream. The overall programming flow can be summarized as follows:
1. Define the Objective: Clearly articulate the desired functionality or task the FPGA is intended
to perform. This forms the basis for subsequent programming steps.
2. Hardware Description Language (HDL): Utilize HDL, such as Verilog or VHDL, to describe the
desired circuitry and behavior. HDL serves as the intermediary language between
human-readable code and the low-level hardware description.
3. Compilation Process: The HDL code undergoes synthesis and implementation processes
using specialized tools. Synthesis translates the high-level HDL code into a netlist,
representing the logical structure. Implementation maps this netlist onto the physical
resources of the FPGA, considering factors like timing and resource utilization.
4. Bitstream Generation: The compiled design is converted into a bitstream – a binary file
containing configuration data. This bitstream is analogous to the firmware for configuring
the FPGA.
5. Configuration Upload: The bitstream is loaded onto the FPGA's configuration memory,
effectively programming the device. This step is typically carried out during the power-up
sequence.
A more detailed description of the configuration and programming of FPGAs is given in section 3
FPGA Design Flow.
The Zynq system-on-chip (SoC) is a notable example of a versatile and powerful integrated circuit
that combines the capabilities of both a traditional processor system and programmable logic within
a single chip. Developed by AMD, the Zynq SoC family integrates a Processing System (PS) based on
ARM Cortex-A9 cores with programmable logic (PL) in the form of an FPGA (Field-Programmable
Gate Array). This unique combination enables designers to harness the flexibility of programmable
logic alongside the processing power of traditional CPUs, making it well-suited for a broad range of
applications.
The ARM Cortex-A9 cores within the Zynq SoC handle general-purpose processing tasks, running
operating systems such as Linux or other real-time operating systems (RTOS). These cores are
responsible for executing high-level software applications, interfacing with peripherals, and
managing system-level operations. Concurrently, the programmable logic section of the Zynq chip
provides a customizable hardware platform that can be tailored to specific tasks or applications,
offering a performance boost for parallelizable and compute-intensive operations.
The integration of processing cores and programmable logic in SoCs has become a trend in modern
embedded systems, providing a balance between the flexibility of software and the performance of
dedicated hardware. Designers can optimize their systems by leveraging the strengths of both
components, tailoring solutions to meet specific requirements and achieve a competitive edge in
terms of performance, power efficiency, and adaptability.
The FPGA design flow is a systematic process that transforms a conceptual hardware description
into a fully functional and optimized FPGA implementation. This journey involves a series of
well-defined stages, each contributing to the realization of a digital design within the constraints
and capabilities of an FPGA. From conceptualization to synthesis, place and route, and finally
bitstream generation, the FPGA design flow encompasses various critical steps. Each stage involves
intricate decisions related to architecture, timing, power, and resource utilization, requiring
designers to strike a balance between performance, flexibility, and efficiency. In this section, we
delve into the intricacies of the FPGA design flow, exploring the key processes and considerations
that engineers navigate to bring their digital designs to life on programmable hardware.
Source Code:
The design process begins with the creation of a hardware description using a hardware description
language (HDL) such as Verilog or VHDL. This source code describes the intended functionality of
the digital circuit. An initial design may also be further abstracted above HDLs using other tools,
such as block diagrams or high-level synthesis.
Logic Synthesis:
Logic synthesis is the process of converting the high-level HDL code into a netlist of logical gates
and flip-flops. This stage involves optimizing the design for factors such as performance, area, and
power.
Technology Mapping:
Technology mapping involves mapping the logical gates in the synthesized netlist to the specific primitives available on the target FPGA, such as look-up tables, flip-flops, and dedicated blocks.
Placement:
Placement involves determining the physical location of each logic element on the FPGA. The goal is
to place critical elements close to each other to minimize delays and optimize performance. Proper
placement is crucial for meeting timing requirements.
Routing:
After placement, the routing step involves creating the interconnections (wires) between the placed
logic elements. The router determines the optimal paths for signals, considering factors such as
signal delays, avoiding congestion, and meeting timing constraints.
Bitstream Generation:
Once the design is placed and routed, the final step is to generate the bitstream. The bitstream is a
binary file that contains configuration information for the FPGA. It specifies how the programmable
elements (look-up tables, flip-flops, etc.) should be configured to implement the desired logic.
Throughout the FPGA design flow, designers use various tools, such as synthesis tools,
place-and-route tools, and vendor-specific tools provided by FPGA manufacturers like AMD (or Altera). The iterative nature of the design flow allows designers to refine and optimize their designs
at each stage, balancing factors like performance, resource utilization, and power consumption.
HDLs support multiple levels of abstraction. Behavioral HDLs focus on specifying the functionality of
a system without detailing its implementation, leveraging constructs like processes and always
blocks. Structural HDLs, on the other hand, allow designers to describe the physical
interconnections and arrangement of hardware components using modules or entities.
Register-Transfer Level (RTL) HDLs, like both Verilog and VHDL, capture the flow of data between
registers and are widely used for describing digital systems. Concurrency is a fundamental feature
of HDLs, allowing designers to model operations happening simultaneously. This concurrency is
expressed through constructs like processes or concurrent signal assignments, enabling a more
natural representation of digital circuit behavior. HDLs support simulation, a crucial aspect of the
design process. Simulation tools allow designers to verify the correctness and performance of their
designs before moving to the physical implementation stage. Additionally, HDLs can be synthesized
into netlists of gates, flip-flops, and other hardware elements. This synthesis process is vital for the
implementation of designs on FPGAs or Application-Specific Integrated Circuits (ASICs).
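For instance, a minimal self-checking VHDL testbench for the 2-to-1 multiplexer sketched earlier in this handbook might look like the following; it is illustrative only and assumes the mux2 entity has been compiled into the work library.

    library ieee;
    use ieee.std_logic_1164.all;

    entity mux2_tb is
    end entity mux2_tb;

    architecture sim of mux2_tb is
      signal a, b, sel, y : std_logic;
    begin
      -- Device under test: the mux2 described earlier.
      dut : entity work.mux2
        port map (a => a, b => b, sel => sel, y => y);

      stimulus : process
      begin
        a <= '0'; b <= '1';
        sel <= '0'; wait for 10 ns;
        assert y = '0' report "expected output a when sel = '0'" severity error;
        sel <= '1'; wait for 10 ns;
        assert y = '1' report "expected output b when sel = '1'" severity error;
        wait;  -- end of stimulus; the process stops advancing here
      end process;
    end architecture sim;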
Standard libraries in HDLs serve as a repository of predefined modules and functions, encompassing
commonly used elements like logic gates, flip-flops, and arithmetic units. This not only expedites
the design process but also promotes code reuse by providing a foundation of well-tested and
established components. In the context of AMD Vivado, designers can further enhance their
efficiency through the utilization of "Language Templates," a feature integrated into the tool. These
templates offer predefined structures for common coding patterns, ensuring consistency and
adherence to best practices. Moreover, Vivado provides a comprehensive suite of tools, including
simulators, synthesizers, and waveform viewers, supporting the entire design flow from
conceptualization to implementation. This integration, coupled with language templates,
streamlines the development process and contributes to the robustness of FPGA and ASIC designs.
In essence, HDLs provide engineers with a powerful set of tools to navigate the intricate landscape
of digital design. They enable the expression of design concepts, facilitate simulation and
verification, and ultimately empower the creation of efficient and effective digital systems. The
choice between Verilog and VHDL, along with the rich features and toolsets they provide,
underscores the significance of HDLs in modern electronic design processes. VHDL will be
discussed in more detail in Chapter Four.
Technology-independent logic optimization removes redundant logic and simplifies logic wherever
possible. The optimized netlist of basic gates is then mapped to look-up tables. Both of these
problems have been extensively studied and good algorithms and tools capable of targeting the
FPGAs we are interested in studying are publicly available, so this handbook does not study these phases
of the synthesis process.
The third synthesis step in Figure 22 is necessary whenever an FPGA logic block contains more than
a single LUT. Logic block packing groups several LUTs and registers into one logic block, respecting
limitations such as the number of LUTs a logic block may contain, and the number of distinct input
signals and clocks a logic block may contain. The optimization goals in this phase are to pack
connected LUTs together to minimize the number of signals to be routed between logic blocks, and
to attempt to fill each logic block to its capacity to minimize the number of logic blocks used.
This problem is a form of clustering. Clustering and partitioning are essentially the same problem;
divide a netlist into several pieces, such that certain constraints, such as maximum partition size, are
respected, and some goal, such as minimizing the number of connections that cross partitions, is
optimized. When a circuit is to be divided into only a few pieces, the problem is called partitioning.
When a circuit is to be divided into many small pieces in one step (as opposed to recursively
partitioning into a few partitions in each step), the problem is usually called clustering.
Achieving a good placement is crucial for FPGA designs, as a poor placement can hinder successful
routing and result in lower operating speeds and increased power consumption. Finding an optimal
placement is challenging, especially for large commercial FPGAs with around 500,000 functional
blocks, leading to an enormous number of possible placements. Due to the computational
complexity of the problem, exhaustive evaluation of all placement options is impractical.
Consequently, the development of fast and effective heuristic placement algorithms is a significant
area of research.
Figure 24: Placement overview: (a) inputs to the placement algorithm, and (b) placement algorithm
output—the location of each block.
Initially, a placement is created, usually of low quality, by assigning each block to the first legal
location found. Then, it gets better over time by suggesting and checking changes, or "moves." A
move involves relocating a few blocks to new places, and a cost function is used to see how each
move affects the overall arrangement. Moves that make the arrangement better are always accepted, while moves that make it worse are accepted with a probability that shrinks as the simulated-annealing "temperature" is gradually lowered; this occasional acceptance of bad moves helps the placer escape poor local minima.
Unlike some other methods, VPR doesn't stick to fixed temperatures and cooling rates. Instead, it
figures out its annealing schedule based on the situation during placement. This flexible approach
helps it produce high-quality results for various design sizes, FPGA setups, and cost considerations.
That's why it's often preferred over methods with more rigid, predetermined schedules.
One advantage of this approach is that it allows for the development of a placement algorithm that
can automatically adjust to a broader range of FPGA setups. This is because the algorithm doesn't
rely on too many assumptions about the device-routing architecture in its cost calculations.
However, a downside is that using a router in the cost calculation takes a lot of computer processing
time. Checking the cost after each move is very demanding, making it hard to evaluate enough
moves quickly for large circuits.
PROXI is an example of a timing-driven FPGA placement algorithm that uses a router for its cost
calculations. The cost in PROXI is a weighted sum of the number of nets that haven't been connected
and the delay of the most critical path in the circuit. After each placement change, PROXI
disconnects all the nets connected to blocks that have moved and plans new routes for them using
a fast, directed-search maze router.
Since each section of the circuit ends up in a different area of the FPGA, this approach reduces the
number of connections going out of each area. This indirectly helps optimize the amount of wiring
needed for the design. This method can handle large problems because there are good, efficient
algorithms for splitting the circuit into sections. However, for certain FPGA setups, this method has
a downside. It doesn't directly optimize the timing of the circuit or the amount of routing needed for
the placement. Hierarchical FPGAs are good candidates for partition-based placement, since their
routing architectures create natural partitioning cut lines.
Once the positions for all the logic blocks in a circuit are chosen, a route is planned to decide which
switches in the FPGA should be activated to connect the input and output pins of the logic blocks
needed by the circuit. In FPGA routing, the usual way is to picture the routing structure of the FPGA
as a directed graph. Each wire and each pin on a logic block becomes a point in this graph, and
possible connections become the lines between them. While some past research has treated FPGAs
as undirected graphs, a directed graph is necessary when modeling directional switches like tri-state
buffers and multiplexers accurately.
Routing a connection means finding a path in this graph between the points representing the pins of
the logic blocks that need to be connected. To use as few of the limited number of wires in an FPGA
as possible, the goal is to keep this path short. It's also crucial that the routing for one connection
doesn't use up the routing resources needed by another connection. That's why most FPGA routers
have some method to avoid congestion and resolve conflicts over routing resources. Another goal is
to make connections on or near the critical path speedy by using short paths and fast routing
resources. Routers aiming to optimize timing this way are called timing-driven, while delay-oblivious
routers focus purely on routability. Since most of the delay in FPGAs comes from the programmable
routing, timing-driven routing is important for achieving good circuit speeds.
FPGA routers can be split into two types: two-step routers, which first assign each net to a set of routing channels (global routing) and then to specific wires within those channels (detailed routing), and combined global-detailed routers, which determine a complete routing down to individual wire segments in a single phase.
Bitstream Generation
A reconfigurable logic device is a bit like a mix between a fixed hardware device and a programmable
instruction set processor. What sets them apart is how they're set up and programmed. Both use
"software" for programming, but they handle it in different ways.
In an instruction set processor, the programming is a set of binary codes fed into the device while it's
running. These codes make the processor change its internal logic and routing on every cycle based
on the input of these binary codes.
On the flip side, a reconfigurable logic device, like an FPGA, is built differently. It has a
two-dimensional array of programmable logic elements connected by a programmable
interconnection network. The significant difference is that an FPGA is usually programmed as a
complete unit, with all its internal components working together at the same time. Unlike an
instruction set processor, the programming data for an FPGA is loaded into the device's internal
units before it starts operating, and typically, no changes are made to the data while the device is
running.
The data used to program a reconfigurable logic device is commonly called a "bitstream," although
this term is somewhat misleading. Unlike an instruction set processor where the configuration data
are continuously streamed into the internal units, an FPGA usually loads its data only once during
setup. The format of the bitstream is often kept as a trade secret by manufacturers, making it less
accessible for experimentation with new tools and techniques by third parties. While most users of
commercial reconfigurable logic devices are okay with the vendor-supplied tools, those interested in
the internal structure find trade secrecy to be an important issue.
The bitstream is like a map that shows how different small hardware parts come together in a
reconfigurable logic device to create a working digital circuit. While there's no strict limit to the types
of units in a reconfigurable logic device, two basic structures are common in most modern FPGAs:
the lookup table (LUT) and the switch box.
Similar to switch boxes, the configuration bitstream data for Input/Output Blocks (IOBs) consists of
bits that set flip-flops within them to choose specific features. In newer generations of FPGA
devices, there are also special-purpose units like block memory and multiplier units. The actual data
bits may be part of the bitstream, initializing the BlockRAM during power-up. However, to keep the
bitstream size smaller, this data might be absent, in which case internal circuitry is needed to reset or initialize the memory contents after configuration.
Apart from IOBs, other internal units, like BlockRAM, are connected to the switch boxes in different ways, and deciding their location and interfacing in the interconnection network is a significant
architectural choice in modern reconfigurable logic device design. Many other features in the FPGA,
such as global control related to configuration and reconfiguration, ID codes, and error-checking
information, have control bits in the bitstream. Implementation of these features can vary widely
among different device families.
Basic control for bit-level storage elements, like flip-flops on the LUT output, is a common feature.
Control bits often set circuit parameters, such as the type of flip-flop (D, JK, T) or the clock edge
trigger type (rising or falling edge). Being able to change the flip-flop into a transparent D-type latch
is also a popular option. Each of these bits contributes to the configuration data, with one set of
flip-flop configuration settings per LUT being typical.
Most FPGAs use a serial way of loading the configuration, but some have a parallel option that uses
eight I/O pins to load the data all at once. This can be useful for designs using an 8-bit memory
device or for applications where the FPGA needs to be reprogrammed often and speed is
crucial—like when it's controlled by another processor. Just like the serial approach, the pins can go
back to regular I/O tasks once the downloading is done. Quad SPI flash devices for boot are common
across most Digilent boards. The programmer essentially loads the bitstream into a flash memory
which the FPGA boots from.
In the factory, during the testing of FPGA devices after they are made, having a high-speed
configuration can be extremely helpful. Testing FPGAs can be expensive because of the time spent
connected to test equipment. So, making the configuration download faster can mean the FPGA
manufacturer needs fewer pieces of test equipment, saving a lot of money during manufacturing.
The need for high-speed download is more about making the testing process more efficient than
meeting any customer requirements for changing how the FPGA works while it's running.
There's also a kind of device that uses non-volatile memory, like Flash-style memory, instead of RAM
and flip-flops for the internal logic and control. These devices, like those from companies such as
Actel, only need to be programmed once and don't need to reload configuration data when power is cycled.
Within the domain of FPGA development, understanding the landscape of vendors, families,
development environments, simulation tools, and programming languages is pivotal for proficient
design and implementation. This section undertakes a detailed examination of these fundamental
components, starting with an exploration of popular FPGA vendors and families. Focusing notably
on AMD and pertinent open-source software that aids in the development for their components,
this subsection offers insights into the prevailing industry standards and the tools available to FPGA
developers. By delving into the specifics of vendor offerings and the associated development
ecosystem, readers will garner a nuanced understanding of the foundational elements shaping
contemporary FPGA design methodologies.
AMD: Formerly known as Xilinx, AMD continues to hold a prominent position in the FPGA market,
offering a comprehensive portfolio of families catering to a wide range of applications. The Spartan
series addresses entry-level and cost-sensitive applications, while the Artix series extends this
versatility to mid-range applications with enhanced performance and logic capacity. The Kintex
series delivers heightened performance and scalability, suitable for industrial and aerospace
applications, while the Virtex series offers unparalleled performance and integration features, ideal
for data center acceleration and high-performance computing. Digilent uses AMD FPGAs.
Intel: As a major competitor to AMD, Intel – formerly Altera – boasts a formidable line-up of FPGA
families. The Stratix series targets high-performance computing and data center applications, while
the Cyclone series caters to cost-sensitive and low-power applications, providing a comprehensive
range of options for FPGA developers.
Lattice Semiconductor: Lattice Semiconductor's ECP and MachXO families target low-power and
compact form-factor applications, providing alternatives to traditional FPGA offerings with a focus
on power efficiency and compactness.
Projects like IceStorm and Project X-Ray are great examples of free tools that help people work with
FPGAs made by companies like Lattice Semiconductor and AMD. Besides these, there are more
tools that are very useful:
Yosys is a free, open-source synthesis tool that turns Verilog design code into a netlist the rest of the FPGA flow can work with, making it a cornerstone of open-source FPGA development.
SymbiFlow aims to be the GCC of FPGAs: a fully open-source, vendor-neutral toolchain that lets designers build FPGA designs in a standard way, no matter which type of FPGA they are targeting.
NextPNR is a place-and-route tool that fits a synthesized design onto the resources of the actual FPGA chip. It is part of the wider SymbiFlow effort and is essential for turning a design into working hardware.
Tools like GHDL and Verilator let people simulate their FPGA designs on a computer before trying them on a real chip, which is very helpful for finding and fixing mistakes.
Having these tools makes it easier for more people to work with FPGAs, meaning more projects and
ideas can come to life. It also helps students learn better and lets everyone share their knowledge
and help one another. The world of FPGA development is getting a big boost from these free
software projects.
Vivado Design Suite
Key Features:
High-Level Synthesis (HLS): Vivado allows designers to model FPGA circuits in higher-level
programming languages such as C, C++, and SystemC, dramatically reducing the design complexity
and time. Intellectual property (IP) cores developed in Vitis HLS can be used in Vivado designs.
IP Integrator: This feature enables the rapid composition of IP cores and custom modules into a
single design canvas, facilitating a block design methodology that enhances productivity.
Logic Simulation: Integrated logic simulators provide the capability to test and verify design
behavior before hardware implementation.
Implementation Tools: Vivado offers advanced place and route algorithms that optimize the
physical layout on the FPGA fabric, ensuring the best possible performance and resource utilization.
Vitis Unified Software Platform
Key Features:
Seamless Integration with Vivado: Vitis works hand-in-hand with Vivado, allowing designs and IP
generated in Vivado to be easily imported and utilized within the Vitis environment for software
acceleration.
AI Engine: For AI and machine learning applications, Vitis provides specialized support for AI model
development and deployment, leveraging the adaptable compute architecture of AMD FPGAs.
Digilent boards generally don’t support this, as AI engines are a feature of higher cost silicon.
Comprehensive Software Libraries: Vitis includes optimized libraries for a range of applications,
including data analytics, image processing, and financial computations, allowing developers to
leverage these pre-built modules for rapid application development. Note that some of these
libraries may be difficult to set up on Digilent FPGA development boards, as not all of the required
PetaLinux support is available.
Vitis is for writing software to run in an FPGA, and is the combination of a couple of different AMD
tools, including what was AMD SDK, Vivado High-Level Synthesis (HLS), and SDSoC. The
functionality of each of these is now merged together under Vitis. To break each of these down:
1. AMD SDK (Vitis): Write C/C++ to run on a processor in a design you created in Vivado. This
code often ends up being at least partially used to configure and control elements of the
hardware design – it’s easier to rebuild, tweak, and debug than the hardware portion is.
2. Vivado HLS (Vitis HLS): Write C/C++ to be built into a block which you can include in a Vivado
project. This block can often be reused in multiple projects, and even potentially be loaded
up in Vivado for manual optimization.
3. SDSoC (Vitis): Write C/C++ to be built into a block which the tool stitches into a previously
created Vivado design. You take a platform with some I/O built in, and start accelerating certain
data processing functions of your software design by building them into the hardware (while
still writing them in software languages).
Model Composer is an AMD tool that integrates with MATLAB and Simulink. It is generally installed alongside the Vitis platform and streamlines the design of digital signal processing systems.
The tool's key features include efficient HDL code generation, comprehensive simulation and
verification capabilities, and support for both fixed-point and floating-point models. This makes it
easier to balance performance, accuracy, and resource use. By simplifying the transition from
conceptual models to hardware-ready designs, Model Composer makes FPGA technology
accessible to a wider range of engineers, enhancing productivity and innovation in DSP applications.
1. Design Entry
Project Creation: Designers initiate the process by creating a new project within Vivado,
specifying project settings such as device family, device part or target board, and simulation
language.
Design Sources: Designers add source files, including HDL (Hardware Description Language)
files, constraints, and IP (Intellectual Property) cores to the project.
2. Synthesis
RTL Synthesis: Vivado synthesizes the RTL (Register Transfer Level) code to generate a
logical netlist, optimizing the design for target performance and area constraints.
High-Level Synthesis (HLS): Optionally, designers can leverage HLS to synthesize C/C++
code into hardware-accelerated functions.
3. Implementation
Place and Route: Vivado performs automated placement and routing, mapping the logical
netlist onto the physical FPGA fabric while meeting timing and resource constraints.
4. Bitstream Generation
Bitstream Creation: Vivado generates the bitstream—a binary file containing configuration
data—representing the finalized FPGA design.
Configuration File: The bitstream, along with other necessary configuration files, is prepared
for deployment onto the target FPGA device.
5. Verification
Timing Analysis: Vivado conducts timing analysis to verify that design requirements, such as
clock frequency and setup/hold times, are met.
Simulation: Designers can simulate the synthesized design using Vivado's built-in simulator
to validate functionality and behaviour.
6. Hardware Definition
Vivado also exports a hardware component called the Hardware Definition File. The hardware
definition file serves as a comprehensive container holding all the necessary information to
construct a platform for any targeted AMD device. Within this container, a key component is the
HWH, or Hardware Handoff File. This file emerges from the execution of output products on a Block
Design, essentially a detailed map showing the interconnection and functionality of various
components. The HWH file's knowledge is confined to the scope of the Block Design.
This HWH file plays a crucial role for software tools, as it encapsulates all the requisite details for
tailoring an application specifically for the device in question. It delineates the architecture of the
device, including the central processing units (CPUs), data pathways (buses), integral components
(IP), and the interfaces for external communication (ports and pins) such as interrupt signals.
In scenarios involving a Zynq Ultrascale+ device (or a Zynq 7-series), the hardware definition file
package is enriched with additional files: psu_init.c/h and tcl scripts. These files are instrumental in
configuring the device's Processing Subsystem according to the parameters defined in Vivado, the
design tool utilized for the project. The psu_init scripts are particularly pivotal during the First Stage
Boot Loader (FSBL), ensuring the system boots up with the correct configurations. The tcl script, on
the other hand, aids in debugging by facilitating the same configuration tasks.
Furthermore, if the project incorporates a block RAM (BRAM) system that is directly addressable in
the Programmable Logic (PL), or in cases involving FPGA setups not integrated into a System on Chip (SoC), additional supporting files describing that memory are included in the package as well.
The hardware definition file is then loaded into Vitis and all the information necessary regarding the
hardware design is loaded. Various header files and functions are then ready to use from the
software part of the project.
While both Vivado and Vitis offer high-level synthesis capabilities, their primary distinction lies in
their target audience and application focus. Vivado is tailored for hardware engineers focusing on
FPGA circuit design, offering detailed control over hardware aspects. In contrast, Vitis targets
software developers and data scientists looking to leverage FPGA acceleration without delving into
the complexities of hardware design. It is also important to note that Vitis HLS typically comes into play earlier in the embedded software flow, before the Vivado block design stage.
Simulation Tools
Simulation constitutes a vital aspect of FPGA development, enabling designers to verify the
functionality and performance of their designs before deployment. AMD offers a suite of simulation
resources integrated within the Vivado Design Suite, providing comprehensive support for
behavioral and timing analysis of FPGA designs.
AMD's simulation resources encompass a range of tools and utilities tailored to meet the diverse
simulation needs of FPGA developers. The Vivado Simulator, also known as XSim, is a built-in feature of the Vivado Design Suite and offers advanced capabilities for RTL simulation, enabling designers to
validate their designs at the register-transfer level. With support for industry-standard languages
such as VHDL and Verilog, as well as advanced verification methodologies such as SystemVerilog
Assertions (SVA), the Vivado Simulator facilitates thorough and efficient verification of FPGA
designs.
Additionally, AMD provides support for third-party simulation tools such as ModelSim from Mentor,
a widely-used simulator in the FPGA industry. ModelSim offers advanced debugging features, support for mixed-language simulation, and efficient handling of large designs.
In addition to proprietary simulation tools, open-source VHDL simulation frameworks such as GHDL
and Icarus Verilog offer viable alternatives for FPGA developers. GHDL, a free and open-source VHDL
simulator, provides support for IEEE standard VHDL language constructs and offers seamless
integration with popular development environments such as Visual Studio Code (VS Code) through
the use of appropriate extensions. Similarly, Icarus Verilog offers a robust simulation environment
for Verilog designs, with support for mixed-language simulation and advanced debugging features.
By leveraging AMD's simulation resources and exploring alternative simulation tools such as
ModelSim and open-source VHDL simulators, FPGA developers can ensure the thorough validation
of their designs and mitigate potential errors and performance bottlenecks early in the development
process. Through rigorous simulation and verification, designers can enhance the reliability and
robustness of their FPGA-based solutions, ultimately delivering superior performance and
functionality to end-users.
Central to the harnessing of FPGA capabilities lies the mastery of Hardware Description Languages
(HDLs), with VHDL emerging as a preeminent choice among engineers and designers. This chapter
aims to serve as a comprehensive guide to FPGA programming, with a particular emphasis on VHDL.
It will navigate through the fundamental concepts, delve into advanced techniques, and explore
practical applications, ensuring a robust understanding of FPGA development.
Introduction to VHDL
VHDL, an acronym for Very-High-Speed-Integrated-Circuit Hardware Description Language, is a
robust and versatile language instrumental in specifying the behaviour and structure of digital
systems. Originating from the U.S. Department of Defense's Very High-Speed Integrated Circuit
(VHSIC) program, VHDL has ascended to become an industry-standard language for FPGA and ASIC
design. This section initiates the journey into VHDL with a comprehensive exploration of its basic
syntax, data types, control structures, and modelling principles. With VHDL it is always important to
remember that one is describing hardware; therefore, all the issues one expects of digital hardware circuits will also appear when implementing VHDL designs on FPGAs.
Basic Syntax
VHDL employs a structured syntax resembling natural language constructs, facilitating the
expression of complex digital designs in a concise and comprehensible manner. The basic syntax of
VHDL comprises entity-architecture pairs, where entities define the interface of a hardware
component, while architectures describe its internal behaviour. The following VHDL snippet shows a simple entity-architecture pair; module instantiation, where one entity is used inside another, is discussed further in following sections.
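A minimal sketch of such a snippet is shown below; the entity name AND_Gate is illustrative, while the ports and the architecture name follow the description that comes after it.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity AND_Gate is
    Port (
        A : in  STD_LOGIC;
        B : in  STD_LOGIC;
        Y : out STD_LOGIC
    );
end AND_Gate;

architecture Behavioural of AND_Gate is
begin
    Y <= A and B;
end Behavioural;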
This VHDL code defines an AND gate, a fundamental digital logic component that performs a logical
AND operation on its input signals. Let's break down the code for beginners:
Port Declaration: Inside the entity, the Port keyword is used to declare the input and output ports of
the AND gate. A and B are declared as input ports (in), while Y is declared as an output port (out).
STD_LOGIC represents a single-bit signal, which can take on values '0', '1', 'U' (undefined), 'X'
(unknown), 'Z' (high impedance), or 'W' (weak unknown).
Architecture Declaration (Behavioural): The architecture keyword defines the internal behaviour or
functionality of the entity. Behavioural is the name given to this architecture.
Behavioural Code: Inside the architecture, Y <= A and B; is the behavioural code that describes the
AND gate's functionality. The symbol <= is the signal assignment operator, indicating that the
output signal Y is assigned the result of the AND operation between input signals A and B. The and
keyword represents the logical AND operation, which yields '1' (true) only when both inputs are '1'.
Otherwise, it results in '0' (false). This way a designer can implement digital logic circuitry inside a
black box called the entity. In subsequent sections the use of multiple entities together is discussed.
Data Types
VHDL offers a rich set of data types catering to various levels of abstraction in digital design. These
data types encompass scalar types, composite types, and enumerated types, each serving distinct
purposes in modelling digital systems. Data types and converting between them can be a common
stumbling-block for those first learning VHDL, as it is a strongly-typed language and requires explicit
conversion between types. For example, the addition operator found in the IEEE numeric standard
package, which is extremely commonly used, takes two signed or unsigned inputs and outputs the
corresponding signed or unsigned type. To mix and match data types, or use the more-portable
standard_logic_vectors commonly used for I/O interfaces, a designer must use type-conversion
functions.
Scalar Types: Scalar types represent single values and include BOOLEAN, INTEGER, and REAL. These
types are essential for expressing individual signals and variables within a design.
Composite Types: Composite types combine multiple values into structured data objects, enabling
the representation of complex data structures. ARRAY and RECORD are common examples of
composite types in VHDL.
-- 1D Array Example
type Word_Array is array(0 to 7) of STD_LOGIC_VECTOR(7 downto 0);
signal word_data : Word_Array;
-- 2D Array Example
type Matrix_Array is array(0 to 3, 0 to 3) of INTEGER;
signal matrix_data : Matrix_Array;
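A record, the other common composite type, groups named fields that may have different types. A brief illustrative sketch (the Pixel_Record name and its fields are hypothetical):
-- Record example
type Pixel_Record is record
    red   : STD_LOGIC_VECTOR(7 downto 0);
    green : STD_LOGIC_VECTOR(7 downto 0);
    blue  : STD_LOGIC_VECTOR(7 downto 0);
end record;
signal pixel_data : Pixel_Record;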
Enumerated Types: Enumerated types facilitate the definition of discrete sets of values, enhancing
code readability and maintainability. Enumerated types are particularly useful for modelling state
machines and finite state systems.
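For example, the states of a simple controller might be declared as follows (the names are illustrative):
-- Enumerated type describing the states of a state machine
type state_type is (IDLE, LOAD, RUN, DONE);
signal current_state : state_type := IDLE;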
Control Structures
VHDL encompasses a variety of control structures for specifying the flow of execution within a
design. These control structures include sequential statements and concurrent statements, each
serving distinct purposes in describing the behaviour of digital systems.
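One possible sketch of a sequential example, consistent with the description below, is a clocked process in which the two assignments are written one after the other (the names output, input1, input2, and counter are assumed, with counter an unsigned signal):
process (clk)
begin
    if rising_edge(clk) then
        -- Statements inside a process are evaluated in order, top to bottom
        output  <= input1 and input2;
        counter <= counter + 1;
    end if;
end process;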
In this example, the statements inside the process block are executed sequentially. First, the output
signal is assigned the result of the AND operation between input1 and input2. Then, the counter is
incremented by 1.
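A corresponding concurrent sketch, again using the same assumed signal names, places the AND assignment outside any process and keeps only the counter in a clocked process:
-- Concurrent signal assignment: output is driven continuously
output <= input1 and input2;

-- The counter advances only on the rising edge of clk
process (clk)
begin
    if rising_edge(clk) then
        counter <= counter + 1;
    end if;
end process;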
In this example, the assignments to output and counter occur concurrently. The output signal is
continuously updated based on the AND operation between input1 and input2, while the counter is
incremented only on the rising edge of the clock (clk). Both assignments can happen
simultaneously, reflecting the concurrent nature of VHDL.
The fundamental distinction between sequential and concurrent statements in VHDL lies in their
execution order and timing behaviour. Sequential statements follow a predefined order of execution,
where each statement is processed in sequence as they appear within a process or block. In
contrast, concurrent statements are executed simultaneously, without any predetermined order.
This concurrent execution enables multiple actions to occur concurrently, allowing for parallel
behaviour within the design. This is the real power of the FPGA. The following sections explore this point in further detail, particularly the section on VHDL Data Types and Conversions.
Sensitivity List: A sensitivity list in VHDL specifies the signals that the process is sensitive to. It
essentially tells the simulator or hardware which events should trigger the execution of the process.
The process will execute whenever any of the signals in the sensitivity list experiences a change in
value. For example:
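A minimal sketch of such a process, in which result is an assumed output signal:
process (signal1, signal2)
begin
    -- Re-evaluated whenever signal1 or signal2 changes value
    result <= signal1 and signal2;
end process;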
In this example, the process will be triggered whenever either signal1 or signal2 changes.
Signal Assignment: In VHDL, signals represent physical connections between different parts of a
digital circuit. Signal assignment within a process involves updating the value of a signal based on
certain conditions or expressions.
process (clk)
begin
if rising_edge(clk) then
signal_out <= signal_in1 and signal_in2;
end if;
end process;
In this example, the signal_out is updated whenever there is a rising edge on the clock signal (clk).
The value assigned to signal_out is the logical AND of signal_in1 and signal_in2. A very important
thing to keep in mind is that signals do NOT immediately adopt the value assigned to them; they are updated when the process finishes executing.
Variable Assignment: Variables in VHDL are used for temporary storage within a process and are
only accessible within the process in which they are declared. Unlike signals, variables are not bound
by the event-driven model of signal changes. Variable assignments occur sequentially within the
process, meaning they execute one after the other and they are updated instantly.
process (clk, reset)
-- count is a variable: it is updated immediately, in the order the statements execute
variable count : integer := 0;
begin
if reset = '1' then
count := 0;
elsif rising_edge(clk) then
count := count + 1;
end if;
end process;
In the next chapter sequential statements are discussed. These are the tools available to the
designer in a process.
Sequential Statements
Variables
Variables serve as containers for storing intermediate values between sequential VHDL statements.
They are restricted to processes, procedures, and functions, and remain local to these constructs.
The assignment operator ":=" is utilized when assigning a value to a variable.
Note: Both signals and variables transport data within a design. However, signals are required for
conveying information between concurrent elements of the design. The following are examples of
the most common sequential statements available to the designers.
If-then-else Statement
The following is a high-level example of the If-then-else Statement.
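In outline, the statement takes the following general form, with the condition and statement placeholders standing in for real code:
if boolean_expression then
    sequential_statements;
elsif boolean_expression then
    sequential_statements;
else
    sequential_statements;
end if;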
The following includes Boolean expressions commonly implemented in the If-then-else Statement.
process ( a, b, m, n)
begin
if m = n then
r <= a + b;
elsif m > 0 then
r <= a - b;
else
r <= a + 1;
end if;
end process;
Case Statement
The following is a high-level example of the Case Statement.
case sel is
when choice_1 =>
sequential_statements;
when choice_2 =>
sequential_statements;
...
when others =>
sequential_statements;
end case;
The following includes sequential statements commonly implemented in the Case Statement.
case sel is
when "00" =>
r <= a + b;
when "10" =>
r <= a - b;
when others =>
r <= a + 1;
end case;
The following includes sequential statements commonly implemented in the for loop.
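A brief sketch of such a loop is given below; the names a, b, and y are assumed to be 8-bit STD_LOGIC_VECTOR signals.
process (a, b)
begin
    for i in 7 downto 0 loop
        -- One bit of the result is computed per loop iteration
        y(i) <= a(i) xor b(i);
    end loop;
end process;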
While Loop
The following is a high-level example of the while loop.
The following includes sequential statements commonly implemented in the while loop; a brief sketch appears below. Note that the with/select construct shown after it is a selected signal assignment, a concurrent statement rather than a sequential one.
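In this sketch, total is assumed to be an integer signal and start a trigger signal:
process (start)
    variable i   : integer;
    variable sum : integer;
begin
    i   := 0;
    sum := 0;
    while i < 8 loop
        -- Accumulate the integers 0 through 7
        sum := sum + i;
        i   := i + 1;
    end loop;
    total <= sum;
end process;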
with s select
z <= a when "00",
b when "01",
c when "10",
d when others;
Synchronous Logic
Synchronous logic relies on clock signals to synchronize operations, ensuring predictable and
deterministic behaviour. This section explores the foundational concepts of synchronous logic,
including the role of clock signals, clock domain crossing, and the impact of clock skew and jitter on
system performance. Additionally, it discusses techniques for designing synchronous circuits and
mitigating potential timing hazards.
Clocking Considerations
In synchronous logic design, employing good patterns for using clocks is crucial for ensuring reliable
and efficient operation of digital circuits. One fundamental practice is maintaining a single clock
domain throughout the design. By using a single primary clock signal to synchronize all sequential
elements within the FPGA, timing analysis is simplified, and the risk of timing violations is reduced.
entity MyDesign is
Port (
clk : in STD_LOGIC;
reset : in STD_LOGIC;
data_in : in STD_LOGIC_VECTOR(7 downto 0);
data_out : out STD_LOGIC_VECTOR(7 downto 0)
);
end MyDesign;
Additionally, when interfacing between different clock domains, proper clock domain crossing
techniques should be employed to synchronize signals and avoid metastability issues. For instance,
the following VHDL code demonstrates a double synchronization technique using two flip-flops to
transfer data safely between two clock domains:
process (clk2)
begin
if rising_edge(clk2) then
-- data_sync1 carries the value registered in the clk1 domain;
-- data_sync2 is the first synchronizing flip-flop in the clk2 domain
data_sync2 <= data_sync1;
-- data_out is the second flip-flop and is safe to use in the clk2 domain
data_out <= data_sync2;
end if;
end process;
For more information on properly handling clock domain crossing and commonly used design
techniques, see Clifford E. Cummings' paper for Sunburst Design, Inc.
Moreover, optimizing the clock tree is essential to minimize clock skew and jitter, ensuring
consistent clock signals across the FPGA. Proper placement and routing of clock signals, along with
clock skew analysis, contribute to improved timing closure and overall performance.
Lastly, incorporating clock enable signals allows for finer control over data capture by enabling or
disabling flip-flops or registers based on specific conditions. This helps reduce unnecessary power
consumption when not actively processing data. By following these good patterns for using clocks,
designers can optimize the utilization of clock signals in synchronous logic designs, ensuring
robustness, reliability, and performance in FPGA-based systems.
Resetting
Proper handling of reset signals is paramount in synchronous logic design to ensure reliable
initialization and operation of digital circuits. This subsection explores the concepts of synchronous
and asynchronous resets, highlighting their distinctions and offering examples of their
implementation in VHDL.
Synchronous Reset
Synchronous resets are synchronized to the clock signal, guaranteeing that the reset operation
occurs at a known point relative to the clock edge. This synchronization mitigates potential timing
hazards and ensures consistent behaviour across different clock domains. In VHDL, synchronous
resets are typically realized using a flip-flop with a reset-enable input. Consider the following VHDL
snippet illustrating the implementation of a synchronous reset:
process (clk)
begin
if rising_edge(clk) then
if reset = '1' then
-- Synchronous reset: reset flip-flop to '0' on active edge of clock
flip_flop <= '0';
else
-- Update flip-flop state on rising edge of clock
flip_flop <= data_in;
end if;
end if;
end process;
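Asynchronous Reset
Asynchronous resets, by contrast, take effect as soon as the reset signal is asserted, regardless of the clock. A sketch of the asynchronous counterpart of the example above, reusing the same flip_flop and data_in signals, is:
process (clk, reset)
begin
    if reset = '1' then
        -- Asynchronous reset: acts immediately, independent of the clock
        flip_flop <= '0';
    elsif rising_edge(clk) then
        -- Normal operation: update the flip-flop on the rising clock edge
        flip_flop <= data_in;
    end if;
end process;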
Notice how, in the asynchronous version, the reset signal must be in the sensitivity list of the process, so the process is triggered by a change on either the clock or the reset. This way the reset signal acts completely independently of the clock and is therefore asynchronous. Understanding the principles of synchronous and asynchronous resets and their VHDL implementations empowers designers to effectively manage reset signals, ensuring the robustness and reliability of their digital designs.
Digilent provides some common modules for handling reset synchronization and some basic CDC
(Clock Domain Crossing) techniques, in the vivado-library repository on GitHub, here.
Asynchronous Logic
Asynchronous logic, on the other hand, operates independently of clock signals, introducing
potential timing hazards and metastability issues. Asynchronous logic in VHDL refers to digital
circuitry that operates independently of a clock signal, unlike synchronous logic which relies on
clock signals for synchronization. In asynchronous logic design, circuit elements respond
immediately to changes in their inputs, without waiting for a clock signal to trigger their actions. This
approach offers advantages in certain scenarios where strict timing requirements are not critical or
where responsiveness to input changes is paramount.
In VHDL, asynchronous logic can be implemented using processes sensitive to input signals,
allowing for immediate response to changes in input values. For example, an asynchronous D
flip-flop can be designed to update its output whenever the input signal changes, rather than waiting
for a clock signal.
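As a sketch of this idea, a level-sensitive storage element can be described with a process that is sensitive only to its data and enable inputs (the names d, enable, and q are illustrative); note that this infers a latch rather than a flip-flop:
process (d, enable)
begin
    if enable = '1' then
        -- q follows d whenever enable is high and holds its value otherwise
        q <= d;
    end if;
end process;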
While asynchronous logic offers flexibility and responsiveness, it also introduces challenges such as
metastability, where flip-flops may capture uncertain values due to input changes occurring near the instant at which they are sampled.
Overall, asynchronous logic in VHDL provides a versatile approach to digital circuit design, offering
responsiveness and simplicity in certain applications where strict timing synchronization is not
required. However, careful consideration of timing hazards and appropriate design techniques is
necessary to ensure the robustness and reliability of asynchronous circuits.
entity Counter is
generic (
WIDTH : integer := 8 -- Default width of 8 bits
);
Port (
clk : in STD_LOGIC;
reset : in STD_LOGIC;
count : out STD_LOGIC_VECTOR(WIDTH-1 downto 0)
);
end Counter;
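A possible architecture for this entity, using the WIDTH generic and the numeric_std package (the architecture name and the internal signal are illustrative), might look as follows:
-- Assumes: use IEEE.NUMERIC_STD.ALL;
architecture Behavioural of Counter is
    signal count_reg : unsigned(WIDTH-1 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                count_reg <= (others => '0');
            else
                count_reg <= count_reg + 1;
            end if;
        end if;
    end process;

    count <= STD_LOGIC_VECTOR(count_reg);
end Behavioural;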
The top-level module acts as a centralized control hub, receiving external inputs, processing them
through the system, and generating corresponding outputs. It orchestrates the flow of data and
control signals throughout the design, facilitating communication between different modules and
managing overall system behaviour.
Designers often use top-level modules to instantiate and interconnect lower-level modules or
components, organizing the design into a hierarchical structure. This hierarchical approach
simplifies the design process, enhances modularity, and promotes code reuse by breaking down
complex systems into smaller, more manageable units.
The following is an example of a top-level module which is instantiating the counter in the previous
VHDL example.
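A sketch of such a top-level module is given below; the TopLevel name, the 16-bit width, and the count_out port are illustrative choices.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity TopLevel is
    Port (
        clk       : in  STD_LOGIC;
        reset     : in  STD_LOGIC;
        count_out : out STD_LOGIC_VECTOR(15 downto 0)
    );
end TopLevel;

architecture Structural of TopLevel is
begin
    -- Instantiate the Counter, overriding the WIDTH generic
    counter_inst : entity work.Counter
        generic map ( WIDTH => 16 )
        port map (
            clk   => clk,
            reset => reset,
            count => count_out
        );
end Structural;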
By leveraging hierarchical design and module parameterization in VHDL, designers can create
scalable and reusable digital systems that are adaptable to diverse requirements and facilitate
efficient development processes. Overall, top-level modules play a pivotal role in digital design,
serving as the foundation upon which complex systems are built. They provide a centralized
interface for system control and integration, facilitating efficient development, testing, and
maintenance of digital designs.
VHDL Data Types and Conversions
1. Scalar Types: Scalar types represent single values and include basic types such as BIT, BOOLEAN, INTEGER, REAL, and TIME.
2. Composite Types: Composite types, such as ARRAY and RECORD, group multiple values into structured data objects.
3. Enumeration Types: Enumeration types define a set of named values, making them well suited to describing states and operating modes.
4. Access Types: Access types provide references to objects in memory, enabling dynamic
memory allocation and manipulation.
5. File Types: File types are used for file I/O operations within VHDL designs. (Not synthesisable)
Each data type in VHDL serves specific purposes and offers unique capabilities for digital design.
Understanding the characteristics and usage scenarios of each type is crucial for effective design
implementation.
Conversions
Data and type conversions in VHDL involve transforming data from one type to another. These
conversions can be implicit or explicit, depending on the context and compatibility between the
source and target types. Examples of data conversions include:
1. Type Conversion: Type conversion operations enable transforming data from one type to
another using explicit casting operations. These conversions are essential for ensuring
compatibility between different types in hardware designs. For example:
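A brief sketch of some common conversions, assuming the IEEE.NUMERIC_STD package and the illustrative signal names below:
signal slv_in  : STD_LOGIC_VECTOR(7 downto 0);
signal u_val   : unsigned(7 downto 0);
signal int_val : integer range 0 to 255;
signal slv_out : STD_LOGIC_VECTOR(7 downto 0);

-- In the architecture body:
u_val   <= unsigned(slv_in);                           -- type cast: std_logic_vector to unsigned
int_val <= to_integer(u_val);                          -- conversion function: unsigned to integer
slv_out <= STD_LOGIC_VECTOR(to_unsigned(int_val, 8));  -- integer back to std_logic_vector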
Conversions play a vital role in VHDL designs, enabling the transformation of data between different
types while maintaining synthesizability for hardware implementation. By leveraging these
conversion techniques effectively, designers can develop robust and efficient digital systems
suitable for hardware synthesis.
Figure 25: A diagram illustrating how to convert between the most common VHDL types.
This diagram was taken from a great resource for VHDL type conversions on bitweenie.com.
Pipeline Design: Pipeline design techniques enhance system throughput and performance by
breaking down complex operations into smaller stages, enabling parallel processing and reducing
latency.
FIFO Design: First-In-First-Out (FIFO) design methodologies facilitate efficient data buffering and
management, ensuring smooth data flow and preventing data loss or overflow.
Finite State Machine (FSM) Design: FSM design techniques enable the implementation of complex
control logic and state-dependent behaviour, enhancing system functionality and versatility.
Clock Domain Crossing (CDC) Mitigation: CDC mitigation strategies address timing issues arising
from data transfer between different clock domains, ensuring reliable and synchronized operation
across the entire system.
Resource Sharing: Resource sharing techniques reduce hardware resource utilization by identifying
and consolidating common logic elements, minimizing area overhead and improving design
efficiency.
Clock Gating: Clock gating strategies optimize power consumption by selectively enabling or
disabling clock signals based on specific conditions, reducing dynamic power dissipation in the
digital system.
Area Optimization: Area optimization methodologies focus on minimizing the physical footprint of
the design by optimizing logic placement, routing, and resource allocation to achieve compact and
efficient designs.
Timing Closure Techniques: Timing closure techniques ensure that the design meets timing
constraints and achieves reliable operation by optimizing critical paths, balancing clock skew, and
resolving timing violations.
In this chapter of the FPGA Handbook the essential design techniques and best practices for developing efficient and reliable digital systems are discussed.
RTL Design
RTL design serves as the foundation for describing digital circuits using registers and combinational
logic. It involves partitioning the design into data path and control logic, ensuring synchronous
operation, and adhering to coding guidelines for clarity and maintainability. The following is an
example of a Register-Transfer Level Flip-Flop:
entity D_FF is
Port (
clk : in STD_LOGIC;
rst : in STD_LOGIC;
d : in STD_LOGIC;
q : out STD_LOGIC
);
end D_FF;
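The entity above can be completed with an architecture along the following lines, here using a synchronous reset (the architecture name RTL is illustrative):
architecture RTL of D_FF is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if rst = '1' then
                q <= '0';
            else
                -- Register the data input on the rising clock edge
                q <= d;
            end if;
        end if;
    end process;
end RTL;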
Mealy Machines: In a Mealy machine, both the outputs and the state transitions are dependent on
the current state and the inputs. The output is a function of both the current state and the input. This
type of FSM is characterized by its ability to produce outputs that can change asynchronously with
respect to the inputs.
Moore Machines: Unlike Mealy machines, Moore machines have outputs that are only dependent on
the current state. The state transitions are determined solely by the current state and the inputs.
This type of FSM is known for its synchronous output behaviour, where the output changes only at
the clock edge.
FSMs are commonly represented using state transition diagrams, where nodes represent states,
and directed edges represent transitions triggered by inputs. Each state is associated with specific
outputs or actions that occur when the system is in that state. These diagrams provide a visual
representation of the system's behaviour and aid in understanding and designing FSMs.
FSMs find applications in various areas of digital design, including control systems, protocol
implementations, and stateful data processing. They are versatile tools that allow designers to
model complex behaviour in a systematic and structured manner, facilitating the development of
efficient and reliable digital systems. Understanding FSMs is essential for anyone involved in digital
design, as they form the basis for many advanced design techniques and methodologies.
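The example below shows a simple Moore FSM. The entity declaration and the state-transition process given first are an illustrative sketch; the moore_fsm name and the three states S0, S1, and S2 are assumptions chosen to match the output process and the description that follow.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity moore_fsm is
    Port (
        clk    : in  STD_LOGIC;
        reset  : in  STD_LOGIC;
        input  : in  STD_LOGIC;
        output : out STD_LOGIC
    );
end moore_fsm;

architecture RTL of moore_fsm is
    type state_type is (S0, S1, S2);
    signal state : state_type := S0;
begin
    -- State transition process: the next state depends on the current state and the input
    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                state <= S0;
            elsif input = '1' then
                case state is
                    when S0 => state <= S1;
                    when S1 => state <= S2;
                    when S2 => state <= S0;
                end case;
            end if;
        end if;
    end process;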
process (state)
begin
case state is
when S0 => output <= '0';
when S1 => output <= '1';
when S2 => output <= '0';
end case;
end process;
end RTL;
In this example, we define a Moore FSM entity with clock (clk), reset (reset), input (input), and output
(output) ports. The architecture comprises two processes: one for state transition and another for
output generation based on the current state. The FSM transitions between states based on input
conditions and generates outputs accordingly.
Timing Constraints
Timing constraints are essential for ensuring correct operation and timing closure of digital designs.
They define the timing requirements and constraints for signals in the design, guiding the synthesis
and place-and-route processes to meet timing objectives.
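As an illustration, the clock constraint described below would be written in Vivado's XDC format roughly as follows (XDC periods are expressed in nanoseconds; the port name clk matches the description):
create_clock -name clk -period 10.000 [get_ports clk]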
This timing constraint specifies a clock signal named clk with a period of 10 units of time. It guides
the synthesis and place-and-route tools to optimize the design to meet this timing requirement,
ensuring proper operation of synchronous elements in the design. A large number of timing
constraints exist and are very important in complex designs.
It is important to note that methods and syntax for timing constraints are largely specific to different
vendors. Take a look at UG903 Vivado Design Suite User Guide: Using Constraints for AMD FPGAs.
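Pipelining
The following sketch of the pipelined adder uses the sum_1, sum_2, and result names from the description below; the operands a through d, their widths, and the use of IEEE.NUMERIC_STD are assumptions.
-- Assumed declarations (unsigned types from IEEE.NUMERIC_STD)
signal a, b, c, d           : unsigned(7 downto 0);
signal sum_1, sum_2, result : unsigned(9 downto 0);

-- In the architecture body:
process (clk)
begin
    if rising_edge(clk) then
        -- Stage 1: input addition
        sum_1 <= resize(a, 10) + resize(b, 10);
        -- Stage 2: intermediate addition
        sum_2 <= sum_1 + resize(c, 10);
        -- Stage 3: final result
        result <= sum_2 + resize(d, 10);
    end if;
end process;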
In this example, we implement a pipelined adder with three stages: input addition (sum_1),
intermediate addition (sum_2), and final result (result). Each stage operates on the rising edge of the
clock signal (clk), enabling concurrent execution of multiple additions and improving overall system
throughput.
Power Optimization
Power optimization techniques aim to minimize power consumption while maintaining performance
and functionality in digital designs. They include strategies such as clock gating, voltage scaling, and
resource optimization to achieve power-efficient designs.
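A minimal sketch of the gated clock described below, using the clk, enable, and clk_gated names from the description; it is shown purely for illustration since, as noted afterwards, AMD recommends clock enables rather than gating the clock path:
-- clk_gated toggles only while enable is high
clk_gated <= clk and enable;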
In this example, we implement clock gating to selectively enable or disable the clock signal
(clk_gated) based on the enable signal. By gating the clock when it is not needed, power
consumption is reduced, improving overall power efficiency in the design.
AMD devices incorporate dedicated clock networks designed to offer large-fanout, low-skew
clocking resources. The inclusion of fine-grained clock gating techniques in HDL code can impair
functionality and hinder the effective utilization of these dedicated clocking resources.
Consequently, AMD advises against implementing clock gating constructs in the clock path when
writing HDL for their devices. Instead, it is recommended to manage clocking by employing coding
techniques that infer clock enables to deactivate sections of the design, whether for functionality or
power optimization purposes. (UltraFast Design Methodology Guide for FPGAs and SoCs (UG949)).
In VHDL, optimizing for size involves implementing strategies to reduce the hardware resources
required by the design. One approach is to adopt a modular design methodology, breaking down the
system into smaller, reusable modules. This not only enhances organization but also reduces redundancy, leading to a more compact overall design. Parameterization of modules allows for
configurability, enabling the reuse of modules across different contexts and mitigating the need for
multiple similar modules, thus further optimizing size. Moreover, careful consideration of data types
based on the required range and precision of signals or variables can contribute to size reduction, as
employing smaller data types where feasible diminishes the overall design footprint. Efficiency in
VHDL code is paramount; writing code that is concise and free of unnecessary operations or redundancy aids in minimizing design size.
Additionally, eliminating extra and unneeded signals and variables, streamlining state machines,
optimizing design hierarchy, and fostering resource sharing between modules all play pivotal roles in
reducing the size of VHDL designs. Ensuring sequential logic elements are properly clocked and
avoiding the use of latches further aids in size optimization. Lastly, leveraging synthesis optimization
options provided by synthesis tools can automate certain optimization processes, aiding in the
quest for a compact VHDL design while still meeting functionality and performance criteria.
FPGA verification ensures that the FPGA design meets functional requirements, operates correctly
under various conditions, and meets timing constraints before it is manufactured and deployed in
the target application.
Simulation Models: These models serve as abstract representations of the FPGA design, utilized
during simulations to emulate the behavior of the actual hardware. Simulation models facilitate the
prediction of design performance under various operational scenarios.
Testbenches: Testbenches are HDL codes designed to provide a controlled environment for
stimulating and verifying the design. They encompass stimulus generation (input signals), the
instantiation of the design under test (DUT), and mechanisms to compare the outputs against
expected results. This is illustrated graphically in Figure 25.
Functional Simulation: This type of simulation focuses on verifying the logical correctness of the
design, ensuring that it performs the intended operations without considering timing constraints.
ModelSim
ModelSim is a widely used HDL simulator that supports both functional and timing simulations,
accommodating VHDL, Verilog, and SystemVerilog.
Features:
1. Advanced debugging capabilities, including waveform viewing, breakpoints, and signal tracing.
2. Integration with numerous FPGA design tools and environments.
3. Comprehensive support for multiple HDL languages.
4. Efficient handling of large designs and complex testbenches.
Vivado Simulator (XSim)
Features:
1. Seamless integration with the Vivado Design Suite, providing a unified workflow from design
entry to simulation.
2. Support for mixed-language simulations, including VHDL, Verilog, and SystemVerilog.
3. A rich set of debugging tools, such as waveform viewers and logic analyzers.
4. Capabilities for both functional and timing simulations, ensuring thorough verification.
5. Free of Charge
ISim (ISE Simulator)
Features:
1. Support for VHDL and Verilog simulations.
2. Integration within the ISE design suite, facilitating a smooth transition from design to
simulation.
3. Suitable for both functional and timing simulations.
4. Basic debugging tools, though less advanced compared to Vivado.
1. Design Entry: The FPGA design is created using Vivado or ISE, involving HDL code writing, IP
core integration, and constraint definition.
2. Testbench Creation: A testbench is developed to apply stimulus and verify the outputs of the
design. The testbench is typically written in the same HDL as the design.
3. Simulation Setup: The simulation environment is configured in Vivado or ISE, which includes
selecting the appropriate simulator (Vivado Simulator or ISim), specifying the testbench, and
setting up simulation parameters.
4. Running Simulations:
• Functional Simulation: Initial simulations are conducted to verify the logical correctness of
the design. Functional errors are debugged using waveform viewers and other tools.
• Timing Simulation: Post-synthesis and implementation, timing simulations are performed to
ensure the design meets timing constraints. This step verifies the design's correctness under
real-world timing conditions.
5. Debugging and Verification: Debugging tools provided by the simulator are used to trace
signals, set breakpoints, and inspect waveforms. The design and testbench are iteratively
refined to resolve issues.
6. Validation: Upon successful simulation, the design proceeds to further stages such as
synthesis, implementation, and eventual hardware testing on the FPGA.
Hardware/Software Cosimulation
Hardware/software cosimulation is an essential technique in the design and verification of complex
systems, particularly for AMD devices. This approach integrates the simulation of hardware
components with software execution, ensuring both interact correctly and perform optimally within
a unified environment. Tools like the AMD Vivado Design Suite, which includes the Vivado Simulator
and System Generator, facilitate this process by allowing for concurrent simulation of HDL designs
and embedded software. Additionally, the AMD Vitis Unified Software Platform and Vitis Model
Composer enable comprehensive cosimulation by providing environments where hardware
accelerators can be simulated alongside software. This methodology emphasizes a co-design
approach, where hardware and software are developed concurrently with iterative testing to ensure
optimal integration. Proper partitioning of tasks between hardware and software is crucial, with
high-performance, parallel tasks typically assigned to hardware, and control-oriented, complex
algorithms handled by software. This integrated approach helps in identifying and addressing issues
early in the design process, thereby reducing development time and costs.
Testbench Development
Testbench development is a fundamental aspect of the verification process in VHDL design. A
testbench provides a controlled environment to apply stimuli to the DUT and observe its behaviour,
ensuring that the design meets its specifications. This review explores the key components and
methodology of developing testbenches in VHDL, along with a simplified example. The key
components of a testbench in VHDL are given below.
Architecture Body: This section contains the actual testbench implementation. It includes signal
declarations, instance of the DUT, stimulus generation, and output checking mechanisms.
Signal Declarations: Signals are used to connect the DUT and to generate stimulus inputs and
capture outputs.
Instance of the DUT: The DUT is instantiated within the testbench architecture, connecting internal
signals to its ports.
Stimulus Generation: This involves creating the necessary input signals to exercise the DUT. Stimuli
can be applied using concurrent statements (processes) that generate waveforms.
Plan Test Cases: Develop a comprehensive set of test cases to cover all possible scenarios, including
normal operation, boundary conditions, and erroneous inputs.
Write the Testbench: Implement the testbench in VHDL, ensuring that it accurately reflects the
planned test cases. This includes creating processes for stimulus generation and output checking.
Run Simulations: Use a VHDL simulator to execute the testbench and observe the behaviour of the
DUT. Record the simulation results for analysis.
Analyse Results: Compare the observed behaviour against expected outcomes. Identify and debug
any discrepancies to ensure the DUT operates correctly under all tested conditions.
DUT:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity full_adder_vhdl_code is
Port ( A : in STD_LOGIC;
B : in STD_LOGIC;
Cin : in STD_LOGIC;
S : out STD_LOGIC;
Cout : out STD_LOGIC);
end full_adder_vhdl_code;
architecture gate_level of full_adder_vhdl_code is
begin
-- Sum and carry of a one-bit full adder
S <= A xor B xor Cin;
Cout <= (A and B) or (Cin and (A xor B));
end gate_level;
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity full_adder_tb is
end full_adder_tb;
architecture behavior of full_adder_tb is
-- Component declaration for the design under test
component full_adder_vhdl_code
Port ( A : in STD_LOGIC;
B : in STD_LOGIC;
Cin : in STD_LOGIC;
S : out STD_LOGIC;
Cout : out STD_LOGIC);
end component;
-- Signals used to drive the DUT inputs and observe its outputs
signal A, B, Cin : STD_LOGIC := '0';
signal S, Cout : STD_LOGIC;
begin
-- Instantiate the Unit Under Test (UUT)
uut: full_adder_vhdl_code
Port map (
A => A,
B => B,
Cin => Cin,
S => S,
Cout => Cout
);
-- Stimulus process
stim_proc: process
begin
-- Test case 1: 0 + 0 + 0
A <= '0'; B <= '0'; Cin <= '0';
wait for 10 ns;
assert (S = '0' and Cout = '0') report "Test case 1 failed"
severity error;
-- Test case 2: 0 + 0 + 1
A <= '0'; B <= '0'; Cin <= '1';
wait for 10 ns;
assert (S = '1' and Cout = '0') report "Test case 2 failed"
severity error;
-- Test case 3: 0 + 1 + 0
A <= '0'; B <= '1'; Cin <= '0';
wait for 10 ns;
assert (S = '1' and Cout = '0') report "Test case 3 failed"
severity error;
-- Test case 4: 0 + 1 + 1
A <= '0'; B <= '1'; Cin <= '1';
wait for 10 ns;
assert (S = '0' and Cout = '1') report "Test case 4 failed"
severity error;
-- Remaining input combinations would be exercised in the same way
wait;
end process;
end behavior;
The AMD Integrated Logic Analyzer (ILA) is embedded within the FPGA design and offers deep visibility into the internal state of
the FPGA without needing to route signals to external pins. The ILA core can be instantiated within
the FPGA design, in fact multiple ILAs can be instantiated in the same design, where it can probe and
capture internal signal states based on user-defined trigger conditions. The captured data is then
analysed using the Vivado toolset, which provides a graphical interface for viewing and interpreting
the signal waveforms. This in-situ debugging capability is especially valuable for complex designs
where external access to internal signals is limited or impractical.
One of the key features of the AMD ILA is its configurability. Users can specify which signals to
monitor, set trigger conditions, and determine the depth of the capture memory. This allows for
tailored debugging sessions focused on specific areas of interest within the design. The ILA
supports a wide range of triggering options, including basic edge triggers, logical combinations of
signals, and even sequential triggers, enabling the capture of complex scenarios that may lead to
bugs or unexpected behaviour. Moreover, the ILA core can operate at the full speed of the FPGA,
ensuring that high-speed signals are accurately captured and analysed.
The integration of the ILA within the Vivado Design Suite further enhances its utility. Vivado provides
a seamless workflow for inserting the ILA core into the design, compiling the FPGA configuration,
and subsequently analysing the captured data. The tool's interface allows for real-time interaction
with the running design, enabling users to adjust trigger conditions and probe settings on-the-fly.
This dynamic capability is crucial for iterative debugging and refinement, allowing developers to
quickly home in on issues and verify fixes without lengthy design re-implementations. The AMD ILA is documented in the user guide UG936; chapter 10 contains a very interesting example of this flow.
Integrated Logic Analyzers (ILAs) can be built into intellectual property (IP) cores, like the video
converter IPs provided by Digilent, such as dvi2rgb or rgb2dvi. These ILAs use a generic and a
generate statement, or similar methods, allowing users to easily enable them by ticking a box in the
IP configuration. This adds ILAs to important signals, such as phase-locked loop (PLL) locks, making
it easier to monitor and debug the system.
IP (Intellectual Property) cores in FPGAs are pre-designed, reusable blocks of logic or functions that
simplify and expedite the FPGA design process. These cores encapsulate specific functionalities,
allowing designers to incorporate complex operations without having to design them from scratch.
The use of IP cores not only accelerates development but also reduces cost and risk by leveraging
proven and verified components. This approach is especially beneficial in modern FPGA
development, where time-to-market pressures and design complexities continue to increase.
• Types of IP Cores
• AXI Interfaces
• Licensing and Integration
• Design and Reuse Strategies
• High-Level Synthesis (HLS)
Types of IP Cores
Basic Functional Cores are the fundamental building blocks used in FPGA designs. These include
simple arithmetic operations like adders, multipliers, and accumulators, as well as essential
components like counters, shift registers, and logical gates. These cores provide the basic
functionalities required in almost every digital system, serving as the foundation upon which more
complex systems are built. By utilizing these pre-verified components, designers can focus on
higher-level design and system integration tasks, rather than spending time on elementary logic
design. Additionally, infrastructure like AXI interconnects and processor subsystem resets further
support the efficient integration and operation of these cores. The Clocking Wizard is another basic core that simplifies the configuration of complex clocking primitives.
Processor Cores provide embedded processing capabilities within FPGA designs. These can be soft
processors like the AMD MicroBlaze, which are implemented using the FPGA's programmable logic,
or hard processors like the ARM cores embedded in AMD Zynq SoCs. Soft processors offer flexibility,
allowing customization to specific application needs, while hard processors provide higher
performance and efficiency. Processor cores enable the FPGA to execute software programs,
making them suitable for complex applications that require both hardware acceleration and
software programmability.
In summary, IP cores are a vital component of FPGA design, offering reusable, pre-verified building
blocks that enhance development efficiency and reduce time-to-market. By leveraging a wide range
of available IP cores—from basic functional units to complex communication and processing
systems—designers can create sophisticated and robust FPGA-based applications more effectively.
AXI4-Lite
AXI4-Lite is a simplified subset of the AXI4 protocol, optimized for scenarios where simplicity and
minimal resource usage are more critical than high data throughput. It is designed primarily for
accessing control and status registers in peripherals. Unlike AXI4, AXI4-Lite supports only single
data transfers per address, which reduces complexity but limits throughput. This streamlined
approach results in reduced resource utilization, making AXI4-Lite ideal for low-bandwidth
interfaces and control paths within an SoC. The ease of implementation associated with AXI4-Lite
allows designers to quickly and efficiently integrate simple peripherals and control logic into their
FPGA designs without the overhead of managing complex data transfers.
AXI4-Stream
AXI4-Stream is specialized for high-speed, streaming data applications, where continuous,
high-bandwidth data flow is essential without the need for addressing overhead. Unlike AXI4 and
AXI4-Lite, AXI4-Stream eliminates the address phase, simplifying the protocol and reducing latency,
which is critical for applications such as video processing, data acquisition, and network data
streams. The protocol supports flexible data widths, enabling efficient use of available bandwidth
and accommodating various data formats. Additionally, AXI4-Stream incorporates flow control
mechanisms, allowing the receiver to manage data flow effectively and prevent buffer overflows.
This makes AXI4-Stream particularly suitable for applications requiring real-time data streaming and
high-throughput data processing. Furthermore, AXI4-Stream is significantly simpler, making it an
excellent starting point for learning about other AXI interfaces. It relies on a straightforward
handshake mechanism, with the core protocol consisting only of data, ready, and valid signals. In
contrast, AXI4 and AXI4-Lite utilize the same handshake mechanism but with multiple channels,
interdependencies between channels, and various sideband signals, adding complexity to their
protocols.
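As a sketch of the valid/ready handshake at the heart of AXI4-Stream, the following hypothetical module streams an incrementing 8-bit sample; the entity name and widths are illustrative, and the port names merely follow the AXI4-Stream convention.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity axis_counter_src is
    Port (
        clk           : in  STD_LOGIC;
        reset         : in  STD_LOGIC;
        m_axis_tdata  : out STD_LOGIC_VECTOR(7 downto 0);
        m_axis_tvalid : out STD_LOGIC;
        m_axis_tready : in  STD_LOGIC
    );
end axis_counter_src;

architecture RTL of axis_counter_src is
    signal sample : unsigned(7 downto 0) := (others => '0');
    signal valid  : STD_LOGIC := '0';
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                sample <= (others => '0');
                valid  <= '0';
            else
                -- This toy source always has data to offer
                valid <= '1';
                -- A beat transfers only in a cycle where TVALID and TREADY are both high
                if valid = '1' and m_axis_tready = '1' then
                    sample <= sample + 1;
                end if;
            end if;
        end if;
    end process;

    m_axis_tdata  <= STD_LOGIC_VECTOR(sample);
    m_axis_tvalid <= valid;
end RTL;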
Integration of IP Cores
Integrating IP cores into FPGA designs is a streamlined process facilitated by modern FPGA design
tools. The first step involves selecting and configuring the required IP cores from the available
catalogue, tailoring them to meet specific design requirements such as data bus width or memory
type. FPGA design tools, like AMD’s Vivado Design Suite, include IP integrator tools that provide a
graphical interface for seamless integration. These tools enable designers to connect IP cores and
custom logic effortlessly, using drag-and-drop functionality and automated connection
suggestions, ensuring compatibility and proper signal routing. This graphical interface is often
referred to as a Block Design or Block Diagram, which visually represents the interconnected IP cores
and custom logic.
After the integration, the design is synthesized to generate a netlist, followed by simulation to verify
the integrated system's functionality. This simulation step is critical for identifying and resolving any
integration issues, ensuring that the IP cores and the overall system meet performance and
functional specifications. Once verified, the design undergoes implementation, which involves
placement and routing to optimize performance and resource usage. The final step is testing the
implemented design on actual hardware, using debugging tools like the Integrated Logic Analyzer
(ILA) to monitor internal signals and troubleshoot issues.
IP Core Libraries: Building and maintaining libraries of reusable IP cores is a fundamental approach
to design reuse. These libraries contain verified and validated functional blocks, such as processors,
memory controllers, controllers for communication interfaces, and custom logic modules. IP cores
within these libraries are typically designed to be parameterizable and configurable, allowing them
to be easily adapted to different applications and project requirements. FPGA vendors like AMD
often provide extensive IP core libraries as part of their development tools, supplemented by
third-party and open-source offerings. Similarly, board vendors and vendors of other chips commonly used alongside FPGAs (for example, Analog Devices ADCs) also provide libraries.
Modular Design Approach: Adopting a modular design methodology involves breaking down
complex systems into smaller, self-contained modules that can be independently developed, tested,
and reused. Each module encapsulates a specific functionality or feature, with well-defined
interfaces for interaction with other modules. This modular approach promotes design scalability,
maintainability, and reusability, as modules can be easily integrated, replaced, or modified without
affecting the overall system architecture.
Design Templates and Frameworks: Design templates and frameworks provide reusable structures,
architectures, and design patterns tailored to specific application domains or design methodologies.
These templates encapsulate best practices, design guidelines, and implementation
methodologies, enabling designers to jumpstart their projects and streamline the development
process. Templates may include predefined configurations, scripts, and constraints to expedite the
setup and implementation of common FPGA designs, such as signal processing algorithms or digital
signal processing (DSP) applications.
Alternatively, AMD Model Composer is a model-based design tool that works alongside the AMD
Vivado Design Suite, aimed at accelerating the development of complex signal processing algorithms
and models for FPGA implementation. It offers a comprehensive environment for algorithm exploration, modelling, and
verification, allowing designers to seamlessly transition from high-level algorithm development to
FPGA implementation.
With AMD Model Composer, engineers can develop and simulate signal processing algorithms using
MATLAB or Simulink, industry-standard tools for algorithm development and simulation. The tool
provides extensive support for FPGA-specific optimizations and constraints, enabling designers to
achieve optimal performance and resource utilization in their FPGA implementations. By integrating
seamlessly with Vivado HLS and the Vivado Design Suite, AMD Model Composer empowers
designers to rapidly prototype, refine, and deploy sophisticated signal processing algorithms on
AMD FPGAs, significantly reducing time-to-market and accelerating innovation in domains such as
wireless communication, digital signal processing, and image processing. More information on
AMD’s High-Level Design tools can be found on AMD’s website.
[Figure: Block diagram of a No-OS hardware platform built around a Processing System and axi_ad9361 transmit/receive cores, with modulator/demodulator blocks, an I2S receiver and transmitter, sample and bit formatters, and data FIFOs (util_wfifo).]
The main reason for using FPGAs was the need for high-speed signal processing; the parallelism
that FPGAs provide helps greatly here. The ability to rapidly tweak an algorithm running at
hardware speeds, together with the design modularity that FPGAs offer, made the decision to go
with FPGAs an easy one.
Input filtering and averaging in an oscilloscope acquisition chain (Analog Discovery 3, most Analog
Discovery Pro devices) help to reduce noise and improve signal analysis. Hardware and software
filters are available within the Scope instrument.
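As a rough idea of how such averaging can be realised in FPGA fabric (this is a generic sketch, not the implementation used in the devices mentioned above), a moving-average filter over a power-of-two window reduces the division to a simple shift:

// Generic moving-average filter sketch: averages the last 2^LOG2_WINDOW
// samples, so the division reduces to a right shift.
module moving_average #(
    parameter integer WIDTH       = 14,
    parameter integer LOG2_WINDOW = 3   // window of 8 samples
) (
    input  wire             clk,
    input  wire             rst,
    input  wire [WIDTH-1:0] sample_in,
    output reg  [WIDTH-1:0] average_out
);

    localparam integer WINDOW = 1 << LOG2_WINDOW;

    reg [WIDTH-1:0]             window_mem [0:WINDOW-1];
    reg [LOG2_WINDOW-1:0]       wr_ptr;
    reg [WIDTH+LOG2_WINDOW-1:0] acc;       // running sum of the window

    // Note: in a real design the window memory would also be cleared at reset,
    // or the first WINDOW output samples would be ignored.
    always @(posedge clk) begin
        if (rst) begin
            wr_ptr      <= 0;
            acc         <= 0;
            average_out <= 0;
        end else begin
            // Add the newest sample and subtract the oldest one leaving the window.
            acc                <= acc + sample_in - window_mem[wr_ptr];
            window_mem[wr_ptr] <= sample_in;
            wr_ptr             <= wr_ptr + 1;
            average_out        <= acc >> LOG2_WINDOW;
        end
    end

endmodule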
Designs like this often require substantial additional effort to actually put that dependability and
safety in place, using design techniques such as keeping three copies of each safety-critical register
so that single-event upsets, such as a cosmic ray flipping a bit in a satellite, can be detected and
recovered from quickly.
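The technique described above is commonly known as triple modular redundancy (TMR). The minimal Verilog sketch below, with illustrative names, shows the basic idea of three redundant register copies and a bitwise majority vote; production designs typically add scrubbing and error reporting on top of this.

// Minimal triple modular redundancy (TMR) sketch: the same state is held in
// three registers and a majority vote masks a single-event upset in any one
// of them.
module tmr_reg #(
    parameter integer WIDTH = 8
) (
    input  wire             clk,
    input  wire             rst,
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q,
    output wire             upset_detected  // any disagreement between copies
);

    reg [WIDTH-1:0] r0, r1, r2;

    always @(posedge clk) begin
        if (rst) begin
            r0 <= 0; r1 <= 0; r2 <= 0;
        end else begin
            r0 <= d; r1 <= d; r2 <= d;
        end
    end

    // Bitwise majority vote of the three copies.
    assign q = (r0 & r1) | (r1 & r2) | (r0 & r2);

    // Flag a flipped bit so it can be corrected, e.g. by reloading the voted value.
    assign upset_detected = (r0 != r1) || (r1 != r2) || (r0 != r2);

endmodule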
Software-based security has a large "attack surface," meaning many potential targets for attacks,
including:
• Operating systems
• Device drivers
• Cryptographic libraries
• Compiler optimizations and microarchitectural changes
• Depth of the software stack
• Cache and memory management
• Key management
• Buffer overflow bugs
• Incomplete control over security algorithms
Software implementations may also struggle with performance (throughput and latency) and power
consumption. Additionally, maintaining software security through continuous updates over the
system's lifetime can be very challenging and expensive. IoT devices, for example, need ongoing
updates for bug fixes throughout their lifecycle, increasing the total cost of ownership.
Recent security concerns have also emerged regarding the underlying processor architecture.
Assumptions about the inherent security of processors have been questioned due to vulnerabilities
rooted in performance optimizations, such as the Spectre and Meltdown speculative-execution
attacks. Though many issues have been patched, new vulnerabilities may still arise.
Given these challenges, there is a growing trend towards hardware-based security solutions,
particularly using FPGAs, which offer greater control and security.
FPGA Reddit Community: A subreddit dedicated to FPGA technology and discussions. You can find
it at https://www.reddit.com/r/FPGA/.
FPGA Central Forums: An online community focused on FPGA discussions, projects, and resources.
You can visit the forums at https://www.fpgacentral.com/forum.
AMD Community Forums: AMD's official community forums where you can find support,
discussions, and resources related to AMD/Xilinx FPGAs and tools. You can access it at
https://support.xilinx.com/.
Altera Forums (now Intel FPGA Forums): Intel's official community forums for discussions on Intel
(formerly Altera) FPGAs and development tools. You can find the forums at
https://www.intel.com/content/www/us/en/programmable/support/support-resources/support-centers/support-community.html.
VHDL Reddit Community: A subreddit for VHDL programming enthusiasts. You can join discussions
on VHDL at https://www.reddit.com/r/VHDL/.
FPGA Developer Forum: A platform for FPGA developers to exchange ideas, ask questions, and
share knowledge about FPGA design and development. You can visit the forum at
https://www.fpgadeveloper.com/forum.
VHDL Cafe Forum: An online forum dedicated to VHDL programming language discussions,
tutorials, and projects. You can participate in VHDL discussions at http://www.vhdlcafe.com/forum.
FPGA Groups on LinkedIn: Join FPGA-related groups on LinkedIn such as "FPGA Design," "FPGA
Engineers," and "VHDL Developers Forum" to network with professionals, share insights, and stay
updated on industry trends.
FPGA and VHDL Discord Channels: Explore various Discord channels dedicated to FPGA
development and VHDL programming. Joining these channels can help you connect with
enthusiasts and professionals in real-time discussions.
FPGA World Conference: A global conference series that brings together FPGA professionals,
researchers, and enthusiasts to share knowledge and insights on FPGA design, applications, and
advancements. Check their website for upcoming events: https://www.fpgaworld.com/.
Embedded Systems Conference (ESC): While not specifically focused on FPGAs, ESC is a significant
event where FPGA technology often plays a crucial role in embedded system design. It's a great
place to explore the latest trends in embedded systems and FPGA integration. Check out
https://esc.embedded.com/ for more information.
Additionally, there are numerous other conferences around the world that tangentially relate to the
FPGA field. These events provide valuable opportunities for learning and networking within the
broader context of FPGA technology and its applications.