VLSI Systems for Engineers
Table of Contents
❍ Historical Perspective
❍ VLSI Design Flow
❍ Design Hierarchy
❍ Concepts of Regularity, Modularity and Locality
❍ VLSI Design Styles
❍ Introduction
❍ Fabrication Process Flow - Basic Steps
❍ The CMOS n-Well Process
❍ Advanced CMOS Fabrication Technologies
❍ Layout Design Rules
❍ Introduction
❍ CMOS Layout Design Rules
❍ CMOS Inverter Layout Design
❍ Layout of CMOS NAND and NOR Gates
❍ Complex CMOS Logic Gates
❍ Introduction
❍ The Reality with Interconnections
❍ MOSFET Capacitances
❍ Interconnect Capacitance Estimation
❍ Interconnect Resistance Estimation
❍ Introduction
❍ Notation Systems
❍ Introduction
❍ Overview of Power Consumption
❍ Low-Power Design Through Voltage Scaling
❍ Estimation and Optimization of Switching Activity
❍ Reduction of Switched Capacitance
❍ Adiabatic Logic Circuits
❍ Design Constraints
❍ Testing
❍ The Rule of Ten
❍ Terminology
❍ Failures in CMOS
❍ Combinational Logic Testing
❍ Practical Ad-Hoc DFT Guidelines
❍ Scan Design Techniques
❍ Systems Considerations
❍ Fuzzy Logic Based Control Background
❍ Integrated Implementations of Fuzzy Logic Circuits
❍ Digital Implementations of Fuzzy Logic Circuits
❍ Analog Implementations of Fuzzy Logic Circuits
❍ Mixed Digital/Analog Implementations of Fuzzy Systems
❍ CAD Automation for Fuzzy Logic Circuits Design
❍ Neural Networks Implementing Fuzzy Systems
❍ Introduction
❍ Digitization Of "TV Functions"
❍ Points Of Concern For The Design Methodology
❍ Conclusion
❍ Telecommunication Fundamentals
❍ ATM Networks
❍ Case Study: ATM Switch
❍ Case Study: ATM Transmission of Multiplexed-MPEG Streams
❍ Conclusion
❍ Bibliography
❍ General Architectures
❍ Data Path
❍ Addressing
❍ Peripherals
❍ Superscalar Architectures
❍ References
Chapter 1
INTRODUCTION TO VLSI SYSTEMS
● Historical Perspective
● VLSI Design Flow
● Design Hierarchy
● Concepts of Regularity, Modularity and Locality
● VLSI Design Styles
The electronics industry has achieved phenomenal growth over the last two decades, mainly due to the rapid advances in integration technologies and large-scale systems design - in short, due to the advent of VLSI. The number of applications of integrated circuits in high-
performance computing, telecommunications, and consumer electronics has been rising steadily, and at a very fast pace. Typically, the
required computational power (or, in other words, the intelligence) of these applications is the driving force for the fast development of
this field. Figure 1.1 gives an overview of the prominent trends in information technologies over the next few decades. The current
leading-edge technologies (such as low bit-rate video and cellular communications) already provide end-users with a certain amount of processing power and portability. This trend is expected to continue, with very important implications for VLSI and systems design. One
of the most important characteristics of information services is their increasing need for very high processing power and bandwidth (in
order to handle real-time video, for example). The other important characteristic is that the information services tend to become more
and more personalized (as opposed to collective services such as broadcasting), which means that the devices must be more intelligent
to answer individual demands, and at the same time they must be portable to allow more flexibility/mobility.
As more and more complex functions are required in various data processing and telecommunications devices, the need to integrate
these functions in a small system/package is also increasing. The level of integration as measured by the number of logic gates in a
monolithic chip has been steadily rising for almost three decades, mainly due to the rapid progress in processing technology and
interconnect technology. Table 1.1 shows the evolution of logic complexity in integrated circuits over the last three decades, and marks
the milestones of each era. Here, the numbers for circuit complexity should be interpreted only as representative examples to show the
order-of-magnitude. A logic block can contain anywhere from 10 to 100 transistors, depending on the function. State-of-the-art
examples of ULSI chips, such as the DEC Alpha or the INTEL Pentium, contain 3 to 6 million transistors.
The most important message here is that the logic complexity per chip has been (and still is) increasing exponentially. The monolithic
integration of a large number of functions on a single chip usually provides:
● less area/volume, and therefore compactness,
● less power consumption,
● fewer testing requirements at the system level,
● higher reliability, mainly due to improved on-chip interconnects,
● higher speed, due to significantly reduced interconnection length, and
● significant cost savings.
Figure-1.2: Evolution of integration density and minimum feature size, as seen in the early 1980s.
Therefore, the current trend of integration will also continue in the foreseeable future. Advances in device manufacturing technology,
and especially the steady reduction of minimum feature size (minimum length of a transistor or an interconnect realizable on chip)
support this trend. Figure 1.2 shows the history and forecast of chip complexity - and minimum feature size - over time, as seen in the
early 1980s. At that time, a minimum feature size of 0.3 microns was expected around the year 2000. The actual development of the
technology, however, has far exceeded these expectations. A minimum size of 0.25 microns was readily achievable by the year 1995. As
a direct result of this, the integration density has also exceeded previous expectations - the first 64 Mbit DRAM, and the INTEL Pentium
microprocessor chip containing more than 3 million transistors were already available by 1994, pushing the envelope of integration
density.
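Since both integration density and minimum feature size follow roughly exponential trends, a small back-of-the-envelope model can make these numbers concrete. The following Python sketch is purely illustrative: the 1985 starting point of 1.0 micron and the 13-percent-per-year shrink rate are assumed values, chosen only so that the model lands near the 0.25-micron-by-1995 milestone quoted above; real technology roadmaps are not perfectly exponential.

# Illustrative sketch (not from the text): a constant-rate scaling model for
# minimum feature size, with density growing roughly as 1/f^2.

def feature_size(year, f0_um=1.0, year0=1985, shrink_per_year=0.13):
    """Minimum feature size under a fixed fractional shrink per year."""
    return f0_um * (1.0 - shrink_per_year) ** (year - year0)

for year in (1985, 1990, 1995, 2000):
    f = feature_size(year)
    # Transistor density improves roughly as 1/f^2 when only lithography scales.
    rel_density = (feature_size(1985) / f) ** 2
    print(f"{year}: ~{f:.2f} um, ~{rel_density:.0f}x relative density")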
When comparing the integration density of integrated circuits, a clear distinction must be made between the memory chips and logic
chips. Figure 1.3 shows the level of integration over time for memory and logic chips, starting in 1970. It can be observed that in terms
of transistor count, logic chips contain significantly fewer transistors in any given year, mainly because a large share of the chip area is consumed by complex interconnects. Memory circuits are highly regular, and thus more cells can be integrated with much less area devoted to interconnects.
Figure-1.3: Level of integration over time, for memory chips and logic chips.
Generally speaking, logic chips such as microprocessor chips and digital signal processing (DSP) chips contain not only large arrays of
memory (SRAM) cells, but also many different functional units. As a result, their design complexity is considered much higher than that
of memory chips, although advanced memory chips contain some sophisticated logic functions. The design complexity of logic chips
increases almost exponentially with the number of transistors to be integrated. This translates into an increase in the design cycle time, i.e., the period from the start of chip development until mask-tape delivery. However, in order to make the
best use of the current technology, the chip development time has to be short enough to allow the maturing of chip manufacturing and
timely delivery to customers. As a result, the level of actual logic integration tends to fall short of the integration level achievable with
the current processing technology. Sophisticated computer-aided design (CAD) tools and methodologies are developed and applied in
order to manage the rapidly increasing design complexity.
The design process, at various levels, is usually evolutionary in nature. It starts with a given set of requirements. Initial design is
developed and tested against the requirements. When requirements are not met, the design has to be improved. If such improvement is
either not possible or too costly, then the revision of requirements and its impact analysis must be considered. The Y-chart (first
introduced by D. Gajski) shown in Fig. 1.4 illustrates a design flow for most logic chips, using design activities on three different axes:
● behavioral domain,
● structural domain,
● geometrical layout domain.
The design flow starts from the algorithm that describes the behavior of the target chip. The corresponding architecture of the processor
is first defined. It is mapped onto the chip surface by floorplanning. The next design evolution in the behavioral domain defines finite
state machines (FSMs) which are structurally implemented with functional modules such as registers and arithmetic logic units (ALUs).
These modules are then geometrically placed onto the chip surface using CAD tools for automatic module placement followed by
routing, with a goal of minimizing the interconnects area and signal delays. The third evolution starts with a behavioral module
description. Individual modules are then implemented with leaf cells. At this stage the chip is described in terms of logic gates (leaf
cells), which can be placed and interconnected by using a cell placement & routing program. The last evolution involves a detailed
Boolean description of leaf cells followed by a transistor level implementation of leaf cells and mask generation. In standard-cell based
design, leaf cells are already pre-designed and stored in a library for logic design use.
Figure 1.5 provides a more simplified view of the VLSI design flow, taking into account the various representations, or abstractions of
design - behavioral, logic, circuit and mask layout. Note that the verification of design plays a very important role in every step during
this process. The failure to properly verify a design in its early phases typically causes significant and expensive re-design at a later
stage, which ultimately increases the time-to-market.
Although the design process has been described in linear fashion for simplicity, in reality there are many iterations back and forth,
especially between any two neighboring steps, and occasionally even between remotely separated pairs. Although a top-down design flow provides
an excellent design process control, in reality, there is no truly unidirectional top-down design flow. Both top-down and bottom-up
approaches have to be combined. For instance, if a chip designer defines an architecture without a close estimate of the corresponding chip area, it is very likely that the resulting chip layout will exceed the area limit of the available technology. In such a case, in order to
fit the architecture into the allowable chip area, some functions may have to be removed and the design process must be repeated. Such
changes may require significant modification of the original requirements. Thus, it is very important to feed forward low-level
information to higher levels (bottom up) as early as possible.
In the following, we will examine design methodologies and structured approaches which have been developed over the years to deal
with both complex hardware and software projects. Regardless of the actual size of the project, the basic principles of structured design
will improve the prospects of success. Some of the classical techniques for reducing the complexity of IC design are hierarchy, regularity, modularity and locality.
The use of hierarchy, or the “divide and conquer” technique, involves dividing a module into sub-modules and then repeating this operation
on the sub-modules until the complexity of the smaller parts becomes manageable. This approach is very similar to the software case
where large programs are split into smaller and smaller sections until simple subroutines, with well-defined functions and interfaces, can
be written. In Section 1.2, we have seen that the design of a VLSI chip can be represented in three domains. Correspondingly, a
hierarchy structure can be described in each domain separately. However, it is important for the simplicity of design that the hierarchies
in different domains can be mapped into each other easily.
As an example of structural hierarchy, Fig. 1.6 shows the structural decomposition of a CMOS four-bit adder into its components. The
adder can be decomposed progressively into one-bit adders, separate carry and sum circuits, and finally, into individual logic gates. At this lower level of the hierarchy, the design of a simple circuit realizing a well-defined Boolean function is much easier to handle than at the higher levels of the hierarchy.
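As an aside, the same hierarchical decomposition can be expressed in a few lines of code. The Python sketch below mirrors the three levels of Fig. 1.6 - elementary gates, one-bit adders, and the four-bit adder - purely as a behavioral illustration; the function names are our own and not part of the original figure.

def sum_bit(a, b, cin):            # leaf level: sum circuit (XOR gates)
    return a ^ b ^ cin

def carry_bit(a, b, cin):          # leaf level: carry circuit (AND/OR gates)
    return (a & b) | (a & cin) | (b & cin)

def full_adder(a, b, cin):         # middle level: one-bit adder
    return sum_bit(a, b, cin), carry_bit(a, b, cin)

def adder4(a, b, cin=0):           # top level: four-bit ripple-carry adder
    s = 0
    for i in range(4):             # one-bit adders cascaded via the carry
        si, cin = full_adder((a >> i) & 1, (b >> i) & 1, cin)
        s |= si << i
    return s, cin                  # 4-bit sum and carry-out

assert adder4(0b0111, 0b0101) == (0b1100, 0)   # 7 + 5 = 12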
In the physical domain, partitioning a complex system into its various functional blocks will provide valuable guidance for the actual
realization of these blocks on chip. Obviously, the approximate shape and size (area) of each sub-module should be estimated in order to
provide a useful floorplan. Figure 1.7 shows the hierarchical decomposition of a four-bit adder in physical description (geometrical
layout) domain, resulting in a simple floorplan. This physical view describes the external geometry of the adder, the locations of input
and output pins, and how pin locations allow some signals (in this case the carry signals) to be transferred from one sub-block to the
other without external routing.
Figure-1.6: Structural decomposition of a four-bit adder circuit, showing the hierarchy down to gate level.
Figure-1.8: Layout of a 16-bit adder, and the components (sub-blocks) of its physical hierarchy.
At lower levels of the physical hierarchy, the internal mask layout of each adder cell defines the locations and the connections of each transistor and wire. Figure 1.8 shows the full-custom layout
of a 16-bit dynamic CMOS adder, and the sub-modules that describe the lower levels of its physical hierarchy. Here, the 16-bit adder
consists of a cascade connection of four 4-bit adders, and each 4-bit adder can again be decomposed into its functional blocks such as
the Manchester chain, carry/propagate circuits and the output buffers. Finally, Fig. 1.9 and Fig. 1.10 show the structural hierarchy and
the physical layout of a simple triangle generator chip, respectively. Note that there is a corresponding physical description for every
module in the structural hierarchy, i.e., the components of the physical view closely match this structural view.
The hierarchical design approach reduces the design complexity by dividing the large system into several sub-modules. Usually, other
design concepts and design approaches are also needed to simplify the process. Regularity means that the hierarchical decomposition of
a large system should result in not only simple, but also similar blocks, as much as possible. A good example of regularity is the design
of array structures consisting of identical cells - such as a parallel multiplication array. Regularity can exist at all levels of abstraction:
At the transistor level, uniformly sized transistors simplify the design. At the logic level, identical gate structures can be used, etc.
Figure 1.11 shows regular circuit-level designs of a 2-1 MUX (multiplexer), a D-type edge-triggered flip-flop, and a one-bit full adder.
Note that all of these circuits were designed by using inverters and tri-state buffers only. If the designer has a small library of well-
defined and well-characterized basic building blocks, a number of different functions can be constructed by using this principle.
Regularity usually reduces the number of different modules that need to be designed and verified, at all levels of abstraction.
Figure-1.11: Regular design of a 2-1 MUX, a DFF and an adder, using inverters and tri-state buffers.
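To make the regularity idea concrete, the following Python sketch models a 2-1 MUX built from two tri-state buffers driving a shared output node, with an inverter generating the complementary select signal. It captures only the logical behavior of such a structure; Fig. 1.11 itself shows transistor-level circuits, and the helper names here are hypothetical.

def inverter(x):
    return 1 - x

def tristate(enable, x):
    """Return the driven value, or None when the output floats (Hi-Z)."""
    return x if enable else None

def mux2(sel, a, b):
    # Exactly one buffer drives the shared output node at any time.
    drivers = [tristate(inverter(sel), a), tristate(sel, b)]
    driven = [v for v in drivers if v is not None]
    assert len(driven) == 1, "bus contention or floating node"
    return driven[0]

assert mux2(0, a=1, b=0) == 1   # sel = 0 selects input a
assert mux2(1, a=1, b=0) == 0   # sel = 1 selects input b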
Modularity in design means that the various functional blocks which make up the larger system must have well-defined functions and
interfaces. Modularity allows each block or module to be designed relatively independently of the others, since there is no
ambiguity about the function and the signal interface of these blocks. All of the blocks can be combined with ease at the end of the
design process, to form the large system. The concept of modularity enables the parallelisation of the design process. It also allows the
use of generic modules in various designs - the well-defined functionality and signal interface allow plug-and-play design.
By defining well-characterized interfaces for each module in the system, we effectively ensure that the internals of each module become
unimportant to the exterior modules. Internal details remain at the local level. The concept of locality also ensures that connections are
mostly between neighboring modules, avoiding long-distance connections as much as possible. This last point is extremely important
for avoiding excessive interconnect delays. Time-critical operations should be performed locally, without the need to access distant
modules or signals. If necessary, the replication of some logic may solve this problem in large system architectures.
Several design styles can be considered for chip implementation of specified algorithms or logic functions. Each design style has its own
merits and shortcomings, and thus a proper choice has to be made by designers in order to provide the functionality at low cost.
Fully fabricated FPGA chips containing thousands of logic gates or even more, with programmable interconnects, are available to users
for their custom hardware programming to realize desired functionality. This design style provides a means for fast prototyping and also
for cost-effective chip design, especially for low-volume applications. A typical field programmable gate array (FPGA) chip consists of
I/O buffers, an array of configurable logic blocks (CLBs), and programmable interconnect structures. The programming of the
interconnects is implemented by programming of RAM cells whose output terminals are connected to the gates of MOS pass transistors.
A general architecture of FPGA from XILINX is shown in Fig. 1.12. A more detailed view showing the locations of switch matrices
used for interconnect routing is given in Fig. 1.13.
A simple CLB (model XC2000 from XILINX) is shown in Fig. 1.14. It consists of four signal input terminals (A, B, C, D), a clock
signal terminal, user-programmable multiplexers, an SR-latch, and a look-up table (LUT). The LUT is a digital memory that stores the
truth table of the Boolean function. Thus, it can generate any function of up to four variables or any two functions of three variables.
The control terminals of multiplexers are not shown explicitly in Fig. 1.14.
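The LUT principle is easy to demonstrate in code: a four-input LUT is nothing more than a 16-entry memory addressed by the input bits. The Python sketch below uses an arbitrary example function (a four-input XOR); actual CLB programming formats are vendor-specific.

def make_lut4(func):
    """Precompute the 16-entry truth table for a 4-variable Boolean function."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]

def lut4(table, a, b, c, d):
    """Evaluate the LUT: the input bits simply form the memory address."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]

xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
assert lut4(xor4, 1, 0, 1, 1) == 1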
The CLB is configured such that many different logic functions can be realized by programming its array. More sophisticated CLBs
have also been introduced to map complex functions. The typical design flow of an FPGA chip starts with the behavioral description of
its functionality, using a hardware description language such as VHDL. The synthesized architecture is then technology-mapped (or
partitioned) into circuits or logic cells. At this stage, the chip design is completely described in terms of available logic cells. Next, the
placement and routing step assigns individual logic cells to FPGA sites (CLBs) and determines the routing patterns among the cells in
accordance with the netlist.
Figure-1.13: Detailed view of switch matrices and interconnection routing between CLBs.
After routing is completed, the on-chip performance of the design can be simulated and verified before downloading the design for programming of the FPGA chip. The
programming of the chip remains valid as long as the chip is powered-on, or until new programming is done. In most cases, full
utilization of the FPGA chip area is not possible - many cell sites may remain unused.
The largest advantage of FPGA-based design is the very short turn-around time, i.e., the time required from the start of the design
process until a functional chip is available. Since no physical manufacturing step is necessary for customizing the FPGA chip, a
functional sample can be obtained almost as soon as the design is mapped into a specific technology. The typical price of an FPGA chip is usually higher than that of other realization alternatives (such as gate arrays or standard cells) for the same design, but for small-volume production of ASIC chips and for fast prototyping, FPGAs offer a very valuable option.
In view of the fast prototyping capability, the gate array (GA) comes after the FPGA. While the design implementation of the FPGA
chip is done with user programming, that of the gate array is done with metal mask design and processing. Gate array implementation
requires a two-step manufacturing process: The first phase, which is based on generic (standard) masks, results in an array of
uncommitted transistors on each GA chip. These uncommitted chips can be stored for later customization, which is completed by
defining the metal interconnects between the transistors of the array (Fig. 1.15). Since the patterning of metallic interconnects is done at
the end of the chip fabrication, the turn-around time can still be short, a few days to a few weeks. Figure 1.16 shows a corner of a gate
array chip which contains bonding pads on its left and bottom edges, diodes for I/O protection, nMOS transistors and pMOS transistors
for chip output driver circuits in the neighboring areas of bonding pads, arrays of nMOS transistors and pMOS transistors, underpass
wire segments, and power and ground buses along with contact windows.
Figure 1.17 shows a magnified portion of the internal array with metal mask design (metal lines highlighted in dark) to realize a
complex logic function. Typical gate array platforms allow dedicated areas, called channels, for intercell routing as shown in Figs. 1.16
and 1.17, between rows or columns of MOS transistors. The availability of these routing channels simplifies the interconnections, even when only one metal layer is used. The interconnection patterns to realize basic logic gates can be stored in a library, which can then be used to
customize rows of uncommitted transistors according to the netlist. While most gate array platforms only contain rows of uncommitted
transistors separated by routing channels, some other platforms also offer dedicated memory (RAM) arrays to allow a higher density
where memory functions are required. Figure 1.18 shows the layout views of a conventional gate array and a gate array platform with
two dedicated memory banks.
With the use of multiple interconnect layers, the routing can be achieved over the active cell areas; thus, the routing channels can be
removed as in Sea-of-Gates (SOG) chips. Here, the entire chip surface is covered with uncommitted nMOS and pMOS transistors. As in
the gate array case, neighboring transistors can be customized using a metal mask to form basic logic gates. For intercell routing,
however, some of the uncommitted transistors must be sacrificed. This approach results in more flexibility for interconnections, and
usually in a higher density. The basic platform of a SOG chip is shown in Fig. 1.19. Figure 1.20 offers a brief comparison between the
channeled (GA) vs. the channelless (SOG) approaches.
Figure-1.17: Metal mask design to realize a complex logic function on a channeled GA platform.
Figure-1.18: Layout views of a conventional GA chip and a gate array with two memory banks.
In general, the GA chip utilization factor, as measured by the used chip area divided by the total chip area, is higher than that of the
FPGA and so is the chip speed, since more customized design can be achieved with metal mask designs. The current gate array chips
can implement as many as hundreds of thousands of logic gates.
Figure-1.20: Comparison between the channeled (GA) vs. the channelless (SOG) approaches.
Standard-cell based design is one of the most prevalent full-custom design styles, in the sense that it requires the development of a full custom mask set. The standard cell is also called a polycell. In this design style, all of the commonly used logic cells are developed, characterized,
and stored in a standard cell library. A typical library may contain a few hundred cells including inverters, NAND gates, NOR gates,
complex AOI, OAI gates, D-latches, and flip-flops. Each gate type can have multiple implementations to provide adequate driving
capability for different fanouts. For instance, the inverter gate can have standard size transistors, double size transistors, and quadruple
size transistors so that the chip designer can choose the proper size to achieve high circuit speed and layout density. The characterization
of each cell is done for several different categories; it typically includes
● delay time versus load capacitance,
● a circuit simulation model,
● a timing simulation model,
● a fault simulation model,
● cell data for place-and-route, and
● mask data.
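As a rough illustration of how such a library might be represented, the Python sketch below stores three inverter variants with a simple linear delay model (intrinsic delay plus drive resistance times load capacitance). All numbers are invented for illustration; real libraries characterize cells far more extensively.

CELL_LIBRARY = {
    # name      drive strength, intrinsic delay (ns), drive resistance (kOhm)
    "INV_X1": {"drive": 1, "t0_ns": 0.05, "r_kohm": 8.0},
    "INV_X2": {"drive": 2, "t0_ns": 0.06, "r_kohm": 4.0},  # double-size devices
    "INV_X4": {"drive": 4, "t0_ns": 0.08, "r_kohm": 2.0},  # quadruple-size devices
}

def cell_delay_ns(cell, load_pf):
    """Linear delay model: t = t0 + R * C_load (kOhm * pF = ns)."""
    c = CELL_LIBRARY[cell]
    return c["t0_ns"] + c["r_kohm"] * load_pf

# A larger fan-out (bigger load) favors the higher drive-strength variant:
for name in CELL_LIBRARY:
    print(name, f"{cell_delay_ns(name, load_pf=0.2):.2f} ns @ 0.2 pF")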
To enable automated placement of the cells and routing of inter-cell connections, each cell layout is designed with a fixed height, so that
a number of cells can be abutted side-by-side to form rows. The power and ground rails typically run parallel to the upper and lower
boundaries of the cell, thus, neighboring cells share a common power and ground bus. The input and output pins are located on the
upper and lower boundaries of the cell. Figure 1.21 shows the layout of a typical standard cell. Notice that the nMOS transistors are
located closer to the ground rail while the pMOS transistors are placed closer to the power rail.
Figure 1.22 shows a floorplan for standard-cell based design. Inside the I/O frame which is reserved for I/O cells, the chip area contains
rows or columns of standard cells. Between cell rows are channels for dedicated inter-cell routing. As in the case of Sea-of-Gates, with
over-the-cell routing, the channel areas can be reduced or even removed, provided that the cell rows offer sufficient routing space. The
physical design and layout of logic cells ensure that when cells are placed into rows, their heights are matched and neighboring cells can
be abutted side-by-side, which provides natural connections for power and ground lines in each row. The signal delay, noise margins,
and power consumption of each cell should also be optimized with proper sizing of transistors using circuit simulation.
If a number of cells must share the same input and/or output signals, a common signal bus structure can also be incorporated into the
standard-cell-based chip layout. Figure 1.23 shows the simplified symbolic view of a case where a signal bus has been inserted between
the rows of standard cells. Note that in this case the chip consists of two blocks, and power/ground routing must be provided from both
sides of the layout area. Standard-cell based designs may consist of several such macro-blocks, each corresponding to a specific unit of
the system architecture such as ALU, control logic, etc.
Figure-1.23: Simplified floorplan consisting of two separate blocks and a common signal bus.
After chip logic design is done using standard cells in the library, the most challenging task is to place individual cells into rows and
interconnect them in a way that meets stringent design goals in circuit speed, chip area, and power consumption. Many advanced CAD
tools for place-and-route have been developed and used to achieve such goals. Also from the chip layout, circuit models which include
interconnect parasitics can be extracted and used for timing simulation and analysis to identify timing critical paths. For timing critical
paths, proper gate sizing is often practiced to meet the timing requirements. In many VLSI chips, such as microprocessors and digital
signal processing chips, standard-cell based design is used to implement complex control logic modules. Some full-custom chips can also be implemented exclusively with standard cells.
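The gate-sizing step mentioned above can be illustrated with a toy model: estimate the delay of an extracted critical path as a sum of linear stage delays, and upsize the slowest stage until the timing target is met. The numbers, the delay model, and the halve-the-resistance sizing rule below are all hypothetical simplifications of what real timing-closure flows do.

# Each stage: (intrinsic delay ns, drive resistance kOhm, load capacitance pF)
path = [(0.05, 8.0, 0.10), (0.07, 6.0, 0.30), (0.05, 8.0, 0.15)]
TARGET_NS = 2.0

def path_delay(stages):
    return sum(t0 + r * c for t0, r, c in stages)   # kOhm * pF = ns

while path_delay(path) > TARGET_NS:
    # Upsize the stage with the largest delay: doubling the device width
    # roughly halves the drive resistance (at a cost in area and input load).
    worst = max(range(len(path)), key=lambda i: path[i][0] + path[i][1] * path[i][2])
    t0, r, c = path[worst]
    path[worst] = (t0, r / 2.0, c)

print(f"sized path delay: {path_delay(path):.2f} ns")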
Finally, Fig. 1.24 shows the detailed mask layout of a standard-cell-based chip with an uninterrupted single block of cell rows, and three
memory banks placed on one side of the chip. Notice that within the cell block, the separations between neighboring rows depend on the
number of wires in the routing channel between the cell rows. If a high interconnect density can be achieved in the routing channel, the
standard cell rows can be placed closer to each other, resulting in a smaller chip area. The availability of dedicated memory blocks also
reduces the area, since the realization of memory elements using standard cells would occupy a larger area.
Figure-1.24: Mask layout of a standard-cell-based chip with a single block of cells and three memory banks.
Although standard-cell based design is often called full-custom design, in a strict sense it is somewhat less than fully custom, since the cells are pre-designed for general use and the same cells are utilized in many different chip designs. In a truly full-custom design, the entire mask design is done anew without the use of any library. However, the development cost of such a design style is becoming
prohibitively high. Thus, the concept of design reuse is becoming popular in order to reduce design cycle time and development cost.
The most rigorous example of full-custom design is the design of a memory cell, be it static or dynamic. Since the same layout is replicated many times over, there is no practical alternative to full-custom design for high-density memory chips. For logic chip design, a good compromise can be
achieved by using a combination of different design styles on the same chip, such as standard cells, data-path cells and PLAs. In real full-
custom layout in which the geometry, orientation and placement of every transistor is done individually by the designer, design
productivity is usually very low - typically 10 to 20 transistors per day, per designer.
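A quick back-of-the-envelope calculation shows why this productivity figure makes pure full-custom design impractical for large chips. The chip size and working-day count below are assumed example values.

transistors = 3_000_000          # a Pentium-class chip (see Section 1.1)
per_designer_per_day = 15        # mid-range of the quoted 10-20 figure
workdays_per_year = 230          # assumption

person_years = transistors / per_designer_per_day / workdays_per_year
print(f"~{person_years:,.0f} person-years of layout effort")   # roughly 870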
In digital CMOS VLSI, full-custom design is rarely used due to the high labor cost. Exceptions to this include the design of high-volume
products such as memory chips, high-performance microprocessors and FPGA masters. Figure 1.25 shows the full layout of the Intel
486 microprocessor chip, which is a good example of a hybrid full-custom design. Here, one can identify four different design styles on
one chip: Memory banks (RAM cache), data-path units consisting of bit-slice cells, control circuitry mainly consisting of standard cells
and PLA blocks.
Figure-1.25: Mask layout of the Intel 486 microprocessor chip, as an example of full-custom design.
Notice: This chapter is largely based on Chapter 2 (Fabrication of MOSFETs) of the book CMOS Digital Integrated Circuit Design -
Analysis and Design by S.M. Kang and Y. Leblebici.
● Introduction
● Fabrication Process Flow - Basic Steps
● The CMOS n-Well Process
● Advanced CMOS Fabrication Technologies
● Layout Design Rules
2.1 Introduction
In this chapter, the fundamentals of MOS chip fabrication will be discussed and the major steps of the process flow will be examined. It
is not the aim of this chapter to present a detailed discussion of silicon fabrication technology, which deserves separate treatment in a
dedicated course. Rather, the emphasis will be on the general outline of the process flow and on the interaction of various processing
steps, which ultimately determine the device and the circuit performance characteristics. The following chapters show that there are very
strong links between the fabrication process, the circuit design process and the performance of the resulting chip. Hence, circuit
designers must have a working knowledge of chip fabrication to create effective designs and in order to optimize the circuits with
respect to various manufacturing parameters. Also, the circuit designer must have a clear understanding of the roles of various masks
used in the fabrication process, and how the masks are used to define various features of the devices on-chip.
The following discussion will concentrate on the well-established CMOS fabrication technology, which requires that both n-channel
(nMOS) and p-channel (pMOS) transistors be built on the same chip substrate. To accommodate both nMOS and pMOS devices, special
regions must be created in which the semiconductor type is opposite to the substrate type. These regions are called wells or tubs. A p-well is created in an n-type substrate or, alternatively, an n-well is created in a p-type substrate. In the simple n-well CMOS fabrication
technology presented here, the nMOS transistor is created in the p-type substrate, and the pMOS transistor is created in the n-well, which is built into the p-type substrate. In the twin-tub CMOS technology, additional tubs of the same type as the substrate can also be created
for device optimization.
The simplified process sequence for the fabrication of CMOS integrated circuits on a p-type silicon substrate is shown in Fig. 2.1. The
process starts with the creation of the n-well regions for pMOS transistors, by impurity implantation into the substrate. Then, a thick
oxide is grown in the regions surrounding the nMOS and pMOS active regions. The thin gate oxide is subsequently grown on the
surface through thermal oxidation. These steps are followed by the creation of n+ and p+ regions (source, drain and channel-stop
implants) and by final metallization (creation of metal interconnects).
Figure-2.1: Simplified process sequence for fabrication of the n-well CMOS integrated circuit with a single polysilicon layer, showing
only major fabrication steps.
The process flow sequence pictured in Fig. 2.1 may at first seem to be too abstract, since detailed fabrication steps are not shown. To
obtain a better understanding of the issues involved in the semiconductor fabrication process, we first have to consider some of the basic
steps in more detail.
Note that each processing step requires that certain areas are defined on chip by appropriate masks. Consequently, the integrated circuit
may be viewed as a set of patterned layers of doped silicon, polysilicon, metal and insulating silicon dioxide. In general, a layer must be
patterned before the next layer of material is applied on chip. The process used to transfer a pattern to a layer on the chip is called
lithography. Since each layer has its own distinct patterning requirements, the lithographic sequence must be repeated for every layer,
using a different mask.
To illustrate the fabrication steps involved in patterning silicon dioxide through optical lithography, let us first examine the process flow
shown in Fig. 2.2. The sequence starts with the thermal oxidation of the silicon surface, by which an oxide layer of about 1 micrometer
thickness, for example, is created on the substrate (Fig. 2.2(b)). The entire oxide surface is then covered with a layer of photoresist,
which is essentially a light-sensitive, acid-resistant organic polymer, initially insoluble in the developing solution (Fig. 2.2(c)). If the
photoresist material is exposed to ultraviolet (UV) light, the exposed areas become soluble so that they are no longer resistant to
etching solvents. To selectively expose the photoresist, we have to cover some of the areas on the surface with a mask during exposure.
Thus, when the structure with the mask on top is exposed to UV light, areas which are covered by the opaque features on the mask are
shielded. In the areas where the UV light can pass through, on the other hand, the photoresist is exposed and becomes soluble (Fig.
2.2(d)).
The type of photoresist which is initially insoluble and becomes soluble after exposure to UV light is called positive photoresist. The
process sequence shown in Fig. 2.2 uses positive photoresist. There is another type of photoresist which is initially soluble and becomes
insoluble (hardened) after exposure to UV light, called negative photoresist. If negative photoresist is used in the photolithography
process, the areas which are not shielded from the UV light by the opaque mask features become insoluble, whereas the shielded areas
can subsequently be etched away by a developing solution. Negative photoresists are more sensitive to light, but their photolithographic
resolution is not as high as that of the positive photoresists. Therefore, negative photoresists are used less commonly in the
manufacturing of high-density integrated circuits.
Following the UV exposure step, the unexposed portions of the photoresist can be removed by a solvent. Now, the silicon dioxide
regions which are not covered by hardened photoresist can be etched away either by using a chemical solvent (HF acid) or by using a
dry etch (plasma etch) process (Fig. 2.2(e)). Note that at the end of this step, we obtain an oxide window that reaches down to the silicon
surface (Fig. 2.2(f)). The remaining photoresist can now be stripped from the silicon dioxide surface by using another solvent, leaving
the patterned silicon dioxide feature on the surface as shown in Fig. 2.2(g).
The sequence of process steps illustrated in detail in Fig. 2.2 actually accomplishes a single pattern transfer onto the silicon dioxide
surface, as shown in Fig. 2.3. The fabrication of semiconductor devices requires several such pattern transfers to be performed on silicon
dioxide, polysilicon, and metal. The basic patterning process used in all fabrication steps, however, is quite similar to the one shown in
Fig. 2.2. Also note that for accurate generation of high-density patterns required in sub-micron devices, electron beam (E-beam)
lithography is used instead of optical lithography. In the following, the main processing steps involved in the fabrication of an n-channel
MOS transistor on p-type silicon substrate will be examined.
Figure-2.3: The result of a single lithographic patterning sequence on silicon dioxide, without showing the intermediate steps. Compare
the unpatterned structure (top) and the patterned structure (bottom) with Fig. 2.2(b) and Fig. 2.2(g), respectively.
The process starts with the oxidation of the silicon substrate (Fig. 2.4(a)), in which a relatively thick silicon dioxide layer, also called
field oxide, is created on the surface (Fig. 2.4(b)). Then, the field oxide is selectively etched to expose the silicon surface on which the
MOS transistor will be created (Fig. 2.4(c)). Following this step, the surface is covered with a thin, high-quality oxide layer, which will
eventually form the gate oxide of the MOS transistor (Fig. 2.4(d)). On top of the thin oxide, a layer of polysilicon (polycrystalline
silicon) is deposited (Fig. 2.4(e)). Polysilicon is used both as gate electrode material for MOS transistors and also as an interconnect
medium in silicon integrated circuits. Undoped polysilicon has relatively high resistivity. The resistivity of polysilicon can be reduced,
however, by doping it with impurity atoms.
After deposition, the polysilicon layer is patterned and etched to form the interconnects and the MOS transistor gates (Fig. 2.4(f)). The
thin gate oxide not covered by polysilicon is also etched away, which exposes the bare silicon surface on which the source and drain
junctions are to be formed (Fig. 2.4(g)). The entire silicon surface is then doped with a high concentration of impurities, either through
diffusion or ion implantation (in this case with donor atoms to produce n-type doping). Figure 2.4(h) shows that the doping penetrates
the exposed areas on the silicon surface, ultimately creating two n-type regions (source and drain junctions) in the p-type substrate. The
impurity doping also penetrates the polysilicon on the surface, reducing its resistivity. Note that the polysilicon gate, which is patterned
before doping actually defines the precise location of the channel region and, hence, the location of the source and the drain regions.
Since this procedure allows very precise positioning of the two regions relative to the gate, it is also called the self-aligned process.
Figure-2.4: Process flow for the fabrication of an n-type MOSFET on p-type silicon.
Once the source and drain regions are completed, the entire surface is again covered with an insulating layer of silicon dioxide (Fig.
2.4(i)). The insulating oxide layer is then patterned in order to provide contact windows for the drain and source junctions (Fig. 2.4(j)).
The surface is covered with evaporated aluminum which will form the interconnects (Fig. 2.4(k)). Finally, the metal layer is patterned
and etched, completing the interconnection of the MOS transistors on the surface (Fig. 2.4(l)). Usually, a second (and third) layer of
metallic interconnect can also be added on top of this structure by creating another insulating oxide layer, cutting contact (via) holes,
depositing, and patterning the metal.
Having examined the basic process steps for pattern transfer through lithography, and having gone through the fabrication procedure of
a single n-type MOS transistor, we can now return to the generalized fabrication sequence of n-well CMOS integrated circuits, as shown
in Fig. 2.1. In the following figures, some of the important process steps involved in the fabrication of a CMOS inverter will be shown
by a top view of the lithographic masks and a cross-sectional view of the relevant areas.
The n-well CMOS process starts with a moderately doped (with impurity concentration typically less than 10^15 cm^-3) p-type silicon
substrate. Then, an initial oxide layer is grown on the entire surface. The first lithographic mask defines the n-well region. Donor atoms,
usually phosphorus, are implanted through this window in the oxide. Once the n-well is created, the active areas of the nMOS and
pMOS transistors can be defined. Figures 2.5 through 2.10 illustrate the significant milestones that occur during the fabrication process
of a CMOS inverter.
Figure-2.5: Following the creation of the n-well region, a thick field oxide is grown in the areas surrounding the transistor active
regions, and a thin gate oxide is grown on top of the active regions. The thickness and the quality of the gate oxide are two of the most
critical fabrication parameters, since they strongly affect the operational characteristics of the MOS transistor, as well as its long-term
reliability.
Figure-2.6: The polysilicon layer is deposited using chemical vapor deposition (CVD) and patterned by dry (plasma) etching. The
created polysilicon lines will function as the gate electrodes of the nMOS and the pMOS transistors and their interconnects. Also, the
polysilicon gates act as self-aligned masks for the source and drain implantations that follow this step.
Figure-2.7: Using a set of two masks, the n+ and p+ regions are implanted into the substrate and into the n-well, respectively. Also, the
ohmic contacts to the substrate and to the n-well are implanted in this process step.
Figure-2.8: An insulating silicon dioxide layer is deposited over the entire wafer using CVD. Then, the contacts are defined and etched
away to expose the silicon or polysilicon contact windows. These contact windows are necessary to complete the circuit
interconnections using the metal layer, which is patterned in the next step.
Figure-2.9: Metal (aluminum) is deposited over the entire chip surface using metal evaporation, and the metal lines are patterned
through etching. Since the wafer surface is non-planar, the quality and the integrity of the metal lines created in this step are very critical
and are ultimately essential for circuit reliability.
Figure-2.10: The composite layout and the resulting cross-sectional view of the chip, showing one nMOS and one pMOS transistor
(built-in n-well), the polysilicon and metal interconnections. The final step is to deposit the passivation layer (for protection) over the
chip, except for wire-bonding pad areas.
The patterning process using a succession of masks and process steps is conceptually summarized in Fig. 2.11. It is seen that a
series of masking steps must be sequentially performed for the desired patterns to be created on the wafer surface. An example of the
end result of this sequence is shown as a cross-section on the right.
Figure-2.11: Conceptual illustration of the mask sequence applied to create desired structures.
In this section, two examples will be given for advanced CMOS processes which offer additional benefits in terms of device
performance and integration density. These processes, namely, the twin-tub CMOS process and the silicon-on-insulator (SOI) process,
are becoming increasingly popular for sub-micron geometries, where device performance and density must be pushed beyond the
limits of the conventional n-well CMOS process.
This technology provides the basis for separate optimization of the nMOS and pMOS transistors, thus making it possible for threshold
voltage, body effect and the channel transconductance of both types of transistors to be tuned independently. Generally, the starting
material is an n+ or p+ substrate, with a lightly doped epitaxial layer on top. This epitaxial layer provides the actual substrate on which
the n-well and the p-well are formed. Since two independent doping steps are performed for the creation of the well regions, the dopant
concentrations can be carefully optimized to produce the desired device characteristics.
In the conventional n-well CMOS process, the doping density of the well region is typically about one order of magnitude higher than
the substrate, which, among other effects, results in unbalanced drain parasitics. The twin-tub process (Fig. 2.12) also avoids this
problem.
Rather than using silicon as the substrate material, technologists have sought to use an insulating substrate to improve process
characteristics such as speed and latch-up susceptibility. The SOI CMOS technology allows the creation of independent, completely
isolated nMOS and pMOS transistors virtually side-by-side on an insulating substrate (for example: sapphire). The main advantages of
this technology are the higher integration density (because of the absence of well regions), complete avoidance of the latch-up problem,
and lower parasitic capacitances compared to the conventional n-well or twin-tub CMOS processes. A cross-section of nMOS and
pMOS devices created using the SOI process is shown in Fig. 2.13.
The SOI CMOS process is considerably more costly than the standard n-well CMOS process. Yet the improvements of device
performance and the absence of latch-up problems can justify its use, especially for deep-sub-micron devices.
The physical mask layout of any circuit to be manufactured using a particular process must conform to a set of geometric constraints or
rules, which are generally called layout design rules. These rules usually specify the minimum allowable line widths for physical objects
on-chip such as metal and polysilicon interconnects or diffusion areas, minimum feature dimensions, and minimum allowable
separations between two such features. If a metal line width is made too small, for example, it is possible for the line to break during the
fabrication process or afterwards, resulting in an open circuit. If two lines are placed too close to each other in the layout, they may form
an unwanted short circuit by merging during or after the fabrication process. The main objective of design rules is to achieve a high
overall yield and reliability while using the smallest possible silicon area, for any circuit to be manufactured with a particular process.
Note that there is usually a trade-off between higher yield which is obtained through conservative geometries, and better area efficiency,
which is obtained through aggressive, high-density placement of various features on the chip. The layout design rules which are
specified for a particular fabrication process normally represent a reasonable optimum point in terms of yield and density. It must be
emphasized, however, that the design rules do not represent strict boundaries which separate "correct" designs from "incorrect" ones. A
layout which violates some of the specified design rules may still result in an operational circuit with reasonable yield, whereas another
layout observing all specified design rules may result in a circuit which is not functional and/or has very low yield. To summarize, we
can say, in general, that observing the layout design rules significantly increases the probability of fabricating a successful product with
high yield.
Layout design rules are typically specified in one of two forms:
● Micron rules, in which the layout constraints such as minimum feature sizes and minimum allowable feature separations are stated in terms of absolute dimensions in micrometers, or
● Lambda rules, which specify the layout constraints in terms of a single parameter (λ) and thus allow linear, proportional scaling of all geometrical constraints.
Lambda-based layout design rules were originally devised to simplify the industry-standard micron-based design rules and to allow
scaling capability for various processes. It must be emphasized, however, that most of the submicron CMOS process design rules do not
lend themselves to straightforward linear scaling. The use of lambda-based design rules must therefore be handled with caution in sub-micron geometries. In the following, we present a sample set of the lambda-based layout design rules devised for the MOSIS CMOS process and illustrate the implications of these rules on a section of a simple layout which includes two transistors (Fig. 2.14).
Figure-2.14: Illustration of some of the typical MOSIS layout design rules listed above.
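To illustrate the flavor of such rule sets, the toy checker below tests a feature's width and spacing against lambda-based limits. The rule values (e.g., 2-lambda poly width, 3-lambda metal width) are representative of published MOSIS-style rule sets but are given here as assumptions, not as the rules of any specific process.

LAMBDA_RULES = {          # layer: (min width, min spacing), in units of lambda
    "active": (3, 3),
    "poly":   (2, 2),
    "metal1": (3, 3),
}

def check_feature(layer, width, spacing, lam_um=0.5):
    """Check one feature; lam_um converts lambda units to microns."""
    min_w, min_s = LAMBDA_RULES[layer]
    errors = []
    if width < min_w * lam_um:
        errors.append(f"{layer} width {width} um < {min_w * lam_um} um")
    if spacing < min_s * lam_um:
        errors.append(f"{layer} spacing {spacing} um < {min_s * lam_um} um")
    return errors

print(check_feature("poly", width=0.8, spacing=1.2))  # width violation at lambda = 0.5 um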
Chapter 3
FULL-CUSTOM MASK LAYOUT DESIGN
● Introduction
● CMOS Layout Design Rules
● CMOS Inverter Layout Design
● Layout of CMOS NAND and NOR Gates
● Complex CMOS Logic Gates
3.1 Introduction
In this chapter, the basic mask layout design guidelines for CMOS logic gates will be presented. The design of physical layout is very
tightly linked to overall circuit performance (area, speed, power dissipation) since the physical structure directly determines the
transconductances of the transistors, the parasitic capacitances and resistances, and obviously, the silicon area which is used for a certain
function. On the other hand, the detailed mask layout of logic gates requires a very intensive and time-consuming design effort, which is
justifiable only in special circumstances where the area and/or the performance of the circuit must be optimized under very tight
constraints. Therefore, automated layout generation (e.g., standard cells + computer-aided placement and routing) is typically preferred
for the design of most digital VLSI circuits. In order to judge the physical constraints and limitations, however, the VLSI designer must
also have a good understanding of the physical mask layout process.
Mask layout drawings must strictly conform to a set of layout design rules as described in Chapter 2; therefore, we will start this chapter
with the review of a complete design rule set. The design of a simple CMOS inverter will be presented step-by-step, in order to show the
influence of various design rules on the mask structure and on the dimensions. Also, we will introduce the concept of stick diagrams,
which can be used very effectively to simplify the overall topology of layout in the early design phases. With the help of stick diagrams,
the designer can have a good understanding of the topological constraints, and quickly test several possibilities for the optimum layout
without actually drawing a complete mask diagram.
The physical (mask layout) design of CMOS logic gates is an iterative process which starts with the circuit topology (to realize the
desired logic function) and the initial sizing of the transistors (to realize the desired performance specifications). At this point, the
designer can only estimate the total parasitic load at the output node, based on the fan-out, the number of devices, and the expected
length of the interconnection lines. If the logic gate contains more than 4-6 transistors, the topological graph representation and the Euler-path method allow the designer to determine the optimum ordering of the transistors. A simple stick diagram layout can now be drawn,
showing the locations of the transistors, the local interconnections between the transistors and the locations of the contacts.
After a topologically feasible layout is found, the mask layers are drawn (using a layout editor tool) according to the layout design rules.
This procedure may require several small iterations in order to accommodate all design rules, but the basic topology should not change
very significantly. Following the final DRC (Design Rule Check), a circuit extraction procedure is performed on the finished layout to
determine the actual transistor sizes, and more importantly, the parasitic capacitances at each node. The result of the extraction step is
usually a detailed SPICE input file, which is automatically generated by the extraction tool.
Figure-3.1: The typical design flow for the production of a mask layout.
Now, the actual performance of the circuit can be determined
by performing a SPICE simulation using the extracted net-list. If the simulated circuit performance (e.g., transient response times or power dissipation) does not match the desired specifications, the layout must be modified and the whole process must be repeated. The
layout modifications are usually concentrated on the (W/L) ratios of the transistors (transistor re-sizing), since the width-to-length ratios
of the transistors determine the device transconductance and the parasitic source/drain capacitances. The designer may also decide to
change parts or all of the circuit topology in order to reduce the parasitics. The flow diagram of this iterative process is shown in Fig.
3.1.
As already discussed in Chapter 2, each mask layout design must conform to a set of layout design rules, which dictate the geometrical
constraints imposed upon the mask layers by the technology and by the fabrication process. The layout designer must follow these rules
in order to guarantee a certain yield for the finished product, i.e., a certain ratio of acceptable chips out of a fabrication batch. A design
which violates some of the layout design rules may still result in a functional chip, but the yield is expected to be lower because of
random process variations.
The design rules below are given in terms of scaleable lambda-rules. Note that while the concept of scaleable design rules is very
convenient for defining a technology-independent mask layout and for memorizing the basic constraints, most of the rules do not scale
linearly, especially for sub-micron technologies. This fact is illustrated in the right column, where a representative rule set is given in
real micron dimensions. A simple comparison with the lambda-based rules shows that there are significant differences. Therefore,
lambda-based design rules are simply not useful for sub-micron CMOS technologies.
In the following, the mask layout design of a CMOS inverter will be examined step-by-step. The circuit consists of one nMOS and one
pMOS transistor, therefore, one would assume that the layout topology is relatively simple. Yet, we will see that there exist quite a
number of different design possibilities even for this very simple circuit.
First, we need to create the individual transistors according to the design rules. Assume that we attempt to design the inverter with
minimum-size transistors. The width of the active area is then determined by the minimum diffusion contact size (which is necessary for
source and drain connections) and the minimum separation from diffusion contact to both active area edges. The width of the
polysilicon line over the active area (which is the gate of the transistor) is typically taken as the minimum poly width (Fig. 3.3). Then,
the overall length of the active area is simply determined by the following sum: (minimum poly width) + 2 x (minimum poly-to-contact spacing) + 2 x (minimum spacing from contact to active area edge). The pMOS transistor must be placed in an n-well region, and the
minimum size of the n-well is dictated by the pMOS active area and the minimum n-well overlap over n+. The distance between the
nMOS and the pMOS transistor is determined by the minimum separation between the n+ active area and the n-well (Fig. 3.4). The
polysilicon gates of the nMOS and the pMOS transistors are usually aligned. The final step in the mask layout is the local
interconnections in metal, for the output node and for the VDD and GND contacts (Fig. 3.5). Notice that in order to be biased properly,
the n-well region must also have a VDD contact.
Figure-3.3: Design rule constraints which determine the dimensions of a minimum-size transistor.
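The active-area length sum given above is easy to evaluate numerically. The short sketch below works through it with assumed lambda-rule values and an assumed lambda of 0.5 micron; these numbers are illustrative stand-ins, not the rules of an actual process.

LAM = 0.5                                 # lambda in microns (assumed)
min_poly_width        = 2 * LAM           # transistor gate length
min_poly_to_contact   = 2 * LAM           # poly edge to contact edge
min_contact_to_active = 1 * LAM           # contact edge to active-area edge

# The sum given in the text for a minimum-size transistor:
active_length = (min_poly_width
                 + 2 * min_poly_to_contact
                 + 2 * min_contact_to_active)
print(f"minimum active-area length: {active_length:.1f} um")   # 4.0 um here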
The initial phase of layout design can be simplified significantly by the use of stick diagrams - or so-called symbolic layouts. Here, the
detailed layout design rules are simply neglected and the main features (active areas, polysilicon lines, metal lines) are represented by
constant width rectangles or simple sticks. The purpose of the stick diagram is to provide the designer a good understanding of the
topological constraints, and to quickly test several possibilities for the optimum layout without actually drawing a complete mask
diagram. In the following, we will examine a series of stick diagrams which show different layout options for the CMOS inverter circuit.
The first two stick diagram layouts shown in Fig. 3.6 are the two most basic inverter configurations, with different alignments of the
transistors. In some cases, other signals must be routed over the inverter. For instance, if one or two metal lines have to be passed
through the middle of the cell from left to right, horizontal metal straps can be used to access the drain terminals of the transistors, which
in turn connect to a vertical Metal-2 line. Metal-1 can now be used to route the signals passing through the inverter. Alternatively, the
diffusion areas of both transistors may be used for extending the power and ground connections. This makes the inverter transistors
transparent to horizontal metal lines which may pass over.
The addition of a second metal layer allows more interconnect freedom. The second- level metal can be used for power and ground
supply lines, or alternatively, it may be used to vertically strap the input and the output signals. The final layout example in Fig. 3.6
shows one possibility of using a third metal layer, which is utilized for routing three signals on top.
The mask layout designs of CMOS NAND and NOR gates follow the general principles examined earlier for the CMOS inverter layout.
Figure 3.7 shows the sample layouts of a two-input NOR gate and a two-input NAND gate, using single-layer polysilicon and single-
layer metal. Here, the p-type diffusion area for the pMOS transistors and the n-type diffusion area for the nMOS transistors are aligned
in parallel to allow simple routing of the gate signals with two parallel polysilicon lines running vertically. Also notice that the two mask
layouts show a very strong symmetry, due to the fact that the NAND and the NOR gate have symmetrical circuit topologies. Finally,
Figs 3.8 and 3.9 show the major steps of the mask layout design for both gates, starting from the stick diagram and progressively
defining the mask layers.
Figure-3.7: Sample layouts of a CMOS NOR2 gate and a CMOS NAND2 gate.
Figure-3.8: Major steps required for generating the mask layout of a CMOS NOR2 gate.
Figure-3.9: Major steps required for generating the mask layout of a CMOS NAND2 gate.
The realization of complex Boolean functions (which may include several input variables and several product terms) typically requires a
series-parallel network of nMOS transistors which constitute the so-called pull-down net, and a corresponding dual network of pMOS
transistors which constitute the pull-up net. Figure 3.10 shows the circuit diagram and the corresponding network graphs of a complex
CMOS logic gate. Once the network topology of the nMOS pull-down network is known, the pull-up network of pMOS transistors can
easily be constructed by using the dual-graph concept.
Figure-3.10: A complex CMOS logic gate realizing a Boolean function with 5 input variables.
Now, we will investigate the problem of constructing a minimum-area layout for the complex CMOS logic gate. Figure 3.11 shows the
stick-diagram layout of a “first-attempt”, using an arbitrary ordering of the polysilicon gate columns. Note that in this case, the
separation between the polysilicon columns must be sufficiently wide to allow for two metal-diffusion contacts on both sides and one
diffusion-diffusion separation. This certainly consumes a considerable amount of extra silicon area.
If we can minimize the number of active-area breaks both for the nMOS and for the pMOS transistors, the separation between the
polysilicon gate columns can be made smaller. This, in turn, will reduce the overall horizontal dimension and the overall circuit layout
area. The number of active-area breaks can be minimized by changing the ordering of the polysilicon columns, i.e., by changing the
ordering of the transistors.
Figure-3.11: Stick diagram layout of the complex CMOS logic gate, with an arbitrary ordering of the polysilicon gate columns.
A simple method for finding the optimum gate ordering is the Euler-path method: simply find an Euler path in the pull-down network
graph and an Euler path in the pull-up network graph with an identical ordering of input labels, i.e., find a common Euler path for both
graphs. An Euler path is defined as an uninterrupted path that traverses each edge (branch) of the graph exactly once. Figure 3.12
shows the construction of a common Euler path for both graphs in our example.
Figure-3.12: Finding a common Euler path in both graphs for the pull-down and pull-up net provides a gate ordering that minimizes the
number of active-area breaks. In both cases, the Euler path starts at (x) and ends at (y).
It is seen that there is a common sequence (E-D-A-B-C) in both graphs. The polysilicon gate columns can be arranged according to this
sequence, which results in uninterrupted active areas for nMOS as well as for pMOS transistors. The stick diagram of the new layout is
shown in Fig. 3.13. In this case, the separation between two neighboring poly columns must allow only for one metal-diffusion contact.
The advantages of this new layout are more compact (smaller) layout area, simple routing of signals, and correspondingly, smaller
parasitic capacitance.
Figure-3.13: Optimized stick diagram layout of the complex CMOS logic gate.
It may not always be possible to construct a complete Euler path both in the pull-down and in the pull-up network. In that case, the best
strategy is to find sub-Euler-paths in both graphs, which should be as long as possible. This approach attempts to maximize the number
of transistors which can be placed in a single, uninterrupted active area.
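The search for a common ordering can also be mechanized. The sketch below brute-forces a common Euler-path ordering over both network graphs; the graph encoding and the example gate are hypothetical, and the approach assumes each input drives exactly one transistor in each net (true for the series-parallel gates discussed here).

```python
# Sketch: brute-force search for a common Euler-path gate ordering.
# Each transistor is an edge (node_a, node_b) keyed by its input label.
from itertools import permutations

def traces_path(edges_by_label, order):
    """Check that visiting the edges in 'order' forms a connected path."""
    def extend(prev_end, remaining):
        if not remaining:
            return True
        a, b = edges_by_label[remaining[0]]
        if prev_end is None:            # first edge: try both orientations
            return extend(b, remaining[1:]) or extend(a, remaining[1:])
        if prev_end == a:
            return extend(b, remaining[1:])
        if prev_end == b:
            return extend(a, remaining[1:])
        return False
    return extend(None, list(order))

def common_euler_order(pull_down, pull_up):
    for order in permutations(pull_down):
        if traces_path(pull_down, order) and traces_path(pull_up, order):
            return order
    return None

# Hypothetical example: f = NOT(A.B + C); node names are arbitrary labels.
pull_down = {"A": ("out", "x"), "B": ("x", "gnd"), "C": ("out", "gnd")}
pull_up   = {"A": ("vdd", "y"), "B": ("vdd", "y"), "C": ("y", "out")}
print(common_euler_order(pull_down, pull_up))   # e.g. ('A', 'B', 'C')
```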
Finally, Fig. 3.14 shows the circuit diagram of a CMOS one-bit full adder. The circuit has three inputs, and two outputs, sum and
carry_out. The corresponding mask layout of this circuit is given in Fig. 3.15. All input and output signals have been arranged in vertical
polysilicon columns. Notice that both the sum-circuit and the carry-circuit have been realized using one uninterrupted active area each.
Chapter 4
PARASITIC EXTRACTION AND PERFORMANCE
ESTIMATION FROM PHYSICAL STRUCTURE
● Introduction
● The Reality with Interconnections
● MOSFET Capacitances
● Interconnect Capacitance Estimation
● Interconnect Resistance Estimation
4.1 Introduction
In this chapter, we will investigate some of the physical factors which determine and ultimately limit the performance of digital VLSI
circuits. The switching characteristics of digital integrated circuits essentially dictate the overall operating speed of digital systems. The
dynamic performance requirements of a digital system are usually among the most important design specifications that must be met by
the circuit designer. Therefore, the switching speed of the circuits must be estimated and optimized very early in the design phase.
The classical approach for determining the switching speed of a digital block is based on the assumption that the loads are mainly
capacitive and lumped. Relatively simple delay models exist for logic gates with purely capacitive load at the output node, hence, the
dynamic behavior of the circuit can be estimated easily once the load is determined. The conventional delay estimation approaches seek
to classify three main components of the gate load, all of which are assumed to be purely capacitive, as: (1) internal parasitic
capacitances of the transistors, (2) interconnect (line) capacitances, and (3) input capacitances of the fan-out gates. Of these three
components, the load conditions imposed by the interconnection lines present serious problems.
Figure 4.1 shows a simple situation where an inverter is driving three other inverters, linked over interconnection lines of different
length and geometry. If the total load of each interconnection line can be approximated by a lumped capacitance, then the total load seen
by the primary inverter is simply the sum of all capacitive components described above. The switching characteristics of the inverter are
then described by the charge/discharge time of the load capacitance, as seen in Fig. 4.2. The expected output voltage waveform of the
inverter is given in Fig. 4.3, where the propagation delay time is the primary measure of switching speed. It can be shown very easily
that the signal propagation delay under these conditions is linearly proportional to the load capacitance.
In most cases, however, the load conditions imposed by the interconnection line are far from being simple. The line, itself a
three-dimensional structure in metal and/or polysilicon, usually has a non-negligible resistance in addition to its capacitance. The length/width
ratio of the wire usually dictates that the parameters are distributed, making the interconnect a true transmission line. Also, an
interconnect is very rarely “alone”, i.e., isolated from other influences. In real conditions, the interconnection line is in very close
proximity to a number of other lines, either on the same level or on different levels. The capacitive/inductive coupling and the signal
interference between neighboring lines should also be taken into consideration for an accurate estimation of delay.
Figure-4.2: CMOS inverter stage with lumped capacitive load at the output node.
Figure-4.3: Typical input and output waveforms of an inverter with purely capacitive load.
Consider the following situation where an inverter is driving two other inverters, over long interconnection lines. In general, if the time
of flight across the interconnection line (as determined by the speed of light) is shorter than the signal rise/fall times, then the wire can
be modeled as a capacitive load, or as a lumped or distributed RC network. If the interconnection lines are sufficiently long and the rise
times of the signal waveforms are comparable to the time of flight across the line, then the inductance also becomes important, and the
interconnection lines must be modeled as transmission lines. Taking into consideration the RLCG (resistance, inductance, capacitance,
and conductance) parasitics (as seen in Fig. 4.4), the signal transmission across the wire becomes a very complicated matter, compared
to the relatively simplistic lumped-load case. Note that signal integrity can be degraded significantly when the output
impedance of the driver is much lower than the characteristic impedance of the transmission line.
Figure-4.4: (a) An RLCG interconnection tree. (b) Typical signal waveforms at the nodes A and B, showing the signal delay and the
various delay components.
The transmission-line effects have not been a serious concern in CMOS VLSI until recently, since the gate delay originating from purely
or mostly capacitive load components dominated the line delay in most cases. But as the fabrication technologies move to finer (sub-
micron) design rules, the intrinsic gate delay components tend to decrease dramatically. By contrast, the overall chip size does not
decrease - designers just put more functionality on the same-sized chip. A 100 mm² chip has been a standard large chip for almost a
decade. The factors which determine the chip size are mainly driven by the packaging technology, manufacturing equipment, and the
yield. Since the chip size and the worst-case line length on a chip remain unchanged, the importance of interconnect delay increases in
sub-micron technologies. In addition, as the widths of metal lines shrink, the transmission-line effects and the signal coupling between
neighboring lines become more pronounced.
This fact is illustrated in Fig. 4.5, where typical intrinsic gate delay and interconnect delay are plotted qualitatively, for different
technologies. It can be seen that for sub-micron technologies, the interconnect delay starts to dominate the gate delay. In order to deal
with the implications and to optimize a system for speed, the designers must have reliable and efficient means of (1) estimating the
interconnect parasitics in a large chip, and (2) simulating the time-domain effects. Yet we will see that neither of these tasks is simple -
interconnect parasitic extraction and accurate simulation of line effects are two of the most difficult problems in physical design of VLSI
circuits today.
Once we establish the fact that the interconnection delay becomes a dominant factor in CMOS VLSI, the next question is: how many of
the interconnections in a large chip may cause serious problems in terms of delay. The hierarchical structure of most VLSI designs
offers some insight on this question. In a chip consisting of several functional modules, each module contains a relatively large number
of local connections between its functional blocks, logic gates, and transistors. Since these intra-module connections are usually made
over short distances, their influence on speed can be simulated easily with conventional models. Yet there are also a fair number of
longer connections between the modules on a chip, the so-called inter-module connections. It is usually these inter-module connections
which should be scrutinized in the early design phases for possible timing problems. Figure 4.6 shows the typical statistical distribution
of wire lengths on a chip, normalized for the chip diagonal length. The distribution plot clearly exhibits two distinct peaks, one for the
relatively shorter intra-module connections, and the other for the longer inter-module connections. Also note that a small number of
interconnections may be very long, typically longer than the chip diagonal length. These lines are usually required for global signal bus
connections, and for clock distribution networks. Although their numbers are relatively small, these long interconnections are obviously
the most problematic ones.
To summarize the message of this section, we state that: (1) interconnection delay is becoming the dominant factor which determines
the dynamic performance of large-scale systems, and (2) interconnect parasitics are difficult to model and to simulate. In the following
sections, we will concentrate on various aspects of on-chip parasitics, and we will mainly consider capacitive and resistive components.
The first component of capacitive parasitics we will examine is the MOSFET capacitances. These parasitic components are mainly
responsible for the intrinsic delay of logic gates, and they can be modeled with fairly high accuracy for gate delay estimation. The
extraction of transistor parasitics from physical structure (mask layout) is also fairly straightforward.
The parasitic capacitances associated with a MOSFET are shown in Fig. 4.7 as lumped elements between the device terminals. Based on
their physical origins, the parasitic device capacitances can be classified into two major groups: (1) oxide-related capacitances and (2)
junction capacitances. The gate-oxide-related capacitances are Cgd (gate-to-drain capacitance), Cgs (gate-to-source capacitance), and
Cgb (gate-to-substrate capacitance). Notice that in reality, the gate-to-channel capacitance is distributed and voltage dependent.
Consequently, all of the oxide-related capacitances described here change with the bias conditions of the transistor. Figure 4.8 shows
qualitatively the oxide-related capacitances during cut-off, linear-mode operation and saturation of the MOSFET. The simplified
variation of the three capacitances with gate-to-source bias voltage is shown in Fig. 4.9.
Figure-4.8: Schematic representation of MOSFET oxide capacitances during (a) cut-off, (b) linear- mode operation, and (c) saturation.
Note that the total gate oxide capacitance is mainly determined by the parallel-plate capacitance between the polysilicon gate and the
underlying structures. Hence, the magnitude of the oxide-related capacitances is very closely related to (1) the gate oxide thickness, and
(2) the area of the MOSFET gate. Obviously, the total gate capacitance decreases with decreasing device dimensions (W and L), yet it
increases with decreasing gate oxide thickness. In sub-micron technologies, the horizontal dimensions (which dictate the gate area) are
usually scaled down more easily than the vertical dimensions, such as the gate oxide thickness. Consequently, MOSFET transistors
fabricated using sub-micron technologies have, in general, smaller gate capacitances.
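As a concrete illustration, a minimal sketch of the standard region-dependent gate-capacitance approximations is given below; the device dimensions and the overlap term are hypothetical placeholders.

```python
# Sketch of the usual region-dependent gate-capacitance approximations
# (cut-off / linear / saturation); parameter values are hypothetical.
EPS_OX = 3.45e-11   # F/m, permittivity of SiO2

def gate_caps(W, L, tox, region, C_overlap=0.0):
    """Return (Cgs, Cgd, Cgb) in farads for a given operating region."""
    Cox = EPS_OX / tox * W * L        # total gate-oxide capacitance
    if region == "cutoff":            # no channel: gate couples to the body
        return (C_overlap, C_overlap, Cox)
    if region == "linear":            # channel shields the body
        return (Cox / 2 + C_overlap, Cox / 2 + C_overlap, 0.0)
    if region == "saturation":        # channel pinched off at the drain
        return (2 * Cox / 3 + C_overlap, C_overlap, 0.0)
    raise ValueError(region)

# Example: W = 1 um, L = 0.8 um, tox = 16 nm device in saturation
print(gate_caps(1e-6, 0.8e-6, 16e-9, "saturation"))
```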
Now we consider the voltage-dependent source-to-substrate and drain-to-substrate capacitances, Csb and Cdb. Both of these
capacitances are due to the depletion charge surrounding the respective source or drain regions of the transistor, which are embedded in
the substrate. Figure 4.10 shows the simplified geometry of an n-type diffusion region within the p-type substrate. Here, the diffusion
region has been approximated by a rectangular box, which consists of five planar pn-junctions. The total junction capacitance is a
function of the junction area (sum of all planar junction areas), the doping densities, and the applied terminal voltages. Accurate
methods for estimating the junction capacitances based on these data are readily available in the literature, therefore, a detailed
discussion of capacitance calculations will not be presented here.
One important aspect of parasitic device junction capacitances is that the amount of capacitance is a linear function of the junction area.
Consequently, the size of the drain or the source diffusion area dictates the amount of parasitic capacitance. In sub-micron technologies,
where the overall dimensions of the individual devices are scaled down, the parasitic junction capacitances also decrease significantly.
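For illustration, the sketch below uses the common SPICE-style area-plus-sidewall junction model; all parameter values are hypothetical placeholders rather than data for a specific process.

```python
# Sketch: bias-dependent junction capacitance of a drain/source diffusion,
# using the usual area + sidewall model. Parameter values are hypothetical.
def junction_cap(area, perim, Vr, CJ=3e-4, CJSW=3e-10, PB=0.9, MJ=0.5, MJSW=0.33):
    """area in m^2, perim in m, Vr = reverse bias in volts (Vr >= 0)."""
    bottom   = CJ   * area  / (1 + Vr / PB) ** MJ     # bottom-plate junction
    sidewall = CJSW * perim / (1 + Vr / PB) ** MJSW   # sidewall junctions
    return bottom + sidewall

# Example: a 4 um x 2 um diffusion region reverse-biased at 2.5 V
A, P = 4e-6 * 2e-6, 2 * (4e-6 + 2e-6)
print(f"Cj = {junction_cap(A, P, 2.5) * 1e15:.2f} fF")
```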
It was already mentioned that the MOSFET parasitic capacitances are mainly responsible for the intrinsic delay of logic gates. We have
seen that both the oxide-related parasitic capacitances and the junction capacitances tend to decrease with shrinking device dimensions,
hence, the relative significance of intrinsic gate delay diminishes in sub-micron technologies.
Figure-4.10: Three-dimensional view of the n-type diffusion region within the p-type substrate.
In a typical VLSI chip, the parasitic interconnect capacitances are among the most difficult parameters to estimate accurately. Each
interconnection line (wire) is a three dimensional structure in metal and/or polysilicon, with significant variations of shape, thickness,
and vertical distance from the ground plane (substrate). Also, each interconnect line is typically surrounded by a number of other lines,
either on the same level or on different levels. Figure 4.11 shows a possible, realistic situation where interconnections on three different
levels run in close proximity of each other. The accurate estimation of the parasitic capacitances of these wires with respect to the
ground plane, as well as with respect to each other, is obviously a complicated task.
Unfortunately for the VLSI designers, most of the conventional computer-aided VLSI design tools have a relatively limited capability of
interconnect parasitic estimation. This is true even for the design tools regularly used for sub-micron VLSI design, where interconnect
parasitics were shown to be very dominant. The designer should therefore be aware of the physical problem and try to incorporate this
knowledge early in the design phase, when the initial floorplanning of the chip is done.
First, consider the section of a single interconnect which is shown in Fig. 4.12. It is assumed that this wire segment has a length of (l) in
the current direction, a width of (w) and a thickness of (t). Moreover, we assume that the interconnect segment runs parallel to the chip
surface and is separated from the ground plane by a dielectric (oxide) layer of height (h). Now, the correct estimation of the parasitic
capacitance with respect to ground is an important issue. Using the basic geometry given in Fig. 4.12, one can calculate the parallel-
plate capacitance Cpp of the interconnect segment. However, in interconnect lines where the wire thickness (t) is comparable in
magnitude to the ground-plane distance (h), fringing electric fields significantly increase the total parasitic capacitance (Fig. 4.13).
Figure-4.12: Interconnect segment running parallel to the surface, used for parasitic capacitance estimations.
Figure-4.13: Influence of fringing electric fields upon the parasitic wire capacitance.
Figure 4.14 shows the variation of the fringing-field factor FF = Ctotal/Cpp, as a function of (t/h), (w/h) and (w/l). It can be seen that the
influence of fringing fields increases with the decreasing (w/h) ratio, and that the fringing-field capacitance can be as much as 10-20
times larger than the parallel-plate capacitance. It was mentioned earlier that the sub-micron fabrication technologies allow the width of
the metal lines to be decreased somewhat, yet the thickness of the line must be preserved in order to ensure structural integrity. This
situation, which involves narrow metal lines with a considerable vertical thickness, is especially vulnerable to fringing field effects.
A set of simple formulas developed by Yuan and Trick in the early 1980’s can be used to estimate the capacitance of the interconnect
structures in which fringing fields complicate the parasitic capacitance calculation. The following two cases are considered for two
different ranges of line width (w).
(4.1)
(4.2)
These formulas permit the accurate approximation of the parasitic capacitance values to within 10% error, even for very small values of
(t/h). Figure 4.15 shows a different view of the line capacitance as a function of (w/h) and (t/h). The linear dash-dotted line in this plot
represents the corresponding parallel-plate capacitance, and the other two curves represent the actual capacitance, taking into account
the fringing-field effects.
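Since Eqs. (4.1) and (4.2) are not reproduced above, the sketch below substitutes the comparable Sakurai-Tamaru empirical fit for the capacitance of a single line over a ground plane; the wire dimensions in the example are hypothetical.

```python
# Sketch: single-line wire capacitance with fringing fields. The Yuan-Trick
# expressions (4.1)-(4.2) are not reproduced here, so the Sakurai-Tamaru fit
# C/l = eps * (1.15*(w/h) + 2.80*(t/h)**0.222) is used as a stand-in.
EPS_OX = 3.45e-11  # F/m, SiO2 permittivity

def c_parallel_plate(w, l, h):
    return EPS_OX * w * l / h

def c_with_fringing(w, l, t, h):
    return EPS_OX * l * (1.15 * (w / h) + 2.80 * (t / h) ** 0.222)

# Example: 1 mm long metal line, w = 1 um, t = 1 um, h = 1 um
w, l, t, h = 1e-6, 1e-3, 1e-6, 1e-6
cpp, ctot = c_parallel_plate(w, l, h), c_with_fringing(w, l, t, h)
print(f"Cpp = {cpp*1e15:.0f} fF, Ctotal = {ctot*1e15:.0f} fF, FF = {ctot/cpp:.2f}")
```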
Now consider the more realistic case where the interconnection line is not “alone” but is coupled with other lines running in parallel. In
this case, the total parasitic capacitance of the line is not only increased by the fringing-field effects, but also by the capacitive coupling
between the lines. Figure 4.16 shows the capacitance of a line which is coupled with two other lines on both sides, separated by the
minimum design rule. Especially if both of the neighboring lines are biased at ground potential, the total parasitic capacitance of the
interconnect running in the middle (with respect to the ground plane) can be more than 20 times as large as the simple parallel-plate
capacitance. Note that the capacitive coupling between neighboring lines is increased when the thickness of the wire is comparable to its
width.
Figure 4.17 shows the cross-section view of a double-metal CMOS structure, where the individual parasitic capacitances between the
layers are also indicated. The cross-section does not show a MOSFET, but just a portion of a diffusion region over which some metal
lines may pass. The inter-layer capacitances between the metal-2 and metal-1, metal-1 and polysilicon, and metal-2 and polysilicon are
labeled as Cm2m1, Cm1p and Cm2p, respectively. The other parasitic capacitance components are defined with respect to the substrate.
If the metal line passes over an active region, the oxide thickness underneath is smaller (because of the active area window), and
consequently, the capacitance is larger. These special cases are labeled as Cm1a and Cm2a. Otherwise, the thick field oxide layer results
in a smaller capacitance value.
Figure-4.17: Cross-sectional view of a double-metal CMOS structure, showing capacitances between layers.
The vertical thickness values of the different layers in a typical 0.8 micron CMOS technology are given below as an example.
The list below contains the capacitance values between various layers, also for a typical 0.8 micron CMOS technology.
For the estimation of interconnect capacitances in a complicated three-dimensional structure, the exact geometry must be taken into
account for every portion of the wire. Yet this requires an unacceptable amount of computation in a large circuit, even if simple
formulas are applied for the calculation of capacitances. Usually, chip manufacturers supply the area capacitance (parallel-plate cap) and
the perimeter capacitance (fringing-field cap) figures for each layer, which are backed up by measurement of capacitance test structures.
These figures can be used to extract the parasitic capacitances from the mask layout. It is often prudent to include test structures on chip
that enable the designer to independently calibrate a process to a set of design tools. In some cases where the entire chip performance is
influenced by the parasitic capacitance of a specific line, accurate 3-D simulation is the only reliable solution.
The parasitic resistance of a metal or polysilicon line can also have a profound influence on the signal propagation delay over that line.
The resistance of a line depends on the type of material used (polysilicon, aluminum, gold, ...), the dimensions of the line and finally, the
number and locations of the contacts on that line. Consider again the interconnection line shown in Fig. 4.12. The total resistance in the
indicated current direction can be found as
R = ρ · l / (w · t) = Rsheet · (l / w)    (4.3)
where the Greek letter ρ (rho) represents the characteristic resistivity of the interconnect material, and Rsheet represents the sheet resistivity
of the line, in (ohm/square). For a typical polysilicon layer, the sheet resistivity is between 20-40 ohm/square, whereas the sheet
resistivity of silicide is about 2-4 ohm/square. Using the formula given above, we can estimate the total parasitic resistance of a wire
segment based on its geometry. Typical metal-poly and metal-diffusion contact resistance values are between 20-30 ohms, while typical
via resistance is about 0.3 ohms.
In most short-distance aluminum and silicide interconnects, the amount of parasitic wire resistance is usually negligible. On the other
hand, the effects of the parasitic resistance must be taken into account for longer wire segments. As a first-order approximation in
simulations, the total lumped resistance may be assumed to be connected in series with the total lumped capacitance of the wire. A much
better approximation of the influence of distributed parasitic resistance can be obtained by using an RC-ladder network model to
represent the interconnect segment (Fig. 4.18). Here, the interconnect segment is divided into smaller, identical sectors, and each sector
is represented by an RC-cell. Typically, the number of these RC-cells (i.e., the resolution of the RC model) determines the accuracy of
the simulation results. On the other hand, simulation time restrictions usually limit the resolution of this distributed line model.
Figure-4.18: RC-ladder network used to model the distributed resistance and capacitance of an interconnect.
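A minimal sketch of this model follows, using the Elmore approximation for the delay of the RC ladder of Fig. 4.18; the resistance and capacitance values are hypothetical.

```python
# Sketch: Elmore delay of the RC-ladder model of Fig. 4.18. The wire is cut
# into n identical sectors of resistance R/n and capacitance C/n each; the
# Elmore delay sums (upstream resistance x sector capacitance) over sectors.
def elmore_delay(R_total, C_total, n):
    r, c = R_total / n, C_total / n
    return sum((i + 1) * r * c for i in range(n))   # = R*C*(n+1)/(2n)

# Example: R = 500 ohm, C = 1 pF total (hypothetical values)
R, C = 500.0, 1e-12
for n in (1, 4, 16, 64):
    print(f"n = {n:3d}:  Td = {elmore_delay(R, C, n)*1e12:.0f} ps")
# converges toward the distributed-line value R*C/2 as the resolution n grows
```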
Chapter 5
CLOCK SIGNALS AND SYSTEM TIMING
Clock signals are the heartbeats of digital systems. Hence, the stability of clock signals is highly important. Ideally, clock signals should
have minimum rise and fall times, specified duty cycles, and zero skew. In reality, clock signals have nonzero skews and noticeable rise
and fall times; duty cycles can also vary. In fact, as much as 10% of a machine cycle time is expended to allow realistic clock skews in
large computer systems. The problem is no less serious in VLSI chip design. A simple technique for on-chip generation of a primary
clock signal would be to use a ring oscillator as shown in Fig. 5.1. Such a clock circuit has been used in low-end microprocessor chips.
Figure 5.1: Simple on-chip clock generation circuit using a ring oscillator.
However, the generated clock signal can be quite process-dependent and unstable. As a result, separate clock chips which use crystal
oscillators have been used for high- performance VLSI chip families. Figure 5.2 shows the circuit schematic of a Pierce crystal oscillator
with good frequency stability. This circuit is a near series-resonant circuit in which the crystal sees a low load impedance across its
terminals. Series resonance exists in the crystal, but its internal series resistance largely determines the oscillation frequency. In its
equivalent circuit model, the crystal can be represented as a series RLC circuit; thus, the higher the series resistance, the lower the
oscillation frequency. The external load at the terminals of the crystal also has a considerable effect on the frequency and the frequency
stability. The inverter across the crystal provides the necessary voltage differential, and the external inverter provides the amplification
to drive clock loads. Note that the oscillator circuit presented here is by no means a typical example of the state-of-the-art; design of
high-frequency, high-quality clock oscillators is a formidable task, which is beyond the scope of this section.
Usually a VLSI chip receives one or more primary clock signals from an external clock chip and, in turn, generates necessary
derivatives for its internal use. It is often necessary to use two non-overlapping clock signals, whose logical product is zero at all
times. Figure 5.3 shows a simple circuit that generates CK-1 and CK-2 from the original clock signal CK.
Figure 5.4 shows a clock decoder circuit that takes in the primary clock signals and generates four phase signals.
Figure-5.3: A simple circuit that generates a pair of non-overlapping clock signals from CK.
Figure-5.4: Clock decoder circuit: (a) symbolic representation and (b) sample waveforms and gate-level implementation.
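One common realization of such a generator cross-couples two NOR gates fed by CK and its inverse; assuming that structure for the circuit of Fig. 5.3, the unit-delay simulation below demonstrates the non-overlap property.

```python
# Sketch: unit-delay simulation of a cross-coupled NOR non-overlapping clock
# generator (assumed structure, not necessarily that of Fig. 5.3 exactly).
# The gate delays create the non-overlap gap at every clock transition.
def simulate(ck_stream):
    ck1 = ck2 = 0
    for ck in ck_stream:
        # evaluate both NORs from the previous outputs (one gate delay each)
        ck1, ck2 = int(not (ck or ck2)), int(not ((1 - ck) or ck1))
        yield ck, ck1, ck2

ck_stream = [0]*4 + [1]*4 + [0]*4 + [1]*4
for ck, ck1, ck2 in simulate(ck_stream):
    assert not (ck1 and ck2)          # logical product is zero at all times
    print(ck, ck1, ck2)
```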
Since clock signals are required almost uniformly over the chip area, it is desirable that all clock signals are distributed with a uniform
delay. An ideal distribution network would be the H-tree structure shown in Fig. 5.5. In such a structure, the distances from the center to
all branch points are the same and hence, the signal delays would be the same. However, this structure is difficult to implement in
practice due to routing constraints and different fanout requirements. A more practical approach for clock-signal distribution is to route
main clock signals to macroblocks and use local clock decoders to carefully balance the delays under different loading conditions.
The reduction of clock skews, which are caused by the differences in clock arrival times and changes in clock waveforms due to
variations in load conditions, is a major concern in high-speed VLSI design. In addition to uniform clock distribution (H-tree) networks
and local skew balancing, a number of new computer-aided design techniques have been developed to automatically generate the layout
of an optimum clock distribution network with zero skew. Figure 5.6 shows a zero-skew clock routing network that was constructed
based on estimated routing parasitics.
Regardless of the exact geometry of the clock distribution network, the clock signals must be buffered in multiple stages as shown in
Fig. 5.7 to handle the high fan-out loads. It is also essential that every buffer stage drives the same number of fan-out gates so that the
clock delays are always balanced. In the configuration shown in Fig. 5.8 (used in the DEC Alpha chip designs), the interconnect wires
are cross-connected with vertical metal straps in a mesh pattern, in order to keep the clock signals in phase across the entire chip.
So far we have seen the needs for having equal interconnect lengths and extensive buffering in order to distribute clock signals with
minimal skews and healthy signal waveforms. In practice, designers must spend significant time and effort to tune the transistor sizes in
buffers (inverters) and also the widths of interconnects. Widening the interconnection wires decreases the series resistance, but at the
cost of increased parasitic capacitance.
Figure-5.6: An example of the zero-skew clock routing network, generated by a computer-aided design tool.
Figure-5.8: General structure of the clock distribution network used in DEC Alpha microprocessor chips.
The following points should always be considered carefully in digital system design, but especially for successful high-speed VLSI
design:
● The ideal duty cycle of a clock signal is 50%, and a signal with an ideal duty cycle can travel farther in a chain of inverting buffers.
The duty cycle of a clock signal can be improved, i.e., made closer to 50%, by using feedback based on the average voltage.
● To prevent reflection in the interconnection network, the rise time and the fall time of the clock signal should not be reduced
excessively.
● The load capacitance should be reduced as much as possible, by reducing the fan-out, the interconnection lengths and the gate
capacitances.
● The characteristic impedance of the clock distribution line should be reduced by using properly increased (w/h)-ratios (the ratio
of the line width to vertical separation distance of the line from the substrate).
● Inductive loads can be used to partially cancel the effects of parasitic capacitance of a clock receiver (matching network).
● Adequate separation should be maintained between high-speed clock lines in order to prevent cross-talk. Also, placing a power
or ground rail between two high-speed lines can be an effective measure.
Chapter 6
ARITHMETIC FOR DIGITAL SYSTEMS
● Introduction
● Notation Systems
● Principle of Generation and Propagation
● The 1-bit Full Adder
● Enhancement Techniques for Adders
● Multioperand Adders
● Multiplication
● Addition and Multiplication in Galois Fields, GF(2n)
6.1 Introduction
Computation speeds have increased dramatically during the past three decades as a result of the development of various technologies.
The execution speed of an arithmetic operation is a function of two factors. One is the circuit technology and the other is the algorithm
used. It can be rather confusing to discuss both factors simultaneously; for instance, a ripple-carry adder implemented in GaAs
technology may be faster than a carry-look-ahead adder implemented in CMOS. Further, in any technology, logic path delay depends
upon many different factors: the number of gates through which a signal has to pass before a decision is made, the logic capability of
each gate, cumulative distance among all such serial gates, the electrical signal propagation time of the medium per unit distance, etc.
Because the logic path delay is attributable to the delay internal and external to logic gates, a comprehensive model of performance
would have to include technology, distance, placement, layout, electrical and logical capabilities of the gates. It is not feasible to make a
general model of arithmetic performance and include all these variables.
The purpose of this chapter is to give an overview of the different components used in the design of arithmetic operators. The following
parts will not exhaustively go through all these components. However, the algorithms used, some mathematical concepts, the
architectures, the implementations at the block, transistor or even mask level will be presented. This chapter will start with the presentation
of various notation systems. These are important because they influence the architectures, the size and the performance of the arithmetic
components. The well-known and widely used principle of generation and propagation will be explained, and basic implementations at transistor
level will be given as examples. The basic full adder cell (FA) will be shown as a brick used in the construction of various systems. After
that, the problem of building large adders will lead to the presentation of enhancement techniques. Multioperand adders are of particular
interest when building special CPUs and especially multipliers. That is why certain algorithms will be introduced to give a better idea of
the building of multipliers. After presenting the classical approaches, a logarithmic multiplier and multiplication and addition in
Galois fields will be briefly introduced. Muller [Mull92] and Cavanagh [Cava83] constitute two reference books on the matter.
The binary number system is the most conventional and easily implemented system for internal use in digital computers. It is also a
positional number system: a number is encoded as a vector of n bits (digits), in which each bit is weighted according to its
position in the vector. Associated with each vector is a base (or radix) r. Each bit has an integer value in the range 0 to r-1. In the binary
system, where r = 2, each bit has the value 0 or 1. Consider an n-bit vector of the form:

A = a(n-1) a(n-2) ... a1 a0    (1)

where ai = 0 or 1 for i in [0, n-1]. This vector can represent positive integer values V = A in the range 0 to 2^n - 1, where:

V = Σ ai · 2^i, for i = 0, ..., n-1    (2)

The above representation can be extended to include fractions. For example, the string of binary digits 1101.11 can be
interpreted to represent the quantity:

1·2^3 + 1·2^2 + 0·2^1 + 1·2^0 + 1·2^-1 + 1·2^-2 = 13.75    (3)
The following Table 6.1 shows the 3-bit vectors and the decimal values they represent.
If only positive integers were to be represented in fixed-point notation, then an n-bit word would permit a range from 0 to 2^n - 1.
However, both positive and negative integers are used in computations and an encoding scheme must be devised in which both positive
and negative numbers are distributed as evenly as possible. There must also be an easy way to distinguish between positive and negative
numbers. The leftmost digit is usually reserved for the sign. Consider the following number A with radix r,
The remaining digits in A indicate either the true value or the magnitude of A in a complemented form.
Signed-magnitude representation: the high-order bit indicates the sign of the integer (0 for positive, 1 for negative). A positive number has a range of
0 to 2^(n-1) - 1, and a negative number has a range of 0 to -(2^(n-1) - 1). The representation of a positive number is:
One problem with this kind of notation is the dual representation of the number 0. Another problem arises when adding two numbers with
opposite signs: the magnitudes have to be compared to determine the sign of the result.
One's complement representation: here too, the high-order bit indicates the sign of the integer (0 for positive, 1 for negative). A positive number has a
range of 0 to 2^(n-1) - 1, and a negative number has a range of 0 to -(2^(n-1) - 1). The representation of a positive number is:
As in the signed-magnitude notation, the number 0 has a dual representation, and when adding two numbers with opposite signs the
magnitudes have to be compared to determine the sign of the result.
Two's complement representation: in this notation system (radix 2), the value of A is represented such that:
The sign test is again a simple comparison of two bits. There is a unique representation of 0, and addition and subtraction are easier because
the result always comes out in a unique 2's complement form.
In some particular operations requiring large additions, such as multiplication or filtering, the carry-save notation is used.
This notation can be combined with 1's complement, 2's complement, or any other representation. It simply means that the result of an
addition is coded with two digit vectors: the carry vector and the sum vector. When we come to multioperand adders and multipliers, this
notion will become self-evident.
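A minimal sketch of the idea: a row of full adders compresses three operands into a sum vector and a carry vector without any carry propagation; only one ordinary addition at the end resolves the carries. The operand values below are arbitrary.

```python
# Sketch: carry-save addition of three operands. Each bit position behaves
# like an independent full adder, so no carry ripples across positions.
def carry_save(a, b, c):
    s  = a ^ b ^ c                              # per-bit sums
    cy = ((a & b) | (a & c) | (b & c)) << 1     # per-bit carries, shifted
    return s, cy

a, b, c = 27, 45, 99
s, cy = carry_save(a, b, c)
print(s, cy, s + cy == a + b + c)   # True: one final add resolves the carries
```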
It has been stated that each bit in a number system has an integer value in the range 0 to r-1. This produces a digit set S:
S = {0, 1, 2, ..., r-1}    (4)
in which all the digits of the set are positively weighted. It is also possible to have a digit set in which both positive- and negative-
weighted digits are allowed [Aviz61] [Taka87], such as:
T = {-l, ..., -1, 0, 1, ..., l}    (5)
where l is a positive integer representing the upper limit of the set. This is considered a redundant number system, because there may
be more than one way to represent a given number. Each digit of a redundant number system can assume the 2l + 1 values of the set T.
The range of l is:
(6)
For any number x, the ceiling of x is the smallest integer not less than x. The floor of x is the largest integer not greater than x. Since
the integer l ≥ 1 and the radix r ≥ 2, the maximum magnitude of l will be
l_max = r - 1    (7)
(8)
(9)
For example, for n = 4 and r = 2, the number A = -5 has four representations, as shown below in Table 6.5.

      2^3  2^2  2^1  2^0
A =    0   -1    0   -1
A =    0   -1   -1    1
A =   -1    0    1    1
A =   -1    1    0   -1

Table-6.5: Redundant representations of A = -5 when r = 2
This multirepresentation makes redundant number systems difficult to use for certain arithmetic operations. Also, since each signed digit
may require more than one bit to represent the digit, this may increase both the storage and the width of the storage bus.
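The rows of Table 6.5 can be checked mechanically:

```python
# Sketch: checking the four signed-digit representations of A = -5 from
# Table 6.5 (radix 2, digit set {-1, 0, +1}).
def sd_value(digits):                 # digits listed from 2^3 down to 2^0
    return sum(d * 2**i for i, d in enumerate(reversed(digits)))

for rep in ([0, -1, 0, -1], [0, -1, -1, 1], [-1, 0, 1, 1], [-1, 1, 0, -1]):
    print(rep, "->", sd_value(rep))   # every row evaluates to -5
```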
However, redundant number systems have an advantage for addition: it is possible to eliminate the problem of the
propagation of the carry bit, so that the operation can be done in a constant time, independent of the length of the data word. The conversion
from binary to redundant binary is usually a duplication or juxtaposition of bits and costs nothing. The opposite conversion, on the
contrary, amounts to an addition, and there the propagation of the carry bit cannot be removed.
Let us consider the example where r = 2 and l = 1. In this system the three digits used are -1, 0, +1.
The addition of 7 and 5 gives 12 in decimal. The same operation in a non-redundant binary system is 111 + 101:
We note that a carry bit has to be added to the next digits when performing the operation "by hand". In the redundant system, the same
operation absorbs the carry bit, which is never propagated to the higher-order digits:
The result 1001100 now has to be converted back to the non-redundant binary system. To achieve that, each couple of bits has to be added
together, and the eventual carry has to be propagated to the higher-order bits:
The principle of Generation and Propagation seems to have been discussed for the first time by Burks, Goldstine and Von Neumann
[BGNe46]. It is based on a simple remark: when adding two numbers A and B in 2’s complement or in the simplest binary representation
(A = an-1 ... a1a0, B = bn-1 ... b1b0), when ai = bi it is not necessary to know the carry ci. So it is not necessary to wait for its calculation in
order to determine ci+1 and the sum si+1.
If ai=bi=0, then necessarily ci+1=0
If ai=bi=1, then necessarily ci+1=1
This means that when ai = bi, it is possible to add the bits of rank higher than i before the carry information ci+1 has arrived. The time
required to perform the addition will be proportional to the length of the longest chain i, i+1, i+2, ..., i+p such that ak ≠ bk for k in
[i, i+p].
It has been shown [BGNe46] that the average value of this longest chain is proportional to the logarithm of the number of bits used to
represent A and B. By using this principle of generation and propagation it is possible to design an adder with an average delay of O(log n).
However, this type of adder is usable only in asynchronous systems [Mull82]. Today the complexity of systems is so high that
asynchronous timing of the operations is rarely implemented. That is why the problem is rather to minimize the maximum delay than
the average delay.
Generation:
The principle of generation allows the system to take advantage of the occurrences ai = bi. In both cases (ai = 1 or ai = 0) the carry
bit will be known immediately.
Propagation:
If we are able to localize a chain of bits ai ai+1 ... ai+p and bi bi+1 ... bi+p for which ak ≠ bk for k in [i, i+p], then the output
carry bit of this chain will be equal to the input carry bit of the chain.
These remarks constitute the principle of generation and propagation used to speed the addition of two numbers.
pi = ai XOR bi (10)
gi = ai bi (11)
The previous equations determine the ability of the ith bit to propagate carry information or to generate a carry information.
Figure-6.1: A 1-bit adder with propagation signal controlling the pass-gate
This implementation can be very efficient (20 transistors), depending on the way the XOR function is built. The propagation of the
carry is controlled by the output of the XOR gate. The generation of the carry is directly performed by the function at the bottom: when
both input signals are 1, the inverted output carry is 0.
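These two signals and the recurrence ci+1 = gi + pi·ci are the basis of every adder discussed in this chapter; a minimal sketch:

```python
# Sketch: per-bit generate/propagate signals and the carry recurrence
# c[i+1] = g[i] + p[i]*c[i] used throughout this chapter.
def pg_ripple(a_bits, b_bits, c0=0):
    c, carries, sums = c0, [c0], []
    for a, b in zip(a_bits, b_bits):      # LSB first
        p, g = a ^ b, a & b               # propagate (10), generate (11)
        sums.append(p ^ c)                # sum bit needs the incoming carry
        c = g | (p & c)                   # outgoing carry
        carries.append(c)
    return sums, carries

# 7 + 5: a = 111, b = 101 (LSB first below)
print(pg_ripple([1, 1, 1], [1, 0, 1]))
# ([0, 0, 1], [0, 1, 1, 1]): sum bits 001 with carry-out 1 -> 1100 = 12
```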
In the schematic of Figure 6.1, the carry passes through a complete transmission gate. If the carry path is precharged to VDD, the
transmission gate is then reduced to a simple NMOS transistor. In the same way, the PMOS transistors of the carry generation circuit
can be removed. One gets a Manchester cell.
The Manchester cell is very fast, but a large set of such cascaded cells would be slow. This is due to the distributed RC effect and the
body effect making the propagation time grow with the square of the number of cells. Practically, an inverter is added every four cells,
like in Figure 6.3.
The full adder is the generic cell, used not only to perform addition but also arithmetic multiplication, division and filtering operations.
In this part we will analyse its equations and give some implementations with layout examples.
The adder cell receives two operands ai and bi, and an incoming carry ci. It computes the sum and the outgoing carry ci+1.
si = ai XOR bi XOR ci
ci+1 = ai . bi + ai . ci + bi . ci = ai . bi + (ai + bi) . ci
ci+1 = pi . ci + gi
Figure-6.4: The full adder (FA) and half adder (HA) cells
where pi = ai XOR bi and gi = ai . bi, as defined in equations (10) and (11).
These equations can be directly translated into N and P nets of transistors, leading to the following schematics. The main disadvantage
of this implementation is that there is no regularity in the nets.
The dual form of each equation described previously can be written in the same manner as the normal form:
(16)
(18)
(19)
Figure-6.6: Symmetrical implementation due to the dual expressions of ci and si.
The following Figure 6.7 shows different physical layouts in different technologies. The size, the technology and the performance of
each cell are summarized in Table 6.6.
The operands of addition are the addend and the augend. The addend is added to the augend to form the sum. In most computers, the
augmented operand (the augend) is replaced by the sum, whereas the addend is unchanged. High-speed adders are needed not only for addition
but also for subtraction, multiplication and division. The speed of a digital processor depends heavily on the speed of its adders. The adders
add vectors of bits, and the principal problem is to speed up the carry signal. A traditional and non-optimized four-bit adder can be made
by connecting generic one-bit adder cells one to the other: this is the ripple carry adder. In this case, the sum resulting at each
stage needs to wait for the incoming carry signal to perform the sum operation. The carry propagation can be sped up in two ways. The
first (and most obvious) way is to use a faster logic circuit technology. The second way is to generate carries by means of forecasting
logic that does not rely on the carry signal being rippled from stage to stage of the adder.
Generally, the size of an adder is determined according to the type of operations required, the precision, or the time allowed to
perform the operation. Since the operands have a fixed size, it becomes important to determine whether or not an overflow has
occurred.
Overflow: An overflow can be detected in two ways. First, an overflow has occurred when the sign of the sum does not agree with the
signs of the operands, while the signs of the two operands are the same. In an n-bit adder, overflow can be defined as:
(22)
Secondly, if the carry out of the high order numeric (magnitude) position of the sum and the carry out of the sign position of the sum
agree, the sum is satisfactory; if they disagree, an overflow has occurred. Thus,
(23)
A parallel adder adds two operands, including the sign bits. An overflow from the magnitude part will tend to change the sign of the sum,
so that an erroneous sign will be produced. The following Table 6.7 summarizes the overflow detection.
Coming back to the acceleration of the computation, two major classes of techniques are used: speed-up techniques (Carry Skip and Carry Select)
and anticipation techniques (Carry Look Ahead, Brent and Kung, and C3i). Finally, a combination of these techniques can prove to be an
optimum for large adders.
Depending on the position at which a carry signal has been generated, the propagation time can be variable. In the best case, when there
is no carry generation, the addition time will only take into account the time to propagate the carry signal. Figure 6.9 is an example
illustrating a carry signal generated twice, with the input carry being equal to 0. In this case three simultaneous carry propagations occur.
The longest is the second, which takes 7 cell delays (it starts at the 4th position and ends at the 11th position). So the addition time of
these two numbers with this 16-bit Ripple Carry Adder is 7.k + k', where k is the delay of one cell and k' is the time needed to compute the
11th sum bit using the 11th carry-in.
With a Ripple Carry Adder, if the input bits Ai and Bi are different for all positions i, then the carry signal is propagated at all positions
(and never generated), and the addition is completed only when the carry signal has propagated through the whole adder. In this case, the
Ripple Carry Adder is as slow as it is large. Actually, Ripple Carry Adders are fast only for some configurations of the input words,
where carry signals are generated at some positions.
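This data dependence is easy to demonstrate: the sketch below measures the longest chain of propagating positions for a given input pair (the example operands are arbitrary).

```python
# Sketch: the ripple-carry completion time is set by the longest run of
# propagating positions (Ai != Bi) that a moving carry must traverse.
def longest_carry_chain(a, b, width=16):
    longest = run = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai != bi:                  # propagate position: the chain grows
            run += 1
            longest = max(longest, run)
        else:                         # generate or kill: the chain restarts
            run = 0
    return longest

print(longest_carry_chain(0b0101010101010101, 0b1010101010101010))  # 16 (worst case)
print(longest_carry_chain(0x00FF, 0x0001))                          # 7
```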
Carry Skip Adders take advantage of both the generation and the propagation of the carry signal. They are divided into blocks, where a
special circuit quickly detects if all the bits to be added are different (Pi = 1 everywhere in the block). The signal produced by this circuit is
called the block propagation signal. If the carry is propagated at all positions in the block, then the carry signal entering the block can
directly bypass it, and is transmitted through a multiplexer to the next block. As soon as the carry signal is transmitted to a block, it
starts to propagate through the block, as if it had been generated at the beginning of the block. Figure 6.10 shows the structure of a 24-bit
Carry Skip Adder, divided into 4 blocks.
Figure-6.10: The "domino" behaviour of the carry propagation and generation signals
To summarize: if in a block Ai ≠ Bi at every position, then the carry signal skips over the block. If they are equal at some position, a
carry signal is generated inside the block, which must complete its internal computation before giving the carry information to the next block.
It now becomes obvious that there exists a trade-off between the speed and the size of the blocks. In this part we analyse the division of
the adder into blocks of equal size. Let us denote by k1 the time needed by the carry signal to propagate through an adder cell, and by k2 the
time it needs to skip over one block. Suppose the N-bit Carry Skip Adder is divided into M blocks, and each block contains P adder cells.
The actual addition time of a Ripple Carry Adder depends on the configuration of the input words. The completion time may be small
but it also may reach the worst case, when all adder cells propagate the carry signal. In the same way, we must evaluate the worst carry
propagation time for the Carry Skip Adder. The worst case of carry propagation is depicted in Figure 6.11.
Figure-6.11: Worst case for the propagation signal in a Carry Skip adder with blocks of equal size
The configuration of the input words is such that a carry signal is generated at the beginning of the first block. Then this carry signal is
propagated by all the succeeding adder cells but the last which generates another carry signal. In the first and the last block the block
propagation signal is equal to 0, so the entering carry signal is not transmitted to the next block. Consequently, in the first block, the last
adder cells must wait for the carry signal, which comes from the first cell of the first block. When going out of the first block, the carry
signal is distributed to the 2nd, 3rd and last block, where it propagates. In these blocks, the carry signals propagate almost simultaneously
(we must account for the multiplexer delays). Any other situation leads to a better case. Suppose for instance that the 2nd block does not
propagate the carry signal (its block propagation signal is equal to zero), then it means that a carry signal is generated inside. This carry
signal starts to propagate as soon as the input bits are settled. In other words, at the beginning of the addition, there exist two sources of
carry signals, and the paths of these carry signals are shorter than the carry path of the worst case. Let us now assume that the total adder is
made of N adder cells, organized as M blocks of P adder cells each. The total number of adder cells is then
N=M.P (24)
The time T needed by the carry signal to propagate through P adder cells is
T=k1.P (25)
The time T' needed by the carry signal to skip through M adder blocks is
T'=k2.M (26)
The problem to solve is to minimize the worst case delay which is:
(27)
(28)
(29)
(30)
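Since Eqs. (27)-(30) are not reproduced above, the sketch below assumes the commonly used worst-case model T = 2(P-1)·k1 + (M-2)·k2 (ripple through the first and last blocks, skip over the middle ones) and searches for the block size that minimizes it; the k1, k2 values are hypothetical.

```python
# Sketch: worst-case delay of an N-bit carry-skip adder with equal blocks,
# assuming T = 2*(P-1)*k1 + (M-2)*k2 as the worst-case model (an assumption
# standing in for Eqs. (27)-(30), which are not reproduced in the text).
import math

def skip_delay(N, P, k1, k2):
    M = N // P
    return 2 * (P - 1) * k1 + (M - 2) * k2

N, k1, k2 = 64, 1.0, 0.5
candidates = [P for P in range(2, N + 1) if N % P == 0 and N // P >= 2]
best = min(candidates, key=lambda P: skip_delay(N, P, k1, k2))
print("optimal P =", best, " analytic ~", math.sqrt(N * k2 / (2 * k1)))
```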
Let us formalize the problem as a geometric one. A square will represent the generic full adder cell. These cells will be grouped in P
groups (in a column-like manner).
L(1), L(2), ..., L(P) are the P adjacent columns (see Figure 6.12).
If a carry signal is generated at the ith section, this carry skips j-i-1 sections and disappears at the end of the jth section. So the delay of
propagation is:
(31)
(32)
The constant a is equivalent to the slope of the two straight lines defined by equations (33)
and (34). These straight lines are adjacent to the tops of the columns, and the maximum time can be expressed as a geometrical distance y,
equal to the y-value of the intersection of the two straight lines.
(35)
(36)
because (37)
A possible implementation of a block is shown in Figure 6.14. In the precharge mode, the output of the four inverter-like structures is set
to one. In the evaluation mode, the entire block is in action, and the output will receive either c0 or the carry generated inside the
comparator cells, according to the values given to A and B. If no carry generation is needed, c0 will be transmitted to the output. In
the other case, one of the inverted pi's will switch the multiplexer to enable the other input.
This type of adder is not as fast as the Carry Look Ahead (CLA) presented in the next section. However, despite requiring more
hardware, it is based on an interesting design concept. The Carry Select principle requires two identical parallel adders that are
partitioned into four-bit groups. Each group consists of the same design as that shown in Figure 6.15, and generates a group carry.
In the carry select adder, two sums are generated simultaneously: one assumes that the carry-in is equal to one, while the other assumes
that it is equal to zero. The predicted group carry is then used to select one of the two sums.
It can be seen that the group-carry logic grows rapidly when more high-order groups are added to the total adder length. This
complexity can be decreased, with a subsequent increase in the delay, by partitioning a long adder into sections, with four groups per
section, similar to the CLA adder.
Figure-6.16: The Carry Select adder. (a) The design with non-optimised use of the gates, (b) merging of the redundant gates
A possible implementation is shown in Figure 6.16, where it is possible to merge some redundant logic gates to achieve a lower
complexity and a higher density.
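A behavioural sketch of the principle follows, with hypothetical 4-bit groups built from ordinary integer additions:

```python
# Sketch: a carry-select adder built from 4-bit groups. Each group computes
# two speculative (sum, carry) pairs -- one per assumed carry-in -- and the
# incoming group carry picks between them through a multiplexer.
def group_add(a, b, cin):                 # plain 4-bit ripple inside a group
    s = a + b + cin
    return s & 0xF, s >> 4

def carry_select_add(a, b, width=16):
    result, carry = 0, 0
    for shift in range(0, width, 4):
        ga, gb = (a >> shift) & 0xF, (b >> shift) & 0xF
        s0, c0 = group_add(ga, gb, 0)     # speculative: carry-in = 0
        s1, c1 = group_add(ga, gb, 1)     # speculative: carry-in = 1
        s, carry = (s1, c1) if carry else (s0, c0)   # the "multiplexer"
        result |= s << shift
    return result, carry

print(carry_select_add(0xBEEF, 0x1234))   # (0xD123, 0)
```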
The limitation of the sequential method of forming carries, especially in the Ripple Carry adder, arises from specifying ci as a specific
function of ci-1. It is possible to express a carry as a function of all the preceding lower-order carries by using the recursivity of the carry
function. With the following expression a considerable increase in speed can be realized:

ci = gi + pi.gi-1 + pi.pi-1.gi-2 + ... + pi.pi-1 ... p2.g1 + pi.pi-1 ... p1.c0    (38)
Usually, the size and complexity of a big adder using this equation directly are not affordable. That is why the equation is used in a modular way,
by forming groups of carries (usually four bits wide). Such a unit generates a group carry which gives the correct predicted information to the
next block, giving time to the sum units to perform their calculation.
Figure-6.17: The Carry Generation unit performing the Carry group computation
Such a unit can be implemented in various ways, according to the allowed level of abstraction. In a CMOS process, 17 transistors are able
to guarantee the static function (Figure 6.18). However, this design requires careful sizing of the transistors put in series.
The same design is available with fewer transistors in a dynamic logic design. The sizing is still an important issue, but the number of
transistors is reduced (Figure 6.19).
Figure-6.18: Static implementation of the 4-bit carry lookahead chain
Figure-6.19: Dynamic implementation of the 4-bit carry lookahead chain
To build large adders the preceding blocks are cascaded according to Figure 6.20.
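For reference, the two-level equations realized by such a 4-bit unit can be sketched as follows (bit numbering from the LSB, which differs from the 1-based indexing used in the text):

```python
# Sketch: the 4-bit carry-lookahead equations behind the chains of
# Figs. 6.18/6.19. Every carry is a two-level function of the g's, p's and
# c0, and the block exports a group generate/propagate pair for cascading.
def cla_group(g, p, c0):
    """g, p: lists of 4 generate/propagate bits (LSB first)."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    G  = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P  = p[3] & p[2] & p[1] & p[0]
    return (c1, c2, c3), G, P       # group carry-out is G | (P & c0)
```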
The technique used to speed up the addition is to introduce a "new" operator which combines couples of generation and propagation
signals. This "new" operator comes from a reformulation of the carry chain.
Let an an-1 ... a1 and bn bn-1 ... b1 be n-bit binary numbers with sum sn+1 sn ... s1. The usual method of addition computes the si's by:
c0 = 0 (39)
ci = ai.bi + ai.ci-1 + bi.ci-1 (40)
si = ai ⊕ bi ⊕ ci-1, i = 1, ..., n (41)
sn+1 = cn (42)
where ⊕ denotes the sum modulo 2 and ci is the carry from bit position i. From the previous paragraph we can deduce that the ci's are given by:
c0 = 0 (43)
ci = gi + pi.ci-1 (44)
gi = ai.bi (45)
pi = ai ⊕ bi, for i = 1, ..., n (46)
One can read equation (44) as saying that the carry ci is either generated by ai and bi or propagated from the previous carry ci-1. The whole idea is now to generate the carries in parallel, so that the nth stage does not have to "wait" for the (n-1)th carry bit to compute the global sum. To achieve this goal, an operator Δ is defined over (generate, propagate) pairs:
(g, p) Δ (g', p') = (g + p.g', p.p') (47)
and the pairs (Gi, Pi) are formed as (G1, P1) = (g1, p1) and (Gi, Pi) = (gi, pi) Δ (Gi-1, Pi-1) for i > 1 (48). Lemma 1 states that ci = Gi for all i (49), which is shown by induction:
c1 = g1 + p1.0 = g1 = G1 (50)
so the result holds for i = 1. If i > 1 and ci-1 = Gi-1, then
(Gi, Pi) = (gi, pi) Δ (Gi-1, Pi-1) (51)
(Gi, Pi) = (gi, pi) Δ (ci-1, Pi-1) (52)
(Gi, Pi) = (gi + pi.ci-1, pi.Pi-1) (53)
thus Gi = gi + pi.ci-1, which is ci by (44). (54)
Lemma 2 states that the operator Δ is associative.
Proof: For any (g3, p3), (g2, p2), (g1, p1) we have:
[(g3, p3) Δ (g2, p2)] Δ (g1, p1) = (g3 + p3.g2, p3.p2) Δ (g1, p1)
= (g3 + p3.g2 + p3.p2.g1, p3.p2.p1) (55)
and,
(g3, p3) Δ [(g2, p2) Δ (g1, p1)] = (g3, p3) Δ (g2 + p2.g1, p2.p1)
= (g3 + p3.(g2 + p2.g1), p3.p2.p1) (56)
One can check that the expressions (55) and (56) are equal using the distributivity of . over +.
To compute the ci’s it is only necessary to compute all the (Gi, Pi)’s but by Lemmas 1 and 2,
(Gi, Pi) = (gi, pi) (gi-1, pi-1) .... (g1, p1) (57)
can be evaluated in any order from the given gi’s and pi’s. The motivation for introducing the operator Delta is to generate the carry’s in
parallel. The carry’s will be generated in a block or carry chain block, and the sum will be obtained directly from all the carry’s and pi’s
since we use the fact that:
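As a minimal sketch of this reformulation, the following Python fragment implements the Δ operator of (47) and evaluates the pairs (Gi, Pi) of (57) by a plain left-to-right reduction; since Δ is associative, a hardware implementation is free to use a tree-shaped evaluation order instead. All names are illustrative.

def delta(gp2, gp1):
    """(g, p) Δ (g', p') = (g + p.g', p.p') -- the operator of (47)."""
    g, p = gp2
    g1, p1 = gp1
    return (g | (p & g1), p & p1)

def carries(a_bits, b_bits):
    """Return [c1..cn] via (Gi, Pi) = (gi, pi) Δ ... Δ (g1, p1), eq. (57)."""
    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    result, acc = [], None
    for pair in gp:                        # left-to-right; any order is valid
        acc = pair if acc is None else delta(pair, acc)
        result.append(acc[0])              # ci = Gi, since c0 = 0
    return result

a, b = [1, 1, 0, 1], [1, 0, 1, 0]          # 11 and 5, LSB first
c = carries(a, b)
s = [ai ^ bi ^ ci for ai, bi, ci in zip(a, b, [0] + c[:-1])]  # si = pi xor ci-1
print(s, c[-1])                            # [0, 0, 0, 0] and 1: 11 + 5 = 16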
THE ADDER
Based on the previous reformulation of the carry computation, Brent and Kung have proposed a scheme to add two n-bit numbers in a time proportional to log(n) and in an area proportional to n.log(n), for n ≥ 2. Figure 6.21 shows how the carries are computed in parallel for 16-bit numbers.
Figure-6.21: The first binary tree allowing the calculation of c1, c2, c4, c8, c16.
Using this binary tree approach, only the ci's where i = 2^k (k = 0, 1, ..., log2 n) are computed. The missing ci's have to be computed using another tree structure, but this time the root of the tree is inverted (see Figure 6.22).
In Figure 6.21 and Figure 6.22 the squares represent a cell which performs equation (47). Circles represent a duplication cell, in which the input is split into two distinct wires (see Figure 6.23).
When using this structure of two separate binary trees, the addition of two 16-bit numbers is performed in T = 9 stages of cells: all the carries are computed in the time necessary to traverse the two independent binary trees.
According to Burks, Goldstine and Von Neumann, the fastest way to add two operands takes a time proportional to the logarithm of the number of bits. Brent and Kung have achieved such a result.
Figure-6.23: (a) The Δ cell, (b) the duplication cell
THE ALGORITHM
Let ai and bi be the digits of A and B, two n-bit numbers, with i = 1, 2, ..., n. The carries will be computed according to (59),
(59)
with Gi = ci (60)
(61)
and by introducing a parameter m ≤ n, such that there exists q in IN with n = q.m, it is possible to obtain the couple (Gi, Pi) by forming groups of m cells performing the intermediate operations detailed in (62) and (63).
(62)
(63)
This manner of computing the carries is strictly based on the fact that the operator Δ is associative. It also shows that the calculation is performed sequentially, i.e. in a time proportional to the number of bits n. We will now illustrate this analytical approach by giving a graphical method to place the cells defined in the previous paragraphs, leading to an architectural layout of this new algorithm [Kowa92].
This complete approach is illustrated in Figure 6.24, where all the steps can be followed. The only cells still needed for this carry generation block to constitute a real parallel adder are the cells performing equations (45) and (46). The first row of functions is placed at the top of the structure; the second one is placed at the bottom.
Figure-6.24: (a) Step1, (b) Step2, (c) Step3 and Step4, (d) Step5
At this point of the definition, two remarks have to be made about this algorithm. Both concern the parameter m used to define it. Remark 1 covers the case where m is not equal to 2^q (q = 0, 1, ...), while Remark 2 deals with the case where m = n.
Figure-6.25: Adder where m=6. The fan-out of the 11th carry bit is highlighted.
Remark 1: For m not a power of two, the algorithm is built the same way up to the very last step. The only difference concerns the delay, which will be equal to that obtained for the next nearest power of two. This means that there is little interest in building such versions of these adders. The fan-out of certain cells even increases to three, so the electrical behaviour is degraded. Figure 6.25 illustrates the design of such an adder for m=6. The fan-out of the cell of bit 11 is three, and the delay of this adder is equivalent to the delay of an adder with a duplication with m=8.
Remark 2: For m equal to the number of bits of the adder, the algorithm reaches the theoretical limit demonstrated by Burks, Goldstine and Von Neumann. The logarithmic time is attained using the depth of a single binary tree instead of two as in the Brent and Kung case. This particular case is illustrated in Figure 6.26. The definition of the algorithm is followed up to Step 3. Once the binary tree has been reproduced m times to the right, the only thing left to do is to remove the cells at the negative bit positions, and the adder is finished. Mathematically, one can notice that this is the limit. We will discuss later whether it is the best way to build an adder using m=n.
Figure-6.26: Adder where m=n. This constitutes the theoretical limit for the computation of the addition.
COMPARISONS
In this section, we develop a comparison between adders obtained using the new algorithm with different values of m. In the plots of Figure 6.27 through Figure 6.29, the suffixes JK2, JK4 and JK8 denote the adders obtained for m equal to two, four or eight. They are compared to the Brent and Kung implementation and to the theoretical limit, which is obtained when m equals n, the number of bits.
The comparison between these architectures is done according to the formalisation of a computational model described in [Kowa93]. We clearly see that BK's algorithm performs the addition with a delay proportional to the logarithm of the number of bits, whereas JK2 performs the addition in linear time, just as JK4 and JK8 do. The parameter m influences the slope of the delay: the higher m is, the longer the delay stays below the logarithmic delay of BK. When one wants to implement an addition faster than BK, a choice therefore has to be made among the different values of m. The choice will depend on the size of the adder, because it is evident that a 24-bit JK2 adder (delay = 11 stages of cells) performs worse than BK (delay = 7 stages of cells).
On the other hand, JK8 (delay = 5 stages of cells) is very attractive. Its delay is better than that of BK up to 57 bits, where both delays become equal. Furthermore, even at equal delays (up to 73 bits) our implementation performs better in terms of regularity, modularity and ease of construction. The strong advantage of this new algorithm compared to BK is that, for an input word size which is not a power of two, the design of the cells is much easier. There is no partial binary tree to build: adding a bit to the adder means adding a bit-slice, and this bit-slice is very compact and regular. Let us now consider the case where m equals n (denoted by XXX in our figures). The delay of such an adder is exactly one half of BK's, and it is the lowest bound we obtain. For small adders (n < 16), the delay of BK is very close to that of XXX, and it can be shown that the delays (always in terms of stages) of JK2, JK4 and JK8 are always at least equal to that of XXX.
This discussion took into account the two following characteristics of the computational model:
The conclusion of this discussion is that m has to be chosen as high as possible to reduce the global delay. When we turn to the comparisons concerning the area, we will take into account the following characteristics of our computational model:
For this discussion let us consider Figure 6.28, where the area of the different adders is plotted versus the number of bits. It is obvious that for the smallest m the area is the smallest as well. As m increases up to n, the area remains proportional to the number of bits, following a straight line. For m equal to n, the area is exactly one half of the BK area, with a linear variation. The slope of this variation, in both the BK and XXX cases, varies according to the intervals [2^q, 2^(q+1)], q = 0, 1, ...
Here we could point out that the floorplan of BK could be optimised to become comparable to that of XXX, but the cost of such an implementation would be very high because of the irregularity of the wiring and the interconnections. These considerations lead us to the following conclusion: to minimise the area of a new adder, m must be chosen low. This contradicts the previous conclusion; that is why a very wise choice of m is necessary, and it will always depend on the targeted application. Finally, Figure 6.27 gives the number of transistors used to implement our different versions of the adders. These calculations are based on the dynamic logic family (TSPC: True Single Phase Clocking) described in [Kowa93]. When considering this graph, we see that BK and XXX are the two limits of our family of adders. BK uses the smallest number of transistors, whereas XXX uses up to five times more; the higher m is, the higher the number of transistors.
Nevertheless, we see that the area is smaller than that of BK. A high density is an advantage, but an overhead in transistors can lead to higher power dissipation. This evident drawback of our algorithm is counterbalanced by the progress being made in the VLSI field. With the shrinking of the design rules, the size of the transistors decreases, as does the size of the interconnections, which leads to smaller power dissipation. This effect is even more pronounced as technologies move the power supply from 5 V down to 3.3 V. In other words, the increase in the number of transistors corresponds to the redundancy we introduce in the calculations to decrease the delay of our adders.
Now we will discuss an important characteristic of our computational model that differs from the model of Brent and Kung:
This assumption is very important, as we now show with an example. Let us consider the 16-bit BK adder (Figure 6.22) and the 16-bit JK4 adder (Figure 6.24). The longest wire in the BK implementation will be equal to at least eight widths of Δ-cells, whereas in the JK4 implementation the longest wire will be equal to four widths of Δ-cells. For BK, the output capacitive load of a Δ-cell is variable, and a variable sizing of the cell is necessary. In our case, the parameter m defines a fixed library of Δ-cells used in the adder. The capacitive load is always limited to a fixed value, allowing all cells to be sized identically.
Figure-6.29: Delay in number of stages versus the number of bits in the adder
To partially conclude this section, we note that an optimum must be defined when choosing to implement our algorithm. This optimum will depend on the application in which the operator is to be used.
The goal here is to add more than two operands at a time, as generally occurs in multiplication or filtering operations.
For this purpose, Wallace trees were introduced. The addition time grows like the logarithm of the number of bits. The simplest Wallace tree is the adder cell. More generally, an n-input Wallace tree is an operator with n inputs and log2(n) outputs, such that the value of the output word equals the number of "1"s in the input word. The input bits and the least significant bit of the output have the same weight (Figure 6.30). An important property of Wallace trees is that they may be constructed using adder cells, and the number of adder-cell stages, and hence the delay, grows like log2(n). Consequently, Wallace trees are useful whenever a large number of operands are to be added, as in multipliers. In a Braun or Baugh-Wooley multiplier with a Ripple Carry Adder, the completion time of the multiplication is proportional to twice the number n of bits. If the collection of the partial products is made through Wallace trees, the time needed to obtain the result in carry save notation is proportional to log2(n).
Figure 6.31 represents a 7-input adder: for each weight, Wallace trees are used until only two bits of each weight remain, which are then added using a classical 2-input adder. When taking the regularity of the interconnections into account, however, Wallace trees are the most irregular structures.
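As a minimal sketch of this column-reduction principle, the following Python fragment repeatedly groups three bits of equal weight into a full adder (one sum bit of the same weight, one carry bit of the next weight) until at most two bits of each weight remain; the data structure and names are ours, not from the text.

def wallace_reduce(columns):
    """columns[w] is a list of 0/1 bits of weight 2**w."""
    while any(len(col) > 2 for col in columns):
        new_cols = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            while len(col) >= 3:                 # one full adder
                x, y, z = col.pop(), col.pop(), col.pop()
                new_cols[w].append(x ^ y ^ z)    # sum, same weight
                new_cols[w + 1].append((x & y) | (x & z) | (y & z))  # carry
            new_cols[w].extend(col)              # pass leftovers through
        columns = new_cols
    return columns

cols = [[1, 1, 0, 1, 1, 1, 0]]                   # 7 input bits of weight 1
reduced = wallace_reduce(cols)
total = sum(sum(col) << w for w, col in enumerate(reduced))
print(total)                                     # -> 5, resolved by a final adder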
To circumvent this irregularity, Mou [Mou91] proposes an alternative way to build multi-operand adders. The method uses basic cells called branch, connector or root. These basic elements (see Figure 6.32) are connected together to form n-input trees. One has to take care about the weights of the inputs, because in this case the weights at the input of the 18-input OS tree are different. The regularity of this structure is better than with Wallace trees, but the construction of multipliers remains complex.
6.7 Multiplication
6.7.1 Introduction
Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times it is added is the multiplier, and the result is the product. Each step of the addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands, in order to preserve the information content. The repeated addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional number representation.
It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. As for adders, it is possible to enhance the intrinsic performance of multipliers. Acting on the generation part, the Booth (or modified Booth) algorithm is often used because it reduces the number of partial products. The collection of the partial products can then be made using a regular array, a Wallace tree or a binary tree [Sinh89].
This algorithm is a powerful direct algorithm for signed-number multiplication. It generates a 2n-bit product and treats positive and negative numbers uniformly. The idea is to reduce the number of additions to be performed: the Booth algorithm requires only n/2 additions in the best case, whereas the modified Booth algorithm always requires n/2 additions. The recoding rests on the identity
2^(i+k) - 2^i = 2^(i+k-1) + 2^(i+k-2) + ... + 2^(i+1) + 2^i
In fact, the modified Booth algorithm converts a signed number from the standard 2's-complement representation into a number system where the digits are in the set {-1, 0, 1}. In this number system any number may be written in several forms, so the system is called redundant.
The coding table for the modified Booth algorithm is given in Table 6.8. The algorithm scans strings composed of three digits; depending on the value of the string, a certain operation is performed.
A possible implementation of the Booth encoder is given in Figure 6.35. The layout of another possible structure is given in Figure 6.36.
Yi+1 (2^1)   Yi (2^0)   Yi-1 (2^-1)   OPERATION                                          M multiplied by
0            0          0             add zero (no string)                               +0
0            0          1             add the multiplicand (end of string)               +X
0            1          0             add the multiplicand (a string)                    +X
0            1          1             add twice the multiplicand (end of string)         +2X
1            0          0             subtract twice the multiplicand (beg. of string)   -2X
1            0          1             subtract the multiplicand (-2X and +X)             -X
1            1          0             subtract the multiplicand (beg. of string)         -X
1            1          1             subtract zero (center of string)                   -0
Table-6.8: Modified Booth coding table.
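A minimal Python sketch of this recoding is given below. The 3-bit strings (Yi+1, Yi, Yi-1) are scanned with i stepping by two, with y-1 = 0 and the multiplier sign-extended at the top, following Table 6.8; the function and its conventions are illustrative, not a description of the circuits of Figures 6.35 and 6.36.

CODE = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
        (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}

def booth_multiply(x, y, n):
    """Multiply x by an n-bit two's-complement pattern y (n even)."""
    bits = [(y >> i) & 1 for i in range(n)] + [(y >> (n - 1)) & 1]
    bits = [0] + bits                      # prepend y[-1] = 0
    product = 0
    for k in range(0, n, 2):               # only n/2 partial products
        string = (bits[k + 2], bits[k + 1], bits[k])
        product += CODE[string] * x << k   # add/subtract 0, X or 2X
    return product

print(booth_multiply(7, -3 & 0xF, 4))      # 7 * (-3) -> -21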
This multiplier is the simplest one: the multiplication is considered as a succession of additions. If A = (an an-1 ... a0) and B = (bn bn-1 ... b0), the product is obtained by accumulating the shifted partial product A.2^k for every bit bk equal to one.
The structure of Figure 6.37 is suited only for positive operands. If the operands are negative and coded in 2's complement (see the sketch after this list):
1. The most significant bit of B has a negative weight, so a subtraction has to be performed at the last step.
2. The operand A.2^k must be written on 2N bits, so the most significant bit of A must be duplicated (sign extension). It may be easier to shift the content of the accumulator to the right instead of shifting A to the left.
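A rough Python sketch of the resulting scheme, with the two corrections above applied (sign extension of A and subtraction of the last, negatively weighted partial product), is shown below; the helper names are ours.

def shift_add_multiply(a, b, n):
    """a, b hold n-bit two's-complement patterns as unsigned ints."""
    def to_signed(v, bits):
        return v - (1 << bits) if v & (1 << (bits - 1)) else v
    a_s = to_signed(a & ((1 << n) - 1), n)        # A with its sign extended
    acc = 0
    for k in range(n):
        if (b >> k) & 1:
            term = a_s << k                       # partial product A.2^k
            acc += -term if k == n - 1 else term  # MSB of B weighs -2^(n-1)
    return acc

print(shift_add_multiply(0b1101, 0b0011, 4))      # (-3) * 3 -> -9
print(shift_add_multiply(0b0011, 0b1101, 4))      # 3 * (-3) -> -9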
The simplest parallel multiplier is the Braun array. All the partial products A.bk are computed in parallel, then collected through a cascade of Carry Save Adders. At the bottom of the array, the output is in carry save form, so an additional adder converts it (by means of a carry propagation) into the classical notation (Figure 6.38). The completion time is limited by the depth of the carry save array and by the carry propagation in the final adder. Note that this multiplier is only suited for positive operands; negative operands may be multiplied using a Baugh-Wooley multiplier.
Figure-6.38: A 4-bit Braun Multiplier without the final adder
Figure 6.38 and Figure 6.40 use the symbols given in Figure 6.39, where CMUL1 and CMUL2 are two generic cells consisting of an adder without the final inverter and with one input connected to an AND or NAND gate. A non-optimised (in terms of transistors) multiplier would consist only of adder cells connected to one another, with AND gates generating the partial products. In these examples, the inverters at the output of the adders have been eliminated, and the parity of the bits has been compensated by the use of CMUL1 or CMUL2.
This technique has been developed in order to design regular multipliers, suited for 2’s-complement numbers.
(64), (65)
(66)
We see that subtractor cells must be used. In order to use only adder cells, the negative terms may be rewritten as:
(67)
(68)
(69)
because:
(70)
A and B are n-bit operands, so their product is a 2n-bit number. Consequently, the most significant weight is 2^(2n-1), and the first term -2^(2n-1) is taken into account by adding a 1 in the most significant cell of the multiplier.
Figure-6.41: A 4-bit Baugh-Wooley Multiplier with the final adder
The advantage of this method is the higher regularity of the array. Signed integers can be processed. The cost for this regularity is the
addition of an extra column of adders.
Figure-6.42: A 4-bit Baugh-Wooley Multiplier with the final adder
In Figure 6.43 the scheme using OS-trees is applied to a 4-bit multiplier. The partial product generation is done according to Dadda multiplication. Figure 6.44 represents the OS-tree structure used in a 16-bit multiplier. Although the author claims a better regularity, this scheme does not allow easy pipelining.
Figure-6.44: A 16-bit OS-tree Multiplier without a final adder and without the partial product cells
The objective of this circuit is to compute the product of two terms. The property used is the following equation:
Log(A * B) = Log(A) + Log(B) (71)
There are several ways to obtain the logarithm of a number: look-up tables, recursive algorithms, or the segmentation of the logarithmic curve [Hoef91]. The segmentation method: the basic idea is to approximate the logarithm curve with a set of linear segments. If
y = Log2(x) (72)
an approximation of this value on the segment [2^n, 2^(n+1)] can be made using the following equation:
y = a.x + b = (Δy/Δx).x + b = [1/(2^(n+1) - 2^n)].x + (n - 1) = 2^(-n).x + (n - 1) (73)
If we take xi = (xi7, xi6, xi5, xi4, xi3, xi2, xi1, xi0), an integer coded with 8 bits, its logarithm is obtained as follows: the integer part of the logarithm is the position n where the MSB occurs, and the fractional part consists of the bits of xi below the MSB. For instance, if xi is (0,0,1,0,1,1,1,0) = 46, the integer part of the logarithm is 5 because the MSB is xi5, and the fractional part is 01110. So the logarithm of xi equals 101.01110 = 5.4375, because 01110 is 14 out of a possible 32, and 14/32 = 0.4375.
Table 9 illustrates this coding. Once the coding of the two input words has been performed, the two logarithms are added. The last operation to be performed is the antilogarithm of the sum, which gives the value of the final product.
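A minimal Python sketch of the complete flow, under the piecewise-linear approximation of (73) and without any correction term, is given below; the function names are illustrative.

def approx_log2(x):
    n = x.bit_length() - 1            # integer part: position of the MSB
    frac = (x - (1 << n)) / (1 << n)  # fractional part: the bits below the MSB
    return n + frac

def approx_antilog2(y):
    n = int(y)
    return (1 << n) * (1 + (y - n))   # linear segment between 2^n and 2^(n+1)

def log_multiply(a, b):
    return approx_antilog2(approx_log2(a) + approx_log2(b))

a, b = 46, 17
approx, exact = log_multiply(a, b), a * b
print(approx, exact, abs(approx - exact) / exact)  # error below the 11.6% bound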
Using this method, a maximum error of 11.6% on the product of two binary operands (i.e. the sum of two logarithmic numbers) occurs. We would like to reduce this error without increasing the complexity of the operation or of the operator. Since the transformations used in this system are logarithms and antilogarithms, it is natural to expect the complexity of the correction system to grow exponentially as the error approaches zero. We analyze the error to derive an easy and effective way to increase the accuracy of the result. Figure 6.45 describes the architecture of the logarithmic multiplier with the different variables used in the system.
Error analysis: let us define the different functions used in this system. The logarithm and antilogarithm curves are approximated by linear segments. The segments start at power-of-two values and end at the next power-of-two value. Figure 6.46 shows how a logarithm is approximated; the same holds for the antilogarithm.
Figure-6.46: Approximated value of the logarithm compared to the exact logarithm
By adding the single constant value 17.2^(-8) to the two logarithms, the maximum error comes down from 11.6% to 7.0%, an improvement of 40% compared with a system without any correction. The only cost is the replacement of the internal two-input adder by a three-input adder.
A more complex correction system which leads to better precision but at a much higher hardware cost is possible.
In Table 10 we suggest a system which chooses one correction among three, depending on the value of the input bits. Table 10 can be read as the values of the logarithms obtained after the coder for either a1 or a2. The penultimate column represents the ideal correction which should be added to get 100% accuracy. The last column gives the correction chosen among three possibilities: 32, 16 or 0 (in units of 2^(-8)).
Three decoding functions have to be implemented for this proposal. If the exclusive-OR of a-2 and a-3 is true, then the added value is 32.2^(-8). If all the bits of the fractional part are zero, then the added value is zero. In all other cases the added value is 16.2^(-8).
This decreases the average error. The drawback is that the maximum error is minimized only if the steps between two ideal corrections are larger than the unity step. To minimize the maximum error, the correcting functions should increase in an exponential way. Further research could be performed in this area.
Group theory is used to introduce another algebraic system, called a field. A field is a set of elements in which we can do addition, subtraction, multiplication and division without leaving the set. Addition and multiplication must satisfy the commutative, associative and distributive laws. A formal definition of a field is given below.
Definition
Let F be a set of elements on which two binary operations, called addition "+" and multiplication ".", are defined. The set F together with the two binary operations + and . is a field if the following conditions are satisfied:
1. F is a commutative group under addition +. The identity element with respect to addition is called the zero element or the additive identity of F and is denoted by 0.
2. The set of nonzero elements in F is a commutative group under multiplication ".". The identity element with respect to multiplication is called the unit element or the multiplicative identity of F and is denoted by 1.
3. Multiplication is distributive over addition; that is, for any three elements a, b, c in F:
a . ( b + c ) = a . b + a . c
Let us consider the set {0,1} together with modulo-2 addition and multiplication. We can easily check that {0,1} is a field of two elements under modulo-2 addition and modulo-2 multiplication. This field is called a binary field and is denoted by GF(2). The binary field GF(2) plays an important role in coding theory [Rao74] and is widely used in digital computers and data transmission or storage systems.
Another example, using the residue number system [Garn59], is given below. Table 11 represents the values of N from 0 to 29 with their representation according to the residues of the base (5, 3, 2). The addition and multiplication of two terms in this base can be performed according to the following example:
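For instance, the following Python sketch performs addition and multiplication digit-wise in the base (5, 3, 2) of Table 6.11; the function names are ours.

BASE = (5, 3, 2)

def to_residue(n):
    return tuple(n % m for m in BASE)

def rns_add(x, y):
    return tuple((a + b) % m for a, b, m in zip(x, y, BASE))  # no carry between digits

def rns_mul(x, y):
    return tuple((a * b) % m for a, b, m in zip(x, y, BASE))

# 7 -> (2, 1, 1) and 4 -> (4, 1, 0); the results stay within 0..29
print(rns_add(to_residue(7), to_residue(4)), to_residue(11))  # both (1, 2, 1)
print(rns_mul(to_residue(7), to_residue(4)), to_residue(28))  # both (3, 1, 0)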
Table-6.11: N varying from 0 to 29 and its representation in the residue number system
The most interesting property of these systems is that there is no carry propagation inside the set, which can be attractive when implementing these operators in VLSI.
References
[Aviz61] A. Avizienis, Signed-Digit Number Representations for Fast Parallel Arithmetic, IRE Trans. Electron. Comput., Vol. EC-10, pp. 389-400, 1961.
[Cava83] J. J. F. Cavanagh, Digital Computer Arithmetic, McGraw-Hill Computer Science Series, 1983.
[Garn59] H. L. Garner, The Residue Number System, IRE Trans. Electron. Comput., pp. 140-147, September 1959.
[Hoef91] B. Hoefflinger, M. Selzer and F. Warkowski, Digital Logarithmic CMOS Multiplier for Very-High-Speed Signal Processing, in Proc. IEEE Custom Integrated Circuits Conference, 1991, pp. 16.7.1-16.7.5.
[Kowa92] J. Kowalczuk and D. Mlynek, Un Nouvel Algorithme de Génération d'Additionneurs Rapides Dédiés au Traitement d'Images, in Proc. Industrial Automation Conference, pp. 20.9-20.13, Montreal, Québec, Canada, June 1992.
[Kowa93] J. Kowalczuk, On the Design and Implementation of Algorithms for Multimedia Systems, PhD Thesis No. 1188, Swiss Federal Institute of Technology, Lausanne, 1994.
[Mull82] J. M. Muller, Arithmétique des ordinateurs, opérateurs et fonctions élémentaires, Masson, 1989.
[Rao74] T. R. N. Rao, Error Coding for Arithmetic Processors, Academic Press, New York, 1974.
[Sinh89] B. P. Sinha and P. K. Srimani, Fast Parallel Algorithms for Binary Multiplication and Their Implementation on Systolic Architectures, IEEE Trans. Computers, Vol. 38, No. 3, pp. 424-431, March 1989.
[Taka87] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa and N. Takagi, A High-Speed Multiplier Using a Redundant Binary Adder Tree, IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 1, pp. 28-34, February 1987.
Chapter 7
LOW-POWER VLSI CIRCUITS AND SYSTEMS
● Introduction
● Overview of Power Consumption
● Low-Power Design Through Voltage Scaling
● Estimation and Optimization of Switching Activity
● Reduction of Switched Capacitance
● Adiabatic Logic Circuits
7.1 Introduction
The increasing prominence of portable systems and the need to limit power consumption (and hence, heat dissipation) in very-high
density ULSI chips have led to rapid and innovative developments in low-power design in recent years. The driving forces
behind these developments are portable applications requiring low power dissipation and high throughput, such as notebook computers,
portable communication devices and personal digital assistants (PDAs). In most of these cases, the requirements of low power
consumption must be met along with equally demanding goals of high chip density and high throughput. Hence, low-power design of
digital integrated circuits has emerged as a very active and rapidly developing field of CMOS design.
The limited battery lifetime typically imposes very strict demands on the overall power consumption of the portable system. Although
new rechargeable battery types such as Nickel-Metal Hydride (NiMH) are being developed with higher energy capacity than that of the
conventional Nickel-Cadmium (NiCd) batteries, a revolutionary increase of the energy capacity is not expected in the near future. The energy density (amount of energy stored per unit weight) offered by the new battery technologies (e.g., NiMH) is about 30 watt-hours per pound, which is still low in view of the expanding applications of portable systems. Therefore, reducing the power dissipation of
integrated circuits through design improvements is a major challenge in portable systems design.
The need for low-power design is also becoming a major issue in high-performance digital systems, such as microprocessors, digital
signal processors (DSPs) and other applications. Increasing chip density and higher operating speed lead to the design of very complex
chips with high clock frequencies. Typically, the power dissipation of the chip, and thus, the temperature, increase linearly with the
clock frequency. Since the dissipated heat must be removed effectively to keep the chip temperature at an acceptable level, the cost of
packaging, cooling and heat removal becomes a significant factor. Several high-performance microprocessor chips designed in the early
1990s (e.g., Intel Pentium, DEC Alpha, PowerPC) operate at clock frequencies in the range of 100 to 300 MHz, and their typical power
consumption is between 20 and 50 W.
ULSI reliability is yet another concern which points to the need for low-power design. There is a close correlation between the peak
power dissipation of digital circuits and reliability problems such as electromigration and hot-carrier induced device degradation. Also,
the thermal stress caused by heat dissipation on chip is a major reliability concern. Consequently, the reduction of power consumption is
also crucial for reliability enhancement.
The methodologies which are used to achieve low power consumption in digital systems span a wide range, from device/process level to
algorithm level. Device characteristics (e.g., threshold voltage), device geometries and interconnect properties are significant factors in
lowering the power consumption. Circuit-level measures such as the proper choice of circuit design styles, reduction of the voltage
swing and clocking strategies can be used to reduce power dissipation at the transistor level. Architecture-level measures include smart
power management of various system blocks, utilization of pipelining and parallelism, and design of bus structures. Finally, the power
consumed by the system can be reduced by a proper selection of the data processing algorithms, specifically to minimize the number of
switching events for a given task.
In this chapter, we will primarily concentrate on the circuit- or transistor-level design measures which can be applied to reduce the
power dissipation of digital integrated circuits. Various sources of power consumption will be discussed in detail, and design strategies
will be introduced to reduce the power dissipation. The concept of adiabatic logic will be given a special emphasis since it emerges as a
very effective means for reducing the power consumption.
In the following, we will examine the various sources (components) of time-averaged power consumption in CMOS circuits. The
average power consumption in conventional CMOS digital circuits can be expressed as the sum of three main components, namely, (1)
the dynamic (switching) power consumption, (2) the short-circuit power consumption, and (3) the leakage power consumption. If the
system or chip includes circuits other than conventional CMOS gates that have continuous current paths between the power supply and
the ground, a fourth (static) power component should also be considered. We will limit our discussion to the conventional static and
dynamic CMOS logic circuits.
This component represents the power dissipated during a switching event, i.e., when the output node voltage of a CMOS logic gate
makes a power consuming transition. In digital CMOS circuits, dynamic power is dissipated when energy is drawn from the power
supply to charge up the output node capacitance. During the charge-up phase, the output node voltage typically makes a full transition
from 0 to VDD, and the energy used for the transition is relatively independent of the function performed by the circuit. To illustrate the
dynamic power dissipation during switching, consider the circuit example given in Fig. 7.1. Here, a two-input NOR gate drives two
NAND gates, through interconnection lines. The total capacitive load at the output of the NOR gate consists of (1) the output
capacitance of the gate itself, (2) the total interconnect capacitance, and (3) the input capacitances of the driven gates.
Figure-7.1: A NOR gate driving two NAND gates through interconnection lines.
The output capacitance of the gate consists mainly of the junction parasitic capacitances, which are due to the drain diffusion regions of
the MOS transistors in the circuit. The important aspect to emphasize here is that the amount of capacitance is approximately a linear
function of the junction area. Consequently, the size of the total drain diffusion area dictates the amount of parasitic capacitance. The
interconnect lines between the gates contribute to the second component of the total capacitance. The estimation of parasitic
interconnect capacitance was discussed thoroughly in Chapter 4. Note that especially in sub-micron technologies, the interconnect
capacitance can become the dominant component, compared to the transistor-related capacitances. Finally, the input capacitances are
mainly due to gate oxide capacitances of the transistors connected to the input terminal. Again, the amount of the gate oxide capacitance
is determined primarily by the gate area of each transistor.
Figure-7.2: Generic representation of a CMOS logic gate for switching power calculation
Any CMOS logic gate making an output voltage transition can thus be represented by its nMOS network, pMOS network, and the total
load capacitance connected to its output node, as seen in Fig. 7.2. The average power dissipation of the CMOS logic gate, driven by a
periodic input voltage waveform with ideally zero rise- and fall-times, can be calculated from the energy required to charge up the
output node to VDD and then discharge the total output load capacitance to ground level.
P_avg = (1/T) . [ ∫(0 to T/2) vout . C_load . (dvout/dt) dt + ∫(T/2 to T) (VDD - vout) . C_load . (-(dvout/dt)) dt ] (7.1)
Evaluating this integral yields the well-known expression for the average dynamic (switching) power consumption in CMOS logic
circuits.
P_avg = C_load . VDD^2 / T (7.2)
or
P_avg = C_load . VDD^2 . fCLK (7.3)
Note that the average switching power dissipation of a CMOS gate is essentially independent of all transistor characteristics and
transistor sizes. Hence, given an input pattern, the switching delay times have no relevance to the amount of power consumption during
the switching events as long as the output voltage swing is between 0 and VDD.
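As a numerical illustration of equation (7.3), the following Python fragment (all component values are purely illustrative) shows the quadratic effect of the supply voltage on the switching power.

def switching_power(c_load, v_dd, f_clk, a_t=1.0):
    """P = aT . C_load . VDD^2 . fCLK; aT = 1 means one full
    0-to-VDD transition per clock cycle, as assumed in (7.3)."""
    return a_t * c_load * v_dd**2 * f_clk

# 100 fF load switching at 100 MHz: halving VDD from 5 V to 2.5 V
print(switching_power(100e-15, 5.0, 100e6))    # 2.5e-04 W (250 uW)
print(switching_power(100e-15, 2.5, 100e6))    # 6.25e-05 W: a 4x reduction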
Equation (7.3) shows that the average dynamic power dissipation is proportional to the square of the power supply voltage, hence, any
reduction of VDD will significantly reduce the power consumption. Another way to limit the dynamic power dissipation of a CMOS
logic gate is to reduce the amount of switched capacitance at the output. This issue will be discussed in more detail later. First, let us
briefly examine the effect of reducing the power supply voltage VDD upon switching power consumption and dynamic performance of
the gate.
Although the reduction of power supply voltage significantly reduces the dynamic power dissipation, the inevitable design trade-off is
the increase of delay. This can be seen by examining the following propagation delay expressions for the CMOS inverter circuit.
t_PHL = [ C_load / ( k_n . (VDD - VT,n) ) ] . [ 2.VT,n / (VDD - VT,n) + ln( 4.(VDD - VT,n)/VDD - 1 ) ] (7.4)
with a symmetrical expression for t_PLH in terms of k_p and VT,p.
Assuming that the power supply voltage is being scaled down while all other variables are kept constant, it can be seen that the
propagation delay time will increase. Figure 7.3 shows the normalized variation of the delay as a function of VDD, where the threshold
voltages of the nMOS and the pMOS transistor are assumed to be constant, VT,n = 0.8 V and VT,p = - 0.8 V, respectively. The
normalized variation of the average switching power dissipation as a function of the supply voltage is also shown on the same plot.
Figure-7.3: Normalized propagation delay and average switching power dissipation of a CMOS inverter, as a function of the power
supply voltage VDD.
Notice that the dependence of circuit speed on the power supply voltage may also influence the relationship between the dynamic power
dissipation and the supply voltage. Equation (7.3) suggests a quadratic improvement (reduction) of power consumption as the power
supply voltage is reduced. However, this interpretation assumes that the switching frequency (i.e., the number of switching events per
unit time) remains constant. If the circuit is always operated at the maximum frequency allowed by its propagation delay, on the other
hand, the number of switching events per unit time (i.e., the operating frequency) will obviously drop as the propagation delay becomes
larger with the reduction of the power supply voltage. The net result is that the dependence of switching power dissipation on the power
supply voltage becomes stronger than a simple quadratic relationship, shown in Fig. 7.3.
The analysis of switching power dissipation presented above is based on the assumption that the output node of a CMOS gate undergoes
one power-consuming transition (0-to-VDD transition) in each clock cycle. This assumption, however, is not always correct; the node
transition rate can be smaller than the clock rate, depending on the circuit topology, logic style and the input signal statistics. To better
represent this behavior, we will introduce aT (node transition factor), which is the effective number of power-consuming voltage
transitions experienced per clock cycle. Then, the average switching power consumption becomes
P_avg = aT . C_load . VDD^2 . fCLK (7.5)
The estimation of switching activity and various measures to reduce its rate will be discussed in detail in Section 7.4. Note that in most
complex CMOS logic gates, a number of internal circuit nodes also make full or partial voltage transitions during switching. Since there
is a parasitic node capacitance associated with each internal node, these internal transitions contribute to the overall power dissipation of
the circuit. In fact, an internal node may undergo several transitions while the output node voltage of the circuit remains unchanged, as
illustrated in Fig. 7.4.
Figure-7.4: Switching of the internal node in a two-input NOR gate results in dynamic power dissipation even if the output node
voltage remains unchanged.
In the most general case, the internal node voltage transitions can also be partial transitions, i.e., the node voltage swing may be only Vi
which is smaller than the full voltage swing of VDD. Taking this possibility into account, the generalized expression for the average
switching power dissipation can be written as
P_avg = VDD . fCLK . Σi ( aT,i . Ci . Vi ) (7.6)
where Ci represents the parasitic capacitance associated with each node and aTi represents the corresponding node transition factor
associated with that node.
The switching power dissipation examined above is purely due to the energy required to charge up the parasitic capacitances in the
circuit, and the switching power is independent of the rise and fall times of the input signals. Yet, if a CMOS inverter (or a logic gate) is
driven with input voltage waveforms with finite rise and fall times, both the nMOS and the pMOS transistors in the circuit may conduct
simultaneously for a short amount of time during switching, forming a direct current path between the power supply and the ground, as
shown in Fig. 7.5.
The current component which passes through both the nMOS and the pMOS devices during switching does not contribute to the
charging of the capacitances in the circuit, and hence, it is called the short-circuit current component. This component is especially
prevalent if the output load capacitance is small, and/or if the input signal rise and fall times are large, as seen in Fig. 7.5. Here, the
input/output voltage waveforms and the components of the current drawn from the power supply are illustrated for a symmetrical
CMOS inverter with small capacitive load. The nMOS transistor in the circuit starts conducting when the rising input voltage exceeds
the threshold voltage VT,n. The pMOS transistor remains on until the input reaches the voltage level (VDD - |VT,p|). Thus, there is a
time window during which both transistors are turned on. As the output capacitance is discharged through the nMOS transistor, the
output voltage starts to fall. The drain-to-source voltage drop of the pMOS transistor becomes nonzero, which allows the pMOS
transistor to conduct as well. The short circuit current is terminated when the input voltage transition is completed and the pMOS
transistor is turned off. A similar event is responsible for the short-circuit current component during the falling input transition, when
the output voltage starts rising while both transistors are on.
Note that the magnitude of the short-circuit current component will be approximately the same during both the rising-input transition
and the falling-input transition, assuming that the inverter is symmetrical and the input rise and fall times are identical. The pMOS
transistor also conducts the current which is needed to charge up the small output load capacitance, but only during the falling-input
transition (the output capacitance is discharged through the nMOS device during the rising-input transition). This current component,
which is responsible for the switching power dissipation of the circuit (current component to charge up the load capacitance), is also
shown in Fig. 7.5. The average of both of these current components determines the total amount of power drawn from the supply.
For a simple analysis consider a symmetric CMOS inverter with k = kn = kp and VT = VT,n = |VT,p|, and with a very small capacitive
load. If the inverter is driven with an input voltage waveform with equal rise and fall times (t = trise = tfall), it can be derived that the
time-averaged short circuit current drawn from the power supply is
I_avg(SC) = (k/12) . ( (VDD - 2.VT)^3 / VDD ) . (τ/T) (7.7)
Figure-7.5: Input-output voltage waveforms, the supply current used to charge up the load capacitance and the short-circuit current in a
CMOS inverter with small capacitive load. The total current drawn from the power supply is the sum of both current components.
The short-circuit power dissipation is then
P_SC = VDD . I_avg(SC) = (k/12) . (VDD - 2.VT)^3 . τ . fCLK (7.8)
Note that the short-circuit power dissipation is linearly proportional to the input signal rise and fall times, and also to the
transconductance of the transistors. Hence, reducing the input transition times will obviously decrease the short-circuit current
component.
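The following Python fragment evaluates the short-circuit power using the Veendrick-style expression of (7.8) as reconstructed above; all numerical values are illustrative.

def short_circuit_power(k, v_dd, v_t, tau, f_clk):
    """P_SC = (k/12) . (VDD - 2.VT)^3 . tau . fCLK -- equation (7.8)."""
    return (k / 12.0) * (v_dd - 2.0 * v_t)**3 * tau * f_clk

# k = 100 uA/V^2, VDD = 5 V, VT = 0.8 V, 100 MHz clock
print(short_circuit_power(100e-6, 5.0, 0.8, 1e-9, 100e6))  # tau = 1 ns
print(short_circuit_power(100e-6, 5.0, 0.8, 2e-9, 100e6))  # doubles with tau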
Figure-7.6: Input-output voltage waveforms, the supply current used to charge up the load capacitance and the short-circuit current in a CMOS inverter with larger capacitive load and smaller input transition times. The total current drawn from the power supply is approximately equal to the charge-up current.
Now consider the same CMOS inverter with a larger output load capacitance and smaller input transition times. During the rising input transition, the output voltage will effectively remain at VDD until the input voltage completes its swing, and the output will start to drop only after the input has reached its final value. Although both the nMOS and the pMOS transistors are on simultaneously during the transition, the pMOS transistor cannot
conduct a significant amount of current since the voltage drop between its source and drain terminals is approximately equal to zero.
Similarly, the output voltage will remain approximately equal to 0 V during a falling input transition and it will start to rise only after
the input voltage completes its swing. Again, both transistors will be on simultaneously during the input voltage transition, yet the
nMOS transistor will not be able to conduct a significant amount of current since its drain-to-source voltage is approximately equal to
zero. This situation is illustrated in Fig. 7.6, which shows the simulated input and output voltage waveforms of the inverter as well as the
short-circuit and dynamic current components drawn from the power supply. Notice that the peak value of the supply current to charge
up the output load capacitance is larger in this case. The reason for this is that the pMOS transistor remains in saturation during the
entire input transition, as opposed to the previous case shown in Fig. 7.5 where the transistor leaves the saturation region before the
input transition is completed.
The discussion concerning the magnitude of the short-circuit current may suggest that the short-circuit power dissipation can be reduced
by making the output voltage transition times larger and/or by making the input voltage transition times smaller. Yet this goal should be
balanced carefully against other performance goals such as propagation delay, and the reduction of the short-circuit current should be
considered as one of the many design requirements that must be satisfied by the designer.
The nMOS and pMOS transistors used in a CMOS logic gate generally have nonzero reverse leakage and subthreshold currents. In a
CMOS VLSI chip containing a very large number of transistors, these currents can contribute to the overall power dissipation even
when the transistors are not undergoing any switching event. The magnitude of the leakage currents is determined mainly by the
processing parameters.
Of the two main leakage current components found in a MOSFET, the reverse diode leakage occurs when the pn-junction between the drain and the bulk of the transistor is reverse-biased. The reverse-biased drain junction then conducts a reverse saturation current
which is eventually drawn from the power supply. Consider a CMOS inverter with a high input voltage, where the nMOS transistor is
turned on and the output node voltage is discharged to zero. Although the pMOS transistor is turned off, there will be a reverse potential
difference of VDD between its drain and the n-well, causing a diode leakage through the drain junction. The n-well region of the pMOS
transistor is also reverse-biased with VDD, with respect to the p-type substrate. Therefore, another significant leakage current
component exists due to the n-well junction (Fig. 7.7).
Figure-7.7: Reverse leakage current paths in a CMOS inverter with high input voltage.
A similar situation can be observed when the input voltage is equal to zero, and the output voltage is charged up to VDD through the
pMOS transistor. Then, the reverse potential difference between the nMOS drain region and the p-type substrate causes a reverse
leakage current which is also drawn from the power supply (through the pMOS transistor).
The magnitude of the reverse leakage current of a pn-junction is given by the following expression
I_reverse = A . JS . ( 1 - e^(-q.Vbias/kT) ) ≈ A . JS (7.9)
where Vbias is the magnitude of the reverse bias voltage across the junction, JS is the reverse saturation current density and A is the junction area. The typical magnitude of the reverse saturation current density is 1 - 5 pA/µm², and it increases quite significantly with
temperature. Note that the reverse leakage occurs even during the stand-by operation when no switching takes place. Hence, the power
dissipation due to this mechanism can be significant in a large chip containing several million transistors.
Another component of leakage current in CMOS circuits is the subthreshold current, which is due to carrier diffusion between the source and the drain regions of a transistor in weak inversion. A MOS transistor in the subthreshold operating region behaves similarly to a bipolar device, and the subthreshold current exhibits an exponential dependence on the gate voltage. The amount of
the subthreshold current may become significant when the gate-to-source voltage is smaller than, but very close to the threshold voltage
of the device. In this case, the power dissipation due to subthreshold leakage can become comparable in magnitude to the switching
power dissipation of the circuit. The subthreshold leakage current is illustrated in Fig. 7.8.
Figure-7.8: Subthreshold leakage current path in a CMOS inverter with high input voltage.
Note that the subthreshold leakage current also occurs when there is no switching activity in the circuit, and this component must be
carefully considered for estimating the total power dissipation in the stand-by operation mode. The subthreshold current expression is
given below, in order to illustrate the exponential dependence of the current on terminal voltages.
I_D(subthreshold) ≅ I_0 . e^( q.(VGS - VT) / (n.kT) ) . ( 1 - e^(-q.VDS/kT) ) (7.10)
where I_0 depends on the process and the device geometry, and n > 1 is the subthreshold slope factor.
One relatively simple measure to limit the subthreshold current component is to avoid very low threshold voltages, so that the VGS of
the nMOS transistor remains safely below VT,n when the input is logic zero, and the |VGS| of the pMOS transistor remains safely below
|VT,p| when the input is logic one.
In addition to the three major sources of power consumption in CMOS digital integrated circuits discussed here, some chips may also
contain components or circuits which actually consume static power. One example is the pseudo-nMOS logic circuits which utilize a
pMOS transistor as the pull-up device. The presence of such circuit blocks should also be taken into account when estimating the overall
power dissipation of a complex system.
The switching power dissipation in CMOS digital integrated circuits is a strong function of the power supply voltage. Therefore,
reduction of VDD emerges as a very effective means of limiting the power consumption. Given a certain technology, the circuit
designer may utilize on-chip DC-DC converters and/or separate power pins to achieve this goal. As we have already discussed briefly in Section 7.2, however, the savings in power dissipation come at a significant cost in terms of increased circuit delay. When considering a drastic reduction of the power supply voltage below the new standard of 3.3 V, the issue of time-domain performance
should also be addressed carefully. In the following, we will examine reduction of the power supply voltage with a corresponding
scaling of threshold voltages, in order to compensate for the speed degradation. At the system level, architectural measures such as the
use of parallel processing blocks and/or pipelining techniques also offer very feasible alternatives for maintaining the system
performance (throughput) despite aggressive reduction of the power supply voltage.
The propagation delay expression (7.4) clearly shows that the negative effect of reducing the power supply voltage upon delay can be
compensated for, if the threshold voltage of the transistor is scaled down accordingly. However, this approach is limited due to the fact
that the threshold voltage cannot be scaled to the same extent as the supply voltage. When scaled linearly, reduced threshold voltages allow the circuit to achieve the same speed performance at a lower VDD. Figure 7.9 shows the variation of the propagation delay of a CMOS inverter as a function of the power supply voltage, for different threshold voltage values.
Figure-7.9: Variation of the normalized propagation delay of a CMOS inverter, as a function of the power supply voltage VDD and the
threshold voltage VT.
We can see, for example, that reducing the threshold voltage from 0.8 V to 0.2 V can improve the delay at VDD = 2 V by a factor of 2.
The influence of threshold voltage reduction upon propagation delay is especially pronounced at low power supply voltages. It should
be noted, however, that the threshold voltage reduction approach is restricted by the concerns on noise margins and the subthreshold
conduction. Smaller threshold voltages lead to smaller noise margins for the CMOS logic gates. The subthreshold conduction current
also sets a severe limitation against reducing the threshold voltage. For threshold voltages smaller than 0.2 V, leakage power dissipation
due to subthreshold conduction may become a very significant component of the overall power consumption.
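The trade-off can be checked numerically with the delay expression of (7.4) as reconstructed above. The Python fragment below, with C_load and k_n normalized to one (they only scale the result), reproduces the roughly twofold delay improvement quoted for VT = 0.8 V versus 0.2 V at VDD = 2 V.

import math

def t_phl(v_dd, v_t, c_load=1.0, k_n=1.0):
    """Propagation delay of a CMOS inverter, per equation (7.4)."""
    a = c_load / (k_n * (v_dd - v_t))
    return a * (2 * v_t / (v_dd - v_t)
                + math.log(4 * (v_dd - v_t) / v_dd - 1))

print(t_phl(2.0, 0.8) / t_phl(2.0, 0.2))   # roughly a factor of 2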
In certain types of applications, the reduction of circuit speed which comes as a result of voltage scaling can be compensated for at the
expense of more silicon area. In the following, we will examine the use of architectural measures such as pipelining and hardware
replication to offset the loss of speed at lower supply voltages.
Pipelining Approach
First, consider the single functional block shown in Fig. 7.10 which implements a logic function F(INPUT) of the input vector, INPUT.
Both the input and the output vectors are sampled through register arrays, driven by a clock signal CLK. Assume that the critical path in
this logic block (at a power supply voltage of VDD) allows a maximum sampling frequency of fCLK; in other words, the maximum
input-to-output propagation delay tP,max of this logic block is equal to or less than TCLK = 1/fCLK. Figure 7.10 also shows the
simplified timing diagram of the circuit. A new input vector is latched into the input register array at each clock cycle, and the output
data becomes valid with a latency of one cycle.
Figure-7.10: Single-stage implementation of a logic function and its simplified timing diagram.
Let Ctotal be the total capacitance switched every clock cycle. Here, Ctotal consists of (i) the capacitance switched in the input register
array, (ii) the capacitance switched to implement the logic function, and (iii) the capacitance switched in the output register array. Then,
the dynamic power consumption of this structure can be found as
P = Ctotal . VDD^2 . fCLK (7.11)
Now consider an N-stage pipelined structure for implementing the same logic function, as shown in Fig. 7.11. The logic function
F(INPUT) has been partitioned into N successive stages, and a total of (N-1) register arrays have been introduced, in addition to the
original input and output registers, to create the pipeline. All registers are clocked at the original sample rate, fCLK. If all stages of the
partitioned function have approximately equal delay of
tp,stage ≈ tP,max / N (7.12)
then the logic blocks between two successive registers can operate N times slower while maintaining the same functional throughput as
before. This implies that the power supply voltage can be reduced to a value of VDD,new, to effectively slow down the circuit by a
factor of N. The supply voltage to achieve this reduction can be found by solving (7.4).
Figure-7.11: N-stage pipeline structure realizing the same logic function as in Fig. 7.10. The maximum pipeline stage delay is equal to
the clock period, and the latency is N clock cycles.
The dynamic power consumption of the N-stage pipelined structure with a lower supply voltage and with the same functional
throughput as the single-stage structure can be approximated by
P_pipeline = [ Ctotal + (N - 1).Creg ] . VDD,new^2 . fCLK (7.13)
where Creg represents the capacitance switched by each pipeline register. Then, the power reduction factor achieved in an N-stage
pipeline structure is
P_pipeline / P = [ ( Ctotal + (N - 1).Creg ) / Ctotal ] . ( VDD,new / VDD )^2 (7.14)
As an example, consider replacing a single-stage logic block (VDD = 5 V, fCLK = 20 MHz) with a four-stage pipeline structure,
running at the same clock frequency. This means that the propagation delay of each pipeline stage can be increased by a factor of 4
without sacrificing the data throughput. Assuming that the magnitude of the threshold voltage of all transistors is 0.8 V, the desired
speed reduction can be achieved by reducing the power supply voltage from 5 V to approximately 2 V (see Fig. 7.9). With a typical ratio
of (Creg/Ctotal) = 0.1, the overall power reduction factor is found from (7.14) as 1/5. This means that replacing the original single-stage
logic block with a four-stage pipeline running at the same clock frequency and reducing the power supply voltage from 5 V to 2 V will
provide a dynamic power savings of about 80%, while maintaining the same throughput as before.
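The arithmetic of this example can be verified with equation (7.14); the short Python fragment below reproduces the approximately fivefold reduction.

def pipeline_power_ratio(n, v_ratio, creg_over_ctotal):
    """P_pipeline / P_single = (1 + (N-1).Creg/Ctotal) . (VDD,new/VDD)^2"""
    return (1 + (n - 1) * creg_over_ctotal) * v_ratio**2

print(pipeline_power_ratio(4, 2.0 / 5.0, 0.1))   # -> 0.208, i.e. about 1/5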
The architectural modification described here has a relatively small area overhead. A total of (N-1) register arrays have to be added to
convert the original single-stage structure into a pipeline. While trading off area for lower power, this approach also increases the
latency from one to N clock cycles. Yet in many applications such as signal processing and data encoding, latency is not a very
significant concern.
Another possibility of trading off area for lower power dissipation is to use parallelism, or hardware replication. This approach could be
useful especially when the logic function to be implemented is not suitable for pipelining. Consider N identical processing elements,
each implementing the logic function F(INPUT) in parallel, as shown in Fig. 7.12. Assume that the consecutive input vectors arrive at
the same rate as in the single-stage case examined earlier. The input vectors are routed to all the registers of the N processing blocks.
Gated clock signals, each with a clock period of (N TCLK), are used to load each register every N clock cycles. This means that the
clock signals to each input register are skewed by TCLK, such that each one of the N consecutive input vectors is loaded into a different
input register. Since each input register is clocked at a lower frequency of (fCLK / N), the time allowed to compute the function for each
input vector is increased by a factor of N. This implies that the power supply voltage can be reduced until the critical path delay equals
the new clock period of (N TCLK). The outputs of the N processing blocks are multiplexed and sent to an output register which operates
at a clock frequency of fCLK, ensuring the same data throughput rate as before. The timing diagram of this parallel arrangement is given
in Fig. 7.13.
Since the time allowed to compute the function for each input vector is increased by a factor of N, the power supply voltage can be
reduced to a value of VDD,new, to effectively slow down the circuit. The new supply voltage can be found, as in the pipelined case, by
solving (7.4). The total dynamic power dissipation of the parallel structure (neglecting the dissipation of the multiplexor) is found as the
sum of the power dissipated by the input registers and the logic blocks operating at a clock frequency of (fCLK / N), and the output
register operating at a clock frequency of fCLK.
P(parallel) = N · (Ctotal - Creg) · VDD,new^2 · (fCLK / N) + Creg · VDD,new^2 · fCLK = Ctotal · VDD,new^2 · fCLK (7.15)
Figure-7.12: N-block parallel structure realizing the same logic function as in Fig. 7.10. Notice that the input registers are clocked at a
lower frequency of (fCLK / N).
Note that there is also an additional overhead, consisting of the input routing capacitance, the output routing capacitance and the capacitance of the output multiplexor structure, all of which are increasing functions of N. If this overhead is neglected, the power reduction factor of the N-block parallel structure becomes

P(parallel) / P(single) = (VDD,new / VDD)^2 (7.16)

The lower bound of dynamic power reduction realizable with architecture-driven voltage scaling is found, assuming zero threshold voltage (so that the circuit delay scales roughly as 1/VDD and the supply can be reduced to VDD/N), as

P(parallel) / P(single) = 1 / N^2 (7.17)
Figure-7.13: Simplified timing diagram of the N-block parallel structure shown in Fig. 7.12.
Two obvious consequences of this approach are increased area and increased latency. A total of N identical processing blocks must be used to slow down the operation (clocking) speed by a factor of N. In fact, the silicon area grows even faster than the number of processors because of signal routing and the overhead circuitry. The timing diagram in Fig. 7.13 shows that the parallel
implementation has a latency of N clock cycles, as in the N-stage pipelined implementation. Considering its smaller area overhead,
however, the pipelined approach offers a more efficient alternative for reducing the power dissipation while maintaining the throughput.
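A similar sketch, assuming the simplified forms of (7.16) and (7.17) above, shows how the parallel structure's power reduction scales with N; names and voltage values are illustrative.

def parallel_power_ratio(vdd_old, vdd_new):
    # P(parallel) / P(single) per Eq. (7.16): with register and routing
    # overhead neglected, only the voltage scaling term remains.
    return (vdd_new / vdd_old) ** 2

# Lower bound of Eq. (7.17): with zero threshold voltage the supply can
# ideally be scaled to VDD/N, giving a reduction factor of 1/N^2.
for n in (2, 4, 8):
    print(n, parallel_power_ratio(5.0, 5.0 / n))   # 0.25, 0.0625, 0.015625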
Estimation and Optimization of Switching Activity

In the previous section, we have discussed methods for minimizing dynamic power consumption in CMOS digital integrated circuits by
supply voltage scaling. Another approach to low power design is to reduce the switching activity and the amount of the switched
capacitance to the minimum level required to perform a given task. The measures to accomplish this goal can range from optimization
of algorithms to logic design, and finally to physical mask design. In the following, we will examine the concept of switching activity,
and introduce some of the approaches used to reduce it. We will also examine the various measures used to minimize the amount of
capacitance which must be switched to perform a given task in a circuit.
It was already discussed in Section 7.2 that the dynamic power consumption of a CMOS logic gate depends, among other parameters, on the node transition factor aT, which is the effective number of power-consuming voltage transitions experienced by the output
capacitance per clock cycle. This parameter, also called the switching activity factor, depends on the Boolean function performed by the
gate, the logic family, and the input signal statistics.
Assuming that all input signals have an equal probability to assume a logic "0" or a logic "1" state, we can easily investigate the output
transition probabilities for different types of logic gates. First, we will introduce two signal probabilities, P0 and P1. P0 corresponds to
the probability of having a logic "0" at the output, and P1 = (1 - P0) corresponds to the probability of having a logic "1" at the output.
Therefore, the probability that a power-consuming (0-to-1) transition occurs at the output node is the product of these two output signal
probabilities. Consider, for example, a static CMOS NOR2 gate. If the two inputs are independent and uniformly distributed, the four
possible input combinations (00, 01, 10, 11) are equally likely to occur. Thus, we can find from the truth table of the NOR2 gate that P0
= 3/4, and P1 = 1/4. The probability that a power-consuming transition occurs at the output node is therefore
aT = P0 · P1 = (3/4) · (1/4) = 3/16 (7.18)
The transition probabilities can be shown on a state transition diagram which consists of the only two possible output states and the
possible transitions among them (Fig. 7.14). In the general case of a CMOS logic gate with n input variables, the probability of a power-
consuming output transition can be expressed as a function of n0, which is the number of zeros in the output column of the truth table.
aT = P0 · P1 = (n0 / 2^n) · ( (2^n - n0) / 2^n ) (7.19)
Figure-7.14: State transition diagram and state transition probabilities of a NOR2 gate.
The output transition probability is shown as a function of the number of inputs in Fig. 7.15, for different types of logic gates and
assuming equal input probabilities. For a NAND or NOR gate, the truth table contains only one "0" or "1", respectively, regardless of
the number of inputs. Therefore, the output transition probability drops as the number of inputs is increased. In an XOR gate, on the other hand, the truth table always contains an equal number of logic "0" and logic "1" values. The output transition probability therefore remains constant at 0.25.
Figure-7.15: Output transition probabilities of different logic gates, as a function of the number of inputs. Note that the transition probability of the XOR gate is independent of the number of inputs.
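The curves of Fig. 7.15 can be recomputed from truth tables with a few lines of Python; the sketch below implements Eq. (7.19) under the equiprobable-input assumption, with illustrative gate definitions.

from itertools import product

def transition_probability(gate, n_inputs):
    # alpha_T = P0 * P1 for equiprobable, independent inputs (Eq. 7.19)
    outputs = [gate(bits) for bits in product((0, 1), repeat=n_inputs)]
    n0 = outputs.count(0)
    p0 = n0 / 2 ** n_inputs
    return p0 * (1 - p0)

nor_gate = lambda bits: int(not any(bits))
nand_gate = lambda bits: int(not all(bits))
xor_gate = lambda bits: sum(bits) % 2

print(transition_probability(nor_gate, 2))    # 3/16 = 0.1875, Eq. (7.18)
print(transition_probability(nand_gate, 8))   # drops as inputs are added
print(transition_probability(xor_gate, 8))    # stays at 0.25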
In multi-level logic circuits, the distribution of input signal probabilities is typically not uniform, i.e., one cannot expect to have equal
probabilities for the occurrence of a logic "0" and a logic "1". Then, the output transition probability becomes a function of the input
probability distributions. As an example, consider the NOR2 gate examined above. Let P1,A represent the probability of having a logic
"1" at the input A, and P1,B represent the probability of having a logic "1" at the input B. The probability of obtaining a logic "1" at the
output node is
P1 = (1 - P1,A) · (1 - P1,B) (7.20)
Using this expression, the probability of a power-consuming output transition is found as a function of P1,A and P1,B.
aT = P0 · P1 = [ 1 - (1 - P1,A)(1 - P1,B) ] · (1 - P1,A)(1 - P1,B) (7.21)
Figure 7.16 shows the distribution of the output transition probability in a NOR2 gate, as a function of two input probabilities. It can be
seen that the evaluation of switching activity becomes a complicated problem in large circuits, especially when sequential elements,
reconvergent nodes and feedback loops are involved. The designer must therefore rely on computer-aided design (CAD) tools for
correct estimation of switching activity in a given network.
Figure-7.16: Output transition probability of NOR2 gate as a function of two input probabilities.
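The surface of Fig. 7.16 follows directly from Eqs. (7.20) and (7.21); the minimal sketch below evaluates it for a few assumed input probabilities.

def nor2_transition_probability(p1_a, p1_b):
    p1_out = (1 - p1_a) * (1 - p1_b)   # Eq. (7.20): output is "1" only for input 00
    return (1 - p1_out) * p1_out       # Eq. (7.21): alpha_T = P0 * P1

print(nor2_transition_probability(0.5, 0.5))   # 0.1875, the uniform-input case
print(nor2_transition_probability(0.9, 0.1))   # activity depends on input statistics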
In dynamic CMOS logic circuits, the output node is precharged during every clock cycle. If the output node was discharged (i.e., if the
output value was equal to "0") in the previous cycle, the pMOS precharge transistor will draw a current from the power supply during
the precharge phase. This means that the dynamic CMOS logic gate will consume power every time the output value equals "0",
regardless of the preceding or following values. Therefore, the power consumption of dynamic logic gates is determined by the signal-
value probability of the output node, and not by the transition probability. From the discussion above, we can see that signal-value probabilities are always larger than or equal to transition probabilities (since P0 is never smaller than P0 · P1); hence, the power consumption of dynamic CMOS logic gates is typically larger than that of static CMOS gates under the same conditions.
Switching activity in CMOS digital integrated circuits can be reduced by algorithmic optimization, by architecture optimization, by proper choice of logic topology and by circuit-level optimization. In the following, we will briefly review some of the measures that can be applied to optimize the switching probabilities, and hence, the dynamic power consumption.
Algorithmic optimization depends heavily on the application and on the characteristics of the data, such as dynamic range, correlation, and statistics of data transmission. Some of these techniques can be applied only to specific algorithms, such as Digital Signal Processing (DSP), and cannot be used for general-purpose processing. One possibility is choosing a vector quantization (VQ) algorithm that results in minimum switching activity. For example, the number of memory accesses, the number of multiplications and the number of additions can be reduced by about a factor of 30 if a differential tree search algorithm is used instead of the full search algorithm.
The representation of data may also have a significant impact on switching activity at the system level. In applications where data bits change sequentially and are highly correlated (such as the address bits used to access instructions), the use of Gray coding leads to a reduced number of transitions compared to simple binary coding. Another example is using sign-magnitude representation instead of the conventional two's-complement representation for signed data. A change in sign causes transitions of the higher-order bits in the two's-complement representation, whereas only the sign bit changes in the sign-magnitude representation. Therefore, the switching activity can be reduced by using the sign-magnitude representation in applications where data sign changes are frequent.
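The benefit of Gray coding for sequential, highly correlated data can be checked by counting bit flips over a full address sweep; the 8-bit range in the sketch below is an arbitrary illustration.

def count_bit_transitions(codes):
    # total number of bits that flip between successive code words
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))

n = 8                                  # e.g. 8-bit instruction addresses
binary = list(range(2 ** n))
gray = [i ^ (i >> 1) for i in binary]  # standard binary-to-Gray conversion

print(count_bit_transitions(binary))   # roughly 2 * 2**n flips (502 here)
print(count_bit_transitions(gray))     # exactly 2**n - 1 = 255: one flip per step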
An important architecture-level measure to reduce switching activity is based on delay balancing and the reduction of glitches. In multi-level logic circuits, the finite propagation delay from one logic block to the next can cause spurious signal transitions, or glitches, as a result of critical races or dynamic hazards. In general, if all input signals of a gate change simultaneously, no glitching occurs. But a
dynamic hazard or glitch can occur if input signals change at different times. Thus, a node can exhibit multiple transitions in a single
clock cycle before settling to the correct logic level (Fig. 7.17). In some cases, the signal glitches are only partial, i.e., the node voltage
does not make a full transition between the ground and VDD levels, yet even partial glitches can have a significant contribution to
dynamic power dissipation.
Glitches occur primarily due to a mismatch or imbalance in the path lengths in the logic network. Such a mismatch in path length results
in a mismatch of signal timing with respect to the primary inputs. As an example, consider the simple parity network shown in Fig. 7.18.
Assuming that all XOR blocks have the same delay, it can be seen that the network in Fig. 7.18(a) will suffer from glitching due to the
wide disparity between the arrival times of the input signals for the gates. In the network shown in Fig. 7.18(b), on the other hand, all
arrival times are identical because the delay paths are balanced. Such redesign can significantly reduce the glitching transitions, and
consequently, the dynamic power dissipation in complex multi-level networks. Also notice that the tree structure shown in Fig. 7.18(b)
results in smaller overall propagation delay. Finally, it should be noted that glitching is not a significant issue in multi-level dynamic
CMOS logic circuits, since each node undergoes at most one transition per clock cycle.
Figure-7.18: (a) Implementation of a four-input parity (XOR) function using a chain structure. (b) Implementation of the same function using a balanced tree structure.
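The effect of delay balancing can be illustrated with a unit-delay, gate-level simulation of the two structures of Fig. 7.18. The netlists, delays and input vectors below are assumptions made only for this sketch.

def settle(netlist, state, steps=8):
    # synchronous unit-delay model: every gate output updates once per step
    trace = []
    for _ in range(steps):
        state.update({out: state[a] ^ state[b] for out, a, b in netlist})
        trace.append(state[netlist[-1][0]])      # record the primary output
    return trace

def count_transitions(trace):
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)

chain = [("g1", "a", "b"), ("g2", "g1", "c"), ("g3", "g2", "d")]   # Fig. 7.18(a)
tree  = [("g1", "a", "b"), ("g2", "c", "d"), ("g3", "g1", "g2")]   # Fig. 7.18(b)

for name, net in (("chain", chain), ("tree", tree)):
    state = {"a": 0, "b": 0, "c": 1, "d": 1}
    settle(net, state)                           # reach a steady state
    state.update({"a": 1, "c": 0})               # two inputs change together
    trace = settle(net, state)
    print(name, trace, count_transitions(trace))
# The chain output glitches (extra transitions) because arrival times at g2
# and g3 are unbalanced; the balanced tree settles without glitching.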
Reduction of Switched Capacitance

It was already established in the previous sections that the amount of switched capacitance plays a significant role in the dynamic power
dissipation of the circuit. Hence, reduction of this parasitic capacitance is a major goal for low-power design of digital integrated
circuits.
At the system level, one of the approaches to reduce the switched capacitance is to limit the use of shared resources. A simple example
is the use of a global bus structure for data transmission between a large number of operational modules (Fig. 7.19). If a single shared
bus is connected to all modules as in Fig. 7.19(a), this structure results in a large bus capacitance due to (i) the large number of drivers
and receivers sharing the same transmission medium, and (ii) the parasitic capacitance of the long bus line. Obviously, driving the large
bus capacitance will require a significant amount of power consumption during each bus access. Alternatively, the global bus structure
can be partitioned into a number of smaller, dedicated local busses to handle the data transmission between neighboring modules, as shown in Fig. 7.19(b). In this case, the switched capacitance during each bus access is significantly reduced, yet multiple busses may
increase the overall routing area on chip.
Figure-7.19: (a) Using a single global bus structure for connecting a large number of modules on chip results in large bus capacitance
and large dynamic power dissipation. (b) Using smaller local busses reduces the amount of switched capacitance, at the expense of
additional chip area.
The type of logic style used to implement a digital circuit also affects the physical capacitance of the circuit. The physical capacitance is
a function of the number of transistors that are required to implement a given function. For example, one approach to reduce the physical capacitance is to use transfer gates instead of conventional CMOS logic gates to implement logic functions. Pass-gate logic design is attractive since fewer transistors are required to implement certain functions, such as XOR and XNOR. In many arithmetic operations where binary
adders and multipliers are used, pass transistor logic offers significant advantages. Similarly, multiplexors and other key building blocks
can also be simplified using this design style.
The amount of parasitic capacitance that is switched (i.e. charged up or charged down) during operation can be also reduced at the
physical design level, or mask level. The parasitic gate and diffusion capacitances of MOS transistors in the circuit typically constitute a
significant amount of the total capacitance in a combinational logic circuit. Hence, a simple mask-level measure to reduce power
dissipation is keeping the transistors (especially the drain and source regions) at minimum dimensions whenever possible and feasible,
thereby minimizing the parasitic capacitances. Designing a logic gate with minimum-size transistors certainly affects the dynamic
performance of the circuit, and this trade-off between dynamic performance and power dissipation should be carefully considered in
critical circuits. Especially in circuits driving large extrinsic capacitive loads, e.g., large fan-out or routing capacitances, the transistors
must be designed with larger dimensions. Yet in many other cases where the load capacitance of a gate is mainly intrinsic, the transistor
sizes can be kept at minimum. Note that most standard cell libraries are designed with larger transistors in order to accommodate a wide
range of capacitive loads and performance requirements. Consequently, a standard-cell based design may have considerable overhead in
terms of switched capacitance in each cell.
Adiabatic Logic Circuits

In conventional level-restoring CMOS logic circuits with rail-to-rail output voltage swing, each switching event causes an energy transfer from the power supply to the output node, or from the output node to the ground. During a 0-to-VDD transition of the output, the total output charge Q = Cload · VDD is drawn from the power supply at a constant voltage. Thus, an energy of Esupply = Cload · VDD^2 is drawn from the power supply during this transition. Charging the output node capacitance to the voltage level VDD means that at the end of the transition, the amount of energy Estored = Cload · VDD^2 / 2 is stored on the output node. Thus, half of the energy injected by the power supply is dissipated in the pMOS network, while the other half is delivered to the output node. During a subsequent VDD-to-0 transition of the output node, no charge is drawn from the power supply, and the energy stored in the load capacitance is dissipated in the nMOS network.
To reduce the dissipation, the circuit designer can minimize the switching events, decrease the node capacitance, reduce the voltage
swing, or apply a combination of these methods. Yet in all cases, the energy drawn from the power supply is used only once before
being dissipated. To increase the energy efficiency of logic circuits, other measures must be introduced for recycling the energy drawn
from the power supply. A novel class of logic circuits called adiabatic logic offers the possibility of further reducing the energy
dissipated during switching events, and the possibility of recycling, or reusing, some of the energy drawn from the power supply. To
accomplish this goal, the circuit topology and the operation principles have to be modified, sometimes drastically. The amount of energy
recycling achievable using adiabatic techniques is also determined by the fabrication technology, switching speed and voltage swing.
The term "adiabatic" is typically used to describe thermodynamic processes that have no energy exchange with the environment, and
therefore, no energy loss in the form of dissipated heat. In our case, the electric charge transfer between the nodes of a circuit will be
viewed as the process, and various techniques will be explored to minimize the energy loss, or heat dissipation, during charge transfer
events. It should be noted that fully adiabatic operation of a circuit is an ideal condition which may only be approached asymptotically
as the switching process is slowed down. In practical cases, energy dissipation associated with a charge transfer event is usually
composed of an adiabatic component and a non-adiabatic component. Therefore, reducing all energy loss to zero may not be possible,
regardless of the switching speed.
Adiabatic Switching
Consider the simple circuit shown in Fig. 7.20 where a load capacitance is charged by a constant current source. This circuit is similar to
the equivalent circuit used to model the charge-up event in conventional CMOS circuits, with the exception that in conventional CMOS,
the output capacitance is charged by a constant voltage source and not by a constant current source. Here, R represents the on-resistance
of the pMOS network. Also note that a constant charging current corresponds to a linear voltage ramp. Assuming that the capacitance
voltage VC is equal to zero initially, the variation of the voltage as a function of time can be found as
VC(t) = (Isource / C) · t (7.22)

Hence, the charging current can be expressed as a simple function of VC and time t.

Isource = C · VC(t) / t (7.23)

The energy dissipated in the resistance R during a charging time T is

Ediss = Isource^2 · R · T (7.24)

Combining (7.23) and (7.24), the dissipated energy can also be expressed as follows.

Ediss = (R · C / T) · C · VC^2(T) (7.25)
Now, a number of simple observations can be made based on (7.25). First, the dissipated energy is smaller than for the conventional
case if the charging time T is larger than 2 RC. In fact, the dissipated energy can be made arbitrarily small by increasing the charging
time, since Ediss is inversely proportional to T. Also, we observe that the dissipated energy is proportional to the resistance R, as
opposed to the conventional case where the dissipation depends on the capacitance and the voltage swing. Reducing the on-resistance of
the pMOS network will reduce the energy dissipation.
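A few lines of Python make the comparison concrete, evaluating Eq. (7.25) against the conventional Cload · VDD^2 / 2 loss; the R, C and VDD values are arbitrary illustrative assumptions.

def adiabatic_dissipation(r, c, t_charge, v_swing):
    # Eq. (7.25): constant-current (ramp) charging through the resistance R
    return (r * c / t_charge) * c * v_swing ** 2

def conventional_dissipation(c, v_swing):
    # step-voltage charging always dissipates C*V^2/2, independent of R
    return 0.5 * c * v_swing ** 2

r, c, vdd = 1e3, 100e-15, 3.3          # 1 kOhm, 100 fF, 3.3 V (assumed values)
for t in (0.1e-9, 0.2e-9, 1e-9, 10e-9):
    ratio = adiabatic_dissipation(r, c, t, vdd) / conventional_dissipation(c, vdd)
    print(t, ratio)                     # ratio = 2RC/T: below 1 only when T > 2RC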
We have seen that the constant-current charging process efficiently transfers energy from the power supply to the load capacitance. A
portion of the energy thus stored in the capacitance can also be reclaimed by reversing the current source direction, allowing the charge
to be transferred from the capacitance back into the supply. This possibility is unique to adiabatic operation, since in conventional
CMOS circuits the energy is dissipated after being used only once. The constant-current power supply must certainly be capable of
retrieving the energy back from the circuit. Adiabatic logic circuits thus require non-standard power supplies with time-varying voltage,
also called pulsed-power supplies. The additional hardware overhead associated with these specific power supply circuits is one of the design trade-offs that must be considered when using adiabatic logic.
In the following, we will examine simple circuit configurations which can be used for adiabatic switching. Note that most of the research on adiabatic logic circuits is relatively recent; therefore, the circuits presented here should be considered as examples only.
Other circuit topologies are also possible, but the overall approach of energy recycling should still be applicable, regardless of the
specific circuit configuration.
First, consider the adiabatic amplifier circuit shown in Fig. 7.21, which can be used to drive capacitive loads. It consists of two CMOS
transmission gates and two nMOS clamp transistors. Both the input (X) and the output (Y) are dual-rail encoded, which means that the
inverses of both signals are also available, to control the CMOS T-gates.
Figure-7.21: Adiabatic amplifier circuit which transfers the complementary input signals to its complementary outputs through CMOS
transmission gates.
When the input signal X is set to a valid value, one of the two transmission gates becomes transparent. Next, the amplifier is energized
by applying a slow voltage ramp VA, rising from zero to VDD. The load capacitance at one of the two complementary outputs is
adiabatically charged to VDD through the transmission gate, while the other output node remains clamped to ground potential. When
the charging process is completed, the output signal pair is valid and can be used as an input to other, similar circuits. Next, the circuit is
de-energized by ramping the voltage VA back to zero. Thus, the energy that was stored in the output load capacitance is retrieved by the
power supply. Note that the input signal pair must be valid and stable throughout this sequence.
Figure-7.22: (a) The general circuit topology of a conventional CMOS logic gate. (b) The topology of an adiabatic logic gate
implementing the same function. Note the difference in charge-up and charge-down paths for the output capacitance.
The simple circuit principle of the adiabatic amplifier can be extended to allow the implementation of arbitrary logic functions. Figure
7.22 shows the general circuit topology of a conventional CMOS logic gate and an adiabatic counterpart. To convert a conventional
CMOS logic gate into an adiabatic gate, the pull-up and pull-down networks must be replaced with complementary transmission-gate
networks. The T-gate network implementing the pull-up function is used to drive the true output of the adiabatic gate, while the T-gate
network implementing the pull-down function drives the complementary output node. Note that all inputs should also be available in complementary form. Both networks in the adiabatic logic circuit are used to charge up as well as charge down the output capacitances,
which ensures that the energy stored at the output node can be retrieved by the power supply, at the end of each cycle. To allow
adiabatic operation, the DC voltage source of the original circuit must be replaced by a pulsed-power supply with ramped voltage
output. Note that the circuit modifications which are necessary to convert a conventional CMOS logic circuit into an adiabatic logic
circuit increase the device count by a factor of two. Also, the reduction of energy dissipation comes at the cost of slower switching
speed, which is the ultimate trade-off in all adiabatic methods.
We have seen earlier that the dissipation during a charge-up event can be minimized, and in the ideal case be reduced to zero, by using a
constant-current power supply. This requires that the power supply be able to generate linear voltage ramps. Practical supplies can be
constructed by using resonant inductor circuits to approximate the constant output current and the linear voltage ramp with sinusoidal
signals. But the use of inductors presents several difficulties at the circuit level, especially in terms of chip-level integration and overall
efficiency.
An alternative to using pure voltage ramps is to use stepwise supply voltage waveforms, where the output voltage of the power supply is
increased and decreased in small increments during charging and discharging. Since the energy dissipation depends on the average
voltage drop traversed by the charge that flows onto the load capacitance, using smaller voltage steps, or increments, should reduce the
dissipation considerably.
Figure 7.23 shows a CMOS inverter driven by a stepwise supply voltage waveform. Assume that the output voltage is equal to zero
initially. With the input voltage set to logic low level, the power supply voltage VA is increased from 0 to VDD, in n equal voltage steps
(Fig. 7.24). Since the pMOS transistor is conducting during this transition, the output load capacitance will be charged up in a stepwise
manner. The on-resistance of the pMOS transistor can be represented by the linear resistor R. Thus, the output load capacitance is being
charged up through a resistor, in small voltage increments. Let VA,i = i · VDD / n denote the supply voltage level during the ith time increment. The capacitor current during this increment can be expressed as

iC(t) = C · dVout/dt = (VA,i - Vout(t)) / R (7.26)

Solving this differential equation with the initial condition Vout(ti) = VA,i - VDD/n (the previous supply level) yields

Vout(t) = VA,i - (VDD / n) · exp( -(t - ti) / RC ) (7.27)
Figure-7.24: Equivalent circuit, and the input and output voltage waveforms of the CMOS inverter circuit in Fig. 7.23 (stepwise charge-
up case).
Here, n is the number of steps of the supply voltage waveform. The amount of energy dissipated during one voltage step increment can
now be found as
Ediss,step = (1/2) · C · (VDD / n)^2 (7.28)
Since n steps are used to charge up the capacitance to VDD, the total dissipation is
Ediss,total = n · (1/2) · C · (VDD / n)^2 = (1/n) · (C · VDD^2 / 2) (7.29)
According to this simplified analysis, charging the output capacitance with n voltage steps, or increments, reduces the energy dissipation
per cycle by a factor of n. Therefore, the total power dissipation is also reduced by a factor of n using stepwise charging. This result
implies that if the voltage steps can be made very small and the number of voltage steps n approaches infinity (i.e., if the supply voltage
is a slow linear ramp), the energy dissipation will approach zero.
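The 1/n behavior of Eq. (7.29) can be checked numerically, as in the sketch below; the load capacitance and supply voltage are assumed values.

def stepwise_dissipation(c, vdd, n_steps):
    # n steps, each dissipating C*(VDD/n)^2/2, per Eq. (7.29)
    return n_steps * 0.5 * c * (vdd / n_steps) ** 2

c, vdd = 100e-15, 3.3                   # 100 fF load, 3.3 V swing (assumed)
for n in (1, 2, 4, 8, 16):
    print(n, stepwise_dissipation(c, vdd, n))   # n = 1 is the conventional CV^2/2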
Another example for simple stepwise charging circuits is the stepwise driver for capacitive loads, implemented with nMOS devices as
shown in Fig. 7.25. Here, a bank of n constant voltage supplies with evenly distributed voltage levels is used. The load capacitance is
charged up by connecting the constant voltage sources V1 through Vn to the load successively, using an array of switch devices. To
discharge the load capacitance, the constant voltage sources are connected to the load in the reverse sequence.
The switch devices are shown as nMOS transistors in Fig. 7.25, yet some of them may be replaced by pMOS transistors to prevent the
undesirable threshold-voltage drop problem and the substrate-bias effects at higher voltage levels. One of the most significant
drawbacks of this circuit configuration is the need for multiple supply voltages. A power supply system capable of efficiently generating
n different voltage levels would be complex and expensive. Also, the routing of n different supply voltages to each circuit in a large
system would create a significant overhead. In addition, the concept is not easily extensible to general logic gates. Therefore, stepwise
charging driver circuits can be best utilized for driving a few critical nodes in the circuit that are responsible for a large portion of the
overall power dissipation, such as output pads and large busses.
In general, we have seen that adiabatic logic circuits can offer significant reduction of energy dissipation, but usually at the expense of
switching times. Therefore, adiabatic logic circuits can be best utilized in cases where delay is not critical. Moreover, the realization of
unconventional power supplies needed in adiabatic circuit configurations typically results in an overhead both in terms of overall energy
dissipation and in terms of silicon area. These issues should be carefully considered when adiabatic logic is used as a method for low-
power design.
Figure-7.25: Stepwise driver circuit for capacitive loads. The load capacitance is successively connected to constant voltage sources Vi
through an array of switch devices.
References
1. A.P. Chandrakasan and R.W. Brodersen, Low Power Digital CMOS Design, Norwell, MA: Kluwer Academic Publishers, 1995.
2. J.M. Rabaey and M. Pedram, ed., Low Power Design Methodologies, Norwell, MA: Kluwer Academic Publishers, 1995.
3. A. Bellaouar and M.I. Elmasry, Low-Power Digital VLSI Design, Norwell, MA: Kluwer Academic Publishers, 1995.
4. F. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on VLSI Systems, vol. 2, pp. 446-455,
December 1994.
5. W.C. Athas, L. Svensson, J.G. Koller and E. Chou, "Low-power digital systems based on adiabatic-switching principles," IEEE Transactions on VLSI Systems, vol. 2, pp. 398-407, December 1994.
Chapter 8
TESTABILITY OF INTEGRATED SYSTEMS
● Design Constraints
● Testing
● The Rule of Ten
● Terminology
● Failures in CMOS
● Combinational Logic Testing
● Practical Ad-Hoc DFT Guidelines
● Scan Design Techniques
8.1 Design Constraints

The following paragraphs remind the designer of some basic rules to consider before starting a design. For each of these constraints, at least one tool exists to help develop the design in accordance with a given set of rules:
Every technology has its design rules, which describe the allowed geometrical implementations of the chips to be manufactured. These rules are provided by the technology department of each IC foundry. They are usually described in a document, with boxes representing the layers available in the technology, on which the allowed sizes, distances and geometrical constraints are indicated.
Figure-8.1:
The designer needs to run a program called DRC (Design Rule Check) to verify that the design does not violate the rules defined by the foundry. This verification step is as important as the simulation of the functionality of the design. Simulation alone cannot tell whether the rules are respected, and a violation could lead to shorts or opens in the physical silicon implementation when the chip is manufactured. Other verification tools should also be used, such as ERC and LVS, described below.
As a complement to DRC, LVS (Layout Versus Schematic) is another tool to be used, especially if the design started with a schematic entry tool. The aim of LVS is to check that the design at the layout level corresponds to, and remains coherent with, the schematic. Usually, designers start with a schematic and simulate it; if it is correct, they proceed to layout. But in some cases, such as full-custom or semi-custom designs, the layout implementation of the chip differs from the schematic because of simulation results, or because of a design error that simulation cannot easily detect: simulation can never be exhaustive. LVS checks that the designer produced the same representation at the schematic and layout levels; if not, the LVS tool reports the discrepancy. Of course, a simulation of the layout using the same stimuli as for the schematic gives additional confidence in the final design.
Latch-up caused the early problems that delayed the introduction of CMOS in the electronics industry. Also called the "thyristor effect", it can cause the destruction of the chip or a part of it. There is no complete cure for this phenomenon, but a set of design techniques exists to avoid, rather than resolve, latch-up occurrence. The origin of latch-up is the distribution of the basic N and P structures of the NMOS and PMOS transistors inside the silicon. In some cases, not only PN junctions are formed, but also PNPN or NPNP structures acting as parasitic thyristors. These parasitic elements can behave like a real thyristor and develop a high current, destroying the area around them, including the PMOS and NMOS transistors.
The most commonly used technique to avoid the formation of such a structure is to add "butting contacts" tying the Nwell (or Pwell) to Vdd (or to Ground). This technique cannot eliminate the latch-up mechanism entirely, but it reduces its effect.
Another electrical constraint in CMOS is ESD, or Electro-Static Discharge. Proper handling of CMOS chips is one way to avoid the destruction of gates by the electrostatic charges that can accumulate on the surface of a person's hands. This is why it is important to wear a conducting bracelet linked to ground when handling CMOS ICs. But even a grounded bracelet is not enough to protect CMOS chips from ESD. Inside the chip, two diodes at each pad link every I/O to Vdd and Gnd. These two large diodes protect the chip core (the CMOS transistor gates) from ESD by limiting over-voltages.
Figure-8.2:
Based on the previous paragraphs, ERC (Electrical Rule Check) verifies that the designer has included all the minimum implementations necessary for an ERC-free design: a sufficient number of well polarisations, the appropriate ESD pads, and VDD and VSS connections at the right places.
8.2 Testing
Design of logic integrated circuits in CMOS technology is becoming more and more complex as VLSI attracts ever more IC users and manufacturers. A problem common to both users and manufacturers is the testing of these ICs.
Figure-8.3:
Testing consists of checking whether the outputs of a functional system (functional block, integrated circuit, printed circuit board or a complete system) correspond to the inputs applied to it. If the test of this functional system is positive, then the system is good for use. If the outputs differ from those expected, then the system has a problem: either the system is rejected (go/no-go test), or a diagnosis is applied to it, in order to locate and possibly eliminate the problem's causes.
Testing is applied to detect faults after several operations: design, manufacturing, packaging and, especially, during the active life of a system, since failures caused by wear-out can occur at any moment of its usage.
Design for Testability (DfT) is the ability of simplifying the test of any system. DfT can be summarized as a set of techniques and design guidelines whose goals are to simplify test generation, to simplify test application, and to avoid timing problems.
8.3 The Rule of Ten

In the production process cycle, a fault can occur at the chip level. If a test strategy is considered from the beginning of the design, the fault can be detected rapidly, located and eliminated at very low cost. When a faulty chip is soldered onto a printed circuit board, the cost of remedying the fault is multiplied by ten. This factor of ten continues to apply at each level of assembly, until the system has been assembled, packaged and shipped to users.
Figure-8.4:
8.4 Terminology
At the system level the most used words are the following:
Testability expresses the ability of a Device Under Test (DUT) to be easily observed and controlled from its external environment.
Figure-8.5:
The Design for Testability is then reduced to a set of design rules or guidelines to be respected in order to facilitate the test.
The Reliability is expressed in terms of the probability that a device works without major problems for a given time. Reliability goes down as the number of components increases.
The Security is the probability that the user's life is not endangered when a problem occurs in a device. Security is enhanced by adding certain types of components for extra protection.
The Quality is essential in some types of applications, where a "zero defect" target is often required. Quality can be enhanced by a proper design methodology and a good technology, avoiding problems and simplifying testing.
8.5 Failures in CMOS

When a MOS circuit has been fabricated and initially tested, some mechanisms can still cause it to fail. Failures are caused either by design bugs or by wear-out (ageing or corrosion) mechanisms. The MOSFET transistor has two main characteristics, threshold voltage and transconductance, on which the performance of the circuit depends.
Figure-8.6:
Design bugs or defects generally result in device lengths and widths deviating from those specified for the process (design rules). This type of fault is difficult to detect since it may only manifest itself later, during the active life of the circuit, and it leads mostly to opens or breaks in conductors, or to shorts between conductors.
Failures are also caused by phenomena like "hot carrier injection", "oxide breakdown", "metallization failures" or "corrosion".
The consequences of hot carrier injection, for instance, are a shift of the threshold voltage and a degradation of the transconductance, because the gate oxide becomes charged when hot carriers are injected into it (usually electrons, in NMOS). Cross-talk is also a cause of (generally transient) faults, and requires the different parts of the device to be properly isolated from each other.
8.6 Combinational Logic Testing

It is more convenient to talk about "test generation for combinational logic testing" in this section, and about "test generation for sequential logic testing" in the next section. The solution to the problem of testing a purely combinational logic block is a good set of patterns detecting "all" the possible faults.
The first idea for testing an N-input circuit would be to apply an N-bit counter to the inputs (controllability), thus generating all 2^N combinations, and to observe the outputs for checking (observability). This is called "exhaustive testing", and it is very efficient, but only for circuits with few inputs. As the number of inputs increases, this technique becomes very time consuming.
Figure-8.7:
Most of the time, many of the patterns applied in exhaustive testing do not occur during normal operation of the circuit. So instead of spending a huge amount of time searching for faults everywhere, the possible faults are first enumerated and a set of appropriate vectors is then generated. This is called "single-path sensitization", and it is based on "fault-oriented testing".
Figure-8.8:
The basic idea is to select a path from the site of a fault, through a sequence of gates leading to an output of the combinational logic
under test. The process is composed of three steps :
● Manifestation : the gate inputs at the site of the fault are specified so as to generate the opposite of the faulty value (0 for SA1, 1 for SA0).
● Propagation : the inputs of the other gates are determined so as to propagate the fault signal along the specified path to the primary output of the circuit. This is done by setting these inputs to "1" for AND/NAND gates and to "0" for OR/NOR gates.
● Consistency : or justification. This final step finds the primary input pattern that realizes all the necessary input values. This is done by tracing backward from the gate inputs to the primary inputs of the logic, in order to obtain the test patterns.
Figure-8.9:
EXAMPLE 1 - SA1 of line 1 (L1) : the aim is to find the vector(s) able to detect this fault.
● Manifestation : L1 = 0, so input A = 0. In a fault-free situation, the output F changes with A when B, C and D are fixed; with L1 SA1, F = 0, for instance, even if A = 0 (F = 1 when fault-free).
● Propagation : Through the AND-gate : L5 = L8 = 1; this condition is necessary for the propagation of "L1 = 0", and leads to L10 = 0. Through the NOR-gate, since L10 = 0, we also need L11 = 0 so that the propagated manifestation can reach the primary output F. F is then read and compared with the fault-free value : F = 1.
● Consistency : From the AND-gate : L5 = 1, and then L2 = B = 1. Also L8 = 1, and then L7 = 1. So far we have found the values of A and B; once C and D are found, the test vectors are generated and ready to be applied to detect L1 SA1. From the NOT-gate : L11 = 0, so L9 = L7 = 1 (coherent with L8 = L7). From the OR-gate : L7 = 1, and since L6 = L2 = B = 1, the condition B + C + D = L7 = 1 is already satisfied, so C and D can each be either 1 or 0.
These three steps have led to four possible vectors detecting L1 SA1.
Figure-8.10:
EXAMPLE 2 - SA1 of line 8 (L8) : the same combinational logic, with one internal line SA1.
● Manifestation : L8 = 0.
● Propagation : Through the AND-gate : L5 = L1 = 1, then L10 = 0. Through the NOR-gate : we need L11 = 0, so as not to mask L10 = 0.
● Consistency : Since L8 is a branch of L7, L8 = 0 leads to L7 = 0. From the NOT-gate, L11 = 0 requires L9 = L7 = 1; L7 cannot be set to 1 and 0 at the same time. This incompatibility cannot be resolved, and the fault "L8 SA1" remains undetectable.
Figure-8.11:
EXAMPLE 3 - SA1 of line 2 (L2) : still the same combinational logic, with line L2 SA1.
● Manifestation : L2 = 0, which sets L5 = L6 = 0.
● Propagation : Through the AND-gate : L1 = 1, and we need L10 = 0. Through the OR-gate : L3 = L4 = 0, so we can have L7 = L8 = L9 = 0, but then, through the NOT-gate, L11 = 1.
The error "L2 SA1" propagated across a reconvergent path is masked, since the NOR-gate does not distinguish the origin of the propagation.
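The three examples can also be verified by brute force: simulate the fault-free and the faulty network over all 16 input vectors and compare the outputs. The Python sketch below encodes the gate topology as it can be inferred from the examples (branches L5/L6 of input B, the OR-gate B+C+D, branches L8/L9 of L7, the AND-gate, the NOT-gate and the output NOR-gate); it is an illustrative reconstruction of the figure, not a standard tool.

from itertools import product

def simulate(a, b, c, d, stuck=None):
    # 'stuck' is an optional (line_name, forced_value) stuck-at fault
    def line(name, value):
        return stuck[1] if stuck and stuck[0] == name else value
    l1 = line("L1", a)
    l2 = line("L2", b)
    l3, l4 = line("L3", c), line("L4", d)
    l5, l6 = line("L5", l2), line("L6", l2)   # branches of input B
    l7 = line("L7", l6 | l3 | l4)             # OR-gate: B + C + D
    l8, l9 = line("L8", l7), line("L9", l7)   # branches of L7
    l10 = line("L10", l1 & l5 & l8)           # AND-gate
    l11 = line("L11", 1 - l9)                 # NOT-gate
    return 1 - (l10 | l11)                    # output NOR-gate -> F

def detecting_vectors(fault):
    return [v for v in product((0, 1), repeat=4)
            if simulate(*v) != simulate(*v, stuck=fault)]

print(detecting_vectors(("L1", 1)))   # Example 1: the four vectors with A=0, B=1
print(detecting_vectors(("L8", 1)))   # Example 2: empty list, undetectable
print(simulate(1, 0, 0, 0) == simulate(1, 0, 0, 0, stuck=("L2", 1)))
# Example 3: True, i.e. the sensitization with A=1 and C=D=0 is masked at the
# NOR-gate (other vectors found by the exhaustive search do detect L2 SA1).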
8.7 Practical Ad-Hoc DFT Guidelines

This section provides a set of practical Design for Testability guidelines classified into three types: those facilitating test generation, those facilitating test application, and those avoiding timing problems.
All "design for test" methods ensure that a design has enough observability and controllability to provide for a complete and efficient
testing. When a node has difficult access from primary inputs or outputs (pads of the circuit), a very efficient method is to add internal
pads acceding to this kind of node in order, for instance, to control block B2 and observe block B1 with a probe.
Figure-8.12:
It is easy to observe block B1 by adding a pad just on its output, without breaking the link between the two blocks. Controlling block B2 means setting a 0 or a 1 at its input, while also being transparent to the link B1-B2. The logic functions for this purpose are a NOR-gate, transparent to a zero, and a NAND-gate, transparent to a one. In this way, the control of B2 is possible through these two gates. Another implementation of this cell is based on pass-gate multiplexers performing the same function, but with fewer transistors than the NAND and NOR gates (8 instead of 12).
The simple optimization of observation and control is not enough to guarantee a full testability of the blocks B1 and B2. This technique
has to be completed with some other techniques of testing depending on the internal structures of blocks B1 and B2.
This technique is an extension of the preceding one; multiplexers are used when the number of primary inputs and outputs is limited.
Figure-8.13:
In this case the major penalties are the extra devices and the propagation delays due to the multiplexers. Demultiplexers are also used to improve observability. Using multiplexers and demultiplexers allows the internal blocks to be accessed separately from each other, which is the basis of techniques relying on partitioning, or on bypassing blocks, to observe or control other blocks separately.
Partitioning large circuits into smaller sub-circuits reduces the test-generation effort. The test-generation effort for a general-purpose circuit of n gates is assumed to be proportional to somewhere between n^2 and n^3. If the circuit is partitioned into two sub-circuits, the amount of test-generation effort is reduced correspondingly.
Figure-8.14:
The example of the SN7480 full adder shows that exhaustive testing requires 512 tests (2^9), while a full test after partitioning into four sub-circuits, for SA0 and SA1 faults, requires 24 tests. Logical partitioning of a circuit should be based on recognizable sub-functions, and can be achieved physically by incorporating facilities to isolate and control clock lines, reset lines and power supply lines. Multiplexers can be used extensively to separate sub-circuits without changing the function of the global circuit.
Based on the same principle of partitioning, counters are sequential elements that need a large number of vectors to be fully tested. Partitioning a long counter corresponds to dividing it into sub-counters.
The full test of a 16-bit counter requires the application of 2^16 + 1 = 65537 clock pulses. If this counter is divided into two 8-bit counters, then each counter can be tested separately, and the total test time is reduced about 128 times (2^7). This is also useful if there are subsequent requirements to set the counter to a particular count for tests associated with other parts of the circuit (pre-loading facilities).
Figure-8.15:
One of the most important problems in sequential logic testing occurs at power-on, where the first state is random if there is no initialization. In this case it is impossible to start a test sequence correctly, because of the memory effects of the sequential elements.
Figure-8.16:
The solution is to provide flip-flops or latches with a set or reset input, and then to use them so that the test sequence would start with a
known state.
Ideally, all memory elements should be able to be set to a known state, but in practice this can consume a lot of silicon area; it is also not always necessary to initialize all the sequential logic. For example, a serial-in serial-out counter can have only its first flip-flop provided with an initialization; after a few clock pulses the counter is then in a known state.
Tester override is sometimes necessary, and requires the addition of gates before a Set or a Reset input so that the tester can override the initialization state of the logic.
Asynchronous logic uses memory elements in which state transitions are controlled by the sequence of changes on the primary inputs. There is thus no easy way to determine when the next state will be established. This is again a problem of timing and memory effects.
Figure-8.17:
Asynchronous logic is faster than synchronous logic, since its speed is limited only by gate propagation delays and interconnects. The design of asynchronous logic is, however, more difficult than that of synchronous (clocked) logic and must be carried out with due regard to the possibility of critical races (circuit behavior depending on two inputs changing simultaneously) and hazards (occurrence of a momentary value opposite to the expected value).
Non-deterministic behavior in asynchronous logic can cause problems during fault simulation. The time dependency of its operation can make testing very difficult, since the circuit is sensitive to tester signal skew.
Logical redundancy exists either to mask a static-hazard condition, or unintentionally (as a design bug). In both cases, it is not possible to make a primary output value depend on the value of a logically redundant node. This means that certain fault conditions on the node cannot be detected, such as a node SA1 of the function F.
Figure-8.18:
Another inconvenience of logical redundancy is the possibility for a non-detectable fault on a redundant node to mask the detection of a normally-detectable fault, such as an SA0 of input C in the second example, masked by an SA1 of a redundant node.
Automatic test pattern generators (ATPG) work in the logic domain and view delay-dependent logic as redundant combinational logic. In this case the ATPG will see an AND of a signal with its complement, and will therefore always compute a 0 on the output of the AND-gate (instead of a pulse). Adding an OR-gate after the AND-gate output permits the ATPG to substitute a clock signal directly.
Figure-8.19:
When a clock signal is gated with any data signal (for example, a load signal coming from a tester), a skew or any other hazard on that signal can cause an error at the output of the logic.
Figure-8.20:
This is also due to the asynchronous nature of such logic. Clock signals should be distributed in the circuit in accordance with a synchronous logic structure.
This is another timing situation to avoid, in which the tester cannot be synchronized if one or more clocks depend on asynchronous delays (across the D-input of flip-flops, for example).
Figure-8.21:
The problem is the same when a signal fans out to a clock input and a data input.
Self-resetting logic is related to asynchronous logic, since its reset input is independent of the clock signal.
Before the delayed reset, the tester reads the set value and continues normal operation. If a reset occurs before the tester observation, then the value read is erroneous. The solution to this problem is to allow the tester to override the reset, for example by adding an OR-gate with an inhibit input coming from the tester. In this way the right response is given to the tester at the right time.
Figure-8.22:
This approach is related, by its structure, to the partitioning technique. It is very useful for microprocessor-like circuits. Using this structure gives the external tester access to the three buses, which go to many different modules.
Figure-8.23:
The tester can then disconnect any module from the buses by putting its output into a high-impedance state. Test patterns can then be applied to each module separately.
Testing analog circuits requires a completely different strategy than testing digital circuits. In addition, the sharp edges of digital signals can cause cross-talk problems on analog lines that are routed close to them.
Figure-8.24:
If it is necessary to route digital signals near analog lines, then the digital lines should be properly balanced and shielded. Also, in circuits like analog-to-digital converters, it is better to bring out the analog signals for observation before conversion. For digital-to-analog converters, the digital signals should likewise be brought out for observation before conversion.
Bypassing a sub-circuit consists in propagating the sub-circuit's input signals directly to its outputs. The aim of this technique is to bypass a sub-circuit (part of a global circuit) in order to access another sub-circuit to be tested. The partitioning technique is based on the bypassing technique; both use multiplexers, applied in two different ways.
With the bypassing technique, sub-circuits can be tested exhaustively by controlling the multiplexers in the whole circuit. To speed up the test, some sub-circuits can be tested simultaneously if their propagation paths go through disjoint or separate sub-circuits.
Figure-8.25:
DfT Remarks
The techniques listed above do not represent an exhaustive list for DfT, but give a set of rules to respect as far as possible. Some of these guidelines aim at simplifying test vector generation, others at simplifying test vector application, and the remainder at avoiding timing problems in the design.
8.8 Scan Design Techniques

The set of design-for-testability guidelines presented above is a collection of ad hoc methods for designing random logic with respect to testability requirements. Scan design techniques, in contrast, are structured approaches to designing sequential circuits for testability.
The major difficulty in testing sequential circuits is determining the internal state of the circuit. Scan design techniques are directed at improving the controllability and observability of the internal states of a sequential circuit. In this way the problem of testing a sequential circuit is reduced to that of testing a combinational circuit, since the internal states of the circuit are under control.
The goal of the scan path technique is to reconfigure a sequential circuit, for the purpose of testing, into a combinational circuit. Since a
sequential circuit is based on a combinational circuit and some storage elements, the technique of scan path consists in connecting
together all the storage elements to form a long serial shift register. Thus the internal state of the circuit can be observed and controlled
by shifting (scanning) out the contents of the storage elements. The shift register is then called a scan path.
Figure-8.26:
The storage elements can be D, J-K, or R-S types of flip-flops, but simple latches cannot be used in a scan path. However, the structure of the storage elements differs slightly from the classical one. Generally, the selection of the input source is achieved using a multiplexer on the data input, controlled by an external mode signal. In our case this multiplexer is integrated into the D flip-flop, which is then called an MD-flip-flop (multiplexed-data flip-flop).
The sequential circuit containing a scan path has two modes of operation : a normal mode, and a test mode which configures the storage elements into the scan path.
In the normal mode, the storage elements are connected to the combinational circuit, in the loops of the global sequential circuit, which
is considered then as a finite state machine.
In the test mode, the loops are broken and the storage elements are connected together as a serial shift register (the scan path), receiving the same clock signal. The input of the scan path is called scan-in and the output scan-out. Several scan paths can be implemented in the same complex circuit if necessary, thus providing several scan-in inputs and scan-out outputs.
A large sequential circuit can be partitioned into sub-circuits, containing combinational sub-circuits, associated with one scan path each.
Efficiency of the test pattern generation for a combinational sub-circuit is greatly improved by partitioning, since its depth is reduced.
Before applying test patterns, the shift register itself has to be verified, by shifting in all ones (111...11), then all zeros (000...00), and comparing. The general test procedure is then the following:
1. Set test mode signal, flip-flops accept data from input scan-in
2. Verify the scan path by shifting in and out test data
3. Set the shift register to an initial state
4. Apply a test pattern to the primary inputs of the circuit
5. Set normal mode; the circuit settles, and the primary outputs of the circuit can be monitored
6. Activate the circuit clock for one cycle
7. Return to test mode
8. Scan out the contents of the registers, simultaneously scan in the next pattern
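The procedure can be illustrated with a small behavioral model of a scan chain, where the MD-flip-flop multiplexing is represented by a test-mode argument; the class and function names, and the toy combinational logic, are assumptions made only for this sketch.

class ScanChain:
    def __init__(self, n_bits):
        self.ff = [0] * n_bits                    # flip-flop outputs

    def clock(self, test_mode, scan_in=0, next_state=None):
        if test_mode:                             # test mode: serial shift
            scan_out = self.ff[-1]
            self.ff = [scan_in] + self.ff[:-1]
            return scan_out
        self.ff = list(next_state)                # normal mode: parallel load

def scan_test(chain, pattern, logic):
    # steps 1-4: shift the pattern into the chain (primary inputs omitted here)
    for bit in reversed(pattern):
        chain.clock(test_mode=1, scan_in=bit)
    # steps 5-7: one normal-mode clock applies the combinational logic
    chain.clock(test_mode=0, next_state=logic(chain.ff))
    # step 8: shift the response out (the next pattern could be shifted in here)
    return [chain.clock(test_mode=1) for _ in pattern]

chain = ScanChain(4)
invert = lambda state: [1 - b for b in state]     # toy next-state logic
print(scan_test(chain, [1, 0, 1, 1], invert))     # [0, 0, 1, 0], last flip-flop first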
Boundary Scan Test (BST) is a technique combining scan path and self-testing techniques to solve the problem of testing boards carrying VLSI integrated circuits and/or surface-mounted devices (SMD).
Printed circuit boards (PCB) are becoming so dense and complex, especially with SMD circuits, that most test equipment cannot guarantee good fault coverage.
Figure-8.27:
BST consists in placing a scan path (shift register) cell adjacent to each component pin and interconnecting these cells to form a chain around the border of the circuit. The BST circuits contained on one board are then connected together to form a single path through the board.
The boundary scan path is provided with serial input and output pads and appropriate clock pads which make it possible to :
Figure-8.28:
The IEEE 1149 family of testability bus standards is defined by the Joint Test Action Group (JTAG), an international committee of European and American IC manufacturers formed in 1986 at Philips, which has driven the technical development of the standard and promoted its use by all sectors of the electronics industry. The "Standard Test Access Port and Boundary-Scan Architecture", IEEE 1149.1, accepted by the IEEE standards committee in February 1990, is the first member of this family. Several other standards are under development and have been suggested as drafts to the technical committee of the IEEE 1149 standard.
Chapter 9
FUZZY LOGIC SYSTEMS
● Systems Considerations
● Fuzzy Logic Based Control Background
● Integrated Implementations of Fuzzy Logic Circuits
● Digital Implementations of Fuzzy Logic Circuits
● Analog Implementations of Fuzzy Logic Circuits
● Mixed Digital/Analog Implementations of Fuzzy Systems
● CAD Automation for Fuzzy Logic Circuits Design
● Neural Networks Implementing Fuzzy Systems
1 Systems Considerations
The use of fuzzy logic is rapidly spreading in the realm of consumer products design in order to satisfy the following requirements: (1) to
develop control systems with nonlinear characteristics and decision making systems for controllers, (2) to cope with an increasing
number of sensors and exploit the larger quantity of information, (3) to reduce development time, (4) to reduce costs associated with
incorporating the technology into the product. Fuzzy technology can satisfy these requirements for the following reasons.
Nonlinear characteristics are realized in fuzzy logic by partitioning the rule space, by weighting the rules, and by the nonlinear
membership function. Rule-based systems compute their output by combining results from different parts of the partition, each part
being governed by separate rules. In fuzzy reasoning, the boundaries of these parts overlap, and the local results are combined by
weighting them appropriately. That is why the output in a fuzzy system is a smooth, nonlinear function.
In decision-making systems, the target of modeling is not a control surface but the person whose decision-making is to be emulated. This
kind of modeling is outside the realm of conventional control theory. Fuzzy reasoning can tackle this easily since it can handle
qualitative knowledge (e.g. linguistic terms like “big” and “fast”, and rules of thumb) directly. In most applications to consumer
products, fuzzy systems do not directly control the actuators, but determine the parameters to be used for control. For example, they may
determine the washing time in a washing machine, decide whether it is the hand or the image that is shaking in a camcorder, compute
which object should be in focus in an auto-focus system, or determine the optimal contrast for watching television.
A fuzzy system encodes knowledge at two levels: knowledge which incorporates fuzzy heuristics, and the knowledge that defines the
terms being used in the former level. Due to this separation of meaning, it is possible to directly encode linguistic rules and heuristics.
This reduces the development time, since the expert’s knowledge can be directly built in.
Although the developed fuzzy system may have complex input-output characteristics, as long as the mapping is static during the
operation of the device, it can be discretized and implemented as a memory lookup on simple hardware. This further reduces
the cost of incorporating the knowledge into the device.
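For instance, a static mapping can be tabulated off-line and replaced at run time by a single indexed read. The following sketch is illustrative only (the input range, table size and the stand-in mapping are invented):

    # Discretize a static fuzzy input-output mapping into a lookup table.
    N = 256                                    # table size, e.g. for an 8-bit input
    x_min, x_max = 0.0, 100.0                  # assumed input range
    fuzzy_map = lambda x: min(1.0, x / 60.0)   # stand-in for the full fuzzy computation
    table = [fuzzy_map(x_min + i * (x_max - x_min) / (N - 1)) for i in range(N)]

    def lookup(x):
        # Run-time evaluation reduces to one memory access.
        i = round((x - x_min) * (N - 1) / (x_max - x_min))
        return table[max(0, min(N - 1, i))]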
E.H. Mamdani is well known in the fuzzy logic community for the work he did in the 1970s, which is still topical now. He extended the
application field of fuzzy logic theory to technical systems at a time when most scientists thought its applications were restricted to
non-technical fields (such as human sciences, trade, jurisprudence, etc.). He first suggested that a control task performed by an operator
could be performed just as well by fuzzy logic, after translating the operator's experience into qualitative linguistic terms
([Mam73]-[Mam77]). Mamdani's method then gave rise to many engineering applications, especially industrial fuzzy control and
command systems.
Mamdani introduced the fuzzification/inference/defuzzification scheme and used an inference strategy generally referred to as the
max-min method. This inference type links input linguistic variables to output ones in accordance with the generalized modus ponens,
using only the MIN and MAX functions (as T-norm and S-norm (or T-conorm) respectively). It makes it possible to achieve
approximate reasoning (or interpolative inference).
Consider a set of inference rules in the form of a fuzzy associative memory (FAM) represented in an inference matrix or table, and
some fuzzy sets with their respective membership functions attributed to each variable (fuzzification). Figure 2.1 represents the case
where two rules are activated, involving two input variables (x and y) and one output variable (r). Let x be a measure of x(t) at time t,
and y a measure of y(t) at the same time. Consider now the fuzzy sets A1, A2, B0 and B1, whose respective membership functions
µA1(x), µA2(x), µB0(y) and µB1(y) take positive values for x and y. The activated inference rules are then of the form "if x = Ai and
y = Bj then r = Ck".
A statement like x = A1 is true to a degree µA1(x), and a rule is activated when the combination of all membership grades (or truth
degrees) µi(k) in its condition part (premise) takes a strictly positive value. Several rules may be activated simultaneously. The max-min
method realises the AND operators of the different rule conditions by taking the minimum of the respective membership functions. The
premises can also include some OR operators, realised by taking the maximum of the membership functions, but this is rarely the case
in control systems.
The implications (connective then) are realised by truncation (or clipping) of the output sets. This consists in taking, for each point, the
minimum value between the membership grade resulting from the rule condition (fig. 2.1: µA2(x) and µB0(y)) and the membership
function of the respective output fuzzy set (fig. 2.1: µC1(r) and µC0(r)):
The rules are finally combined using the connective else, acting as the OR operator and interpreted as the maximum operation for each
possible value of the output variable (r in fig. 2.1) over the defined fuzzy sets (n sets).
With the operations defined above it is then possible to give an algorithm for fuzzy reasoning (in order to achieve a control action, for
example).
Figure-2.1: max-min method for 2 rules involving 2 input and 1 output variables
The max-min method finally requires a defuzzification stage, which is generally performed using the centre of gravity method. For the
above example it gives the real value r, using the membership function resulting from the max-min method.
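On a discretized output universe the centroid reads r* = Σ r·µ(r) / Σ µ(r) (a standard discrete form, stated here since the original formula is an image not reproduced). The sketch below runs the whole chain (fuzzification, max-min inference, centre-of-gravity defuzzification) on two invented triangular rules; all names and numeric values are illustrative only:

    # Max-min inference with centre-of-gravity defuzzification (illustrative).
    def tri(x, a, b, c):
        """Triangular membership function with feet a, c and peak b."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    # Two example rules: (x-set, y-set, output-set), each as triangle parameters.
    rules = [((0, 2, 4), (1, 3, 5), (0, 2, 4)),   # e.g. if x = A1 and y = B1 then r = C1
             ((2, 4, 6), (3, 5, 7), (2, 4, 6))]   # e.g. if x = A2 and y = B0 then r = C0

    def infer(x, y, n=101, r_min=0.0, r_max=6.0):
        rs = [r_min + i * (r_max - r_min) / (n - 1) for i in range(n)]
        mu = [0.0] * n
        for ax, ay, ar in rules:
            w = min(tri(x, *ax), tri(y, *ay))            # AND of the premise: MIN
            for i, r in enumerate(rs):
                mu[i] = max(mu[i], min(w, tri(r, *ar)))  # then: clipping; else: MAX
        s = sum(mu)
        return sum(r * m for r, m in zip(rs, mu)) / s if s else None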
2.2 Characteristics
There are several mathematical properties that make the MIN and MAX operators well suited (simple and efficient) to fuzzy inference,
as notably described in [Dub80] and [God94]. There are however several other methods, based on different ways of realising the OR
and AND operators, whose purpose is to improve some mathematical properties or numerical implementation characteristics.
The max-prod method, for example, is similar to the max-min method, except that all implications in the rules (then operations) are
realised by a product instead of a minimisation. The truth values of the rule conditions are used to scale the corresponding output sets
uniformly instead of clipping them at a certain level. This preserves the information contained in the shape of these sets, which is partly
lost with the max-min method. The product is moreover simpler and faster to execute than the minimum operator in software
implementations and allows some simplifications in numerical realisations of inferences (since it can deal with analytic expressions
instead of comparing pairs of stored points).
Another common method is the sum-prod method, which uses the arithmetic mean and the product to realise the OR and AND
operators respectively. Unlike the MAX operator, which selects only the maximum values, the sum takes into account all the sets
involved and conserves part of the information contained in their shapes.
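Relative to the max-min sketch above, these variants change only two scalar operators; as a hedged summary:

    # Operator choices distinguishing the three schemes
    # (w: rule truth degree, m: output membership grade):
    implication = {
        "max-min":  lambda w, m: min(w, m),  # clipping
        "max-prod": lambda w, m: w * m,      # scaling preserves the set's shape
        "sum-prod": lambda w, m: w * m,
    }
    # Rule combination: MAX for max-min and max-prod; arithmetic sum/mean for sum-prod.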
These different methods are described and compared in [Büh94]. From this analysis emerges the fact that they lead to very similar
input/output characteristics in the case of a single input variable, when used with triangular or trapezoidal output sets. With several input
variables, the max-min method produces non-linear characteristics with strong discontinuities, while the sum-prod method produces
non-linear characteristics with smoother discontinuities. Nevertheless, the choice of a specific method is mainly influenced by the way it
is implemented. This favours the max-min method for hardware implementations, because the MAX and MIN operators are then the
easiest to implement. These two operators are moreover the most robust realisations of the T- and S-norm according to the authors of
[Ngu93]. Since they are the most reliable when membership grades have imprecise and noisy values, they are well suited to fuzzy
hardware with questionable accuracy. The max-min method is finally well suited to fuzzy rule-based systems when there is no precise
model; it generally leads in a simple way to consistent rules, as can be noticed in practical applications.
The choice of an inference method is nevertheless of great importance for one-rule inferences, for example to select one among several
candidates or to choose an optimal solution (a frequent case, especially in non-technical fields). The operators must then be chosen with
care, because they directly influence the evaluation criterion and consequently the final decision.
Finally, one important aspect of Mamdani's method is that it is essentially heuristic, and it can sometimes be very difficult to express an
operator's or engineer's knowledge in terms of fuzzy sets and implications. Moreover, such knowledge is often incomplete and episodic
rather than systematic. There is in fact no specific methodology for deriving and analysing the rule base of a fuzzy inference system
(there is consequently no exhaustive choice of optimal rule and set numbers, shapes, operators, ...). Although their principle seems rather
simple, fuzzy systems include a lot of parameters and can thus lead to a great variety of complicated characteristics.
Problems may occur when the rules have to describe a process that is too complex or deal with a high number of variables. It can then
be very difficult to define a sufficient set of coherent rules, and the danger of having too few or conflicting rules arises.
There are several methods used to optimise inference rules and fuzzy sets, more or less refined and sometimes quite complicated
(typical ones are the gradient method, the least squares method, simulated annealing, neural networks, etc.). They give rise to adaptive
fuzzy systems whose parameters are tuned to the conditions of a specific application. The high number of parameters of fuzzy systems
makes automatic adaptation rather difficult.
Mamdani's method is currently and effectively applied to process control, robotics and other expert systems. It is especially well suited
to reproducing an operator's control or command action. It leads to good results, often close to the operator's, while removing the risk of
human error. Thus it has been used successfully in the control of several plants, such as those of the chemical, cement or steel
industries.
Mamdani-type control is simpler than most standard ones and requires a much shorter development cycle when linguistic rules can be
easily expressed (because there is no need to develop, analyse and implement a mathematical model). In several cases it is even as
efficient or more so, especially when no precise model exists, for example when the process to be controlled is governed by non-linear
laws or includes poorly known parameters or disturbances. When a mathematical model contains non-linear terms, they are linearized
and simplified under the assumption of small error signals, whereas a non-linear fuzzy method often makes it possible to control larger
ranges of error.
The non-linearity of Mamdani-type control can moreover have a favourable influence on transient phenomena. Consequently it can
sometimes supplant classic control to provide fast responses. When the response speed of a conventional controller is increased, the
overshoot also increases (in position control, for example). Generally, fuzzy controllers with highly non-linear characteristics give lower
overshoot before the settling time than conventional PID control, but small oscillations often remain after settling. This oscillatory
behaviour can become very difficult to restrain, and fuzzy controllers are sometimes combined with conventional ones (in a cascade
configuration, for example) to exploit the advantages of both. However, fuzzy controllers can also have perfectly linear characteristics
and replace standard controllers, which can sometimes be useful to retain some features of fuzzy controllers.
Fuzzy control is attractive in cases where parameters vary with time and are easily expressed in linguistic terms. It is indeed easier to
rewrite one or several fuzzy rules than to derive the mathematical equations of the new model needed to adapt a classic controller.
However, classic controllers have a smaller number of parameters to adjust, and these parameters can, for example, be issued from a
fuzzy system which regulates them according to the fluctuations of the controlled process.
The common centre of gravity defuzzification method requires an amount of calculation that is prohibitive for many real-time
applications with software implementations. Its calculation can however be simplified when associated with the sum-prod method
[Büh94]. The computation of the centre of gravity can take advantage of the high speed afforded by VLSI when integrated on an IC,
which is however quite complex. Simpler defuzzification types are often used, for example the mean-of-maximum or the height method
[Dri93],[Büh94],[God94].
A Sugeno-type controller differs from a Mamdani-type controller in the nature of the inference rules, generally called Takagi-Sugeno
rules in this case. Whereas Mamdani's method uses inference rules that only deal with linguistic variables, Takagi-Sugeno rules directly
lead to real values that are not membership functions but deterministic crisp values [Tak83],[Tak85]. This method only uses fuzzy sets
for the input variables, and no defuzzification stage is needed. Whereas the antecedents still consist of logical functions of linguistic
variables, the output values result from standard functions of the input variables (real crisp functions). In most cases only linear
functions are used. Some variables can appear in such a function and not in the corresponding condition, or vice versa. Each function
belongs to the consequence part of a rule and is considered when the respective condition has a positive truth degree. This degree results
from the different membership factors of the input fuzzy sets the rule condition deals with. The final output value is calculated as the
weighted average of all the linear functions, the weights being the truth degrees of their respective rule conditions.
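A sketch of this weighted-average evaluation, reusing tri() from the max-min example and with invented rules, could read:

    # Takagi-Sugeno inference: weighted average of linear consequence functions.
    def ts_infer(x, y, ts_rules):
        """ts_rules: list of (premise, consequence); premise(x, y) is a truth
        degree in [0, 1], consequence(x, y) a crisp linear function."""
        ws = [p(x, y) for p, _ in ts_rules]
        total = sum(ws)
        out = sum(w * f(x, y) for w, (_, f) in zip(ws, ts_rules))
        return out / total if total else None

    # Two invented rules for illustration.
    ts_rules = [
        (lambda x, y: min(tri(x, 0, 2, 4), tri(y, 1, 3, 5)),
         lambda x, y: 0.5 * x + 0.2 * y + 1.0),
        (lambda x, y: min(tri(x, 2, 4, 6), tri(y, 3, 5, 7)),
         lambda x, y: 1.5 * x - 0.3 * y),
    ]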
Figure-2.2: Takagi-Sugeno's method for 2 rules involving 2 input and 1 output variables
Assume the antecedent part results from a specific fuzzification achieved according to a human operator's or engineer's knowledge: some
fuzzy sets and some logical functions making up the premises are defined, and the input space has thus been divided into a certain
number of fuzzy subsets, i.e. a certain number of fuzzy rules. The setting-up of the consequence part of the rules then requires a certain
amount of numerical data (corresponding to input and output variables) coming from the operation to be modelled (fig. 2.3). The
coefficients of each linear function are identified from the analysis of these data in order to minimize the difference between the outputs
of the original system and those of the model (fuzzy engine). This optimisation is achieved through a weighted linear regression over a
certain amount of input data (the weights are the truth degrees of the fuzzy conditions).
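A minimal sketch of this identification step, assuming each data point is weighted by the truth degree of the rule condition (illustrative, not the authors' code):

    import numpy as np

    def identify_consequence(X, r, w):
        """Weighted least squares for one rule's linear consequence r ~ X.c + c0.
        X: (n, d) input data, r: (n,) observed outputs, w: (n,) truth degrees."""
        A = np.hstack([X, np.ones((X.shape[0], 1))])  # append the constant term
        sw = np.sqrt(w)[:, None]                      # weights enter LS as square roots
        coeffs, *_ = np.linalg.lstsq(A * sw, r * sw.ravel(), rcond=None)
        return coeffs                                 # [c1, ..., cd, c0]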
Finally, the system identification is a function of the consequence parameters (coefficients of the linear functions) as well as the premise
parameters (input sets) and the fuzzy partition of the input space. The identification of the latter has no analytical solution, since it is a
combinatorial problem, and is generally obtained from a heuristic search. An iterative algorithm for parameter and structure
identification can be used efficiently (fig. 2.4) [Sug85],[Tak85]. The resulting fuzzy model generally gives much better results than a
statistical model [Tak83].
Figure-2.3: Takagi-Sugeno's method: identification of a fuzzy logic inference engine
2.6 Characteristics
It is not possible to derive efficient control rules by Mamdani's method when it is too difficult to translate a human operator's or
engineer's knowledge exactly into linguistic terms. The adaptive Takagi-Sugeno method is however efficient in such cases, provided
that inference rules can be derived from the analysis of appropriate numerical data. A further advantage of this method appears in
complicated cases where many different variables play a part in a process: it then leads more reliably to consistent rules than Mamdani's
method.
When fuzzy reasoning is used to describe processes, methods such as Mamdani's are not so powerful, partly because of their
nonlinearity. The use of linear relations in the consequences makes Takagi-Sugeno rules easy to handle as an efficient mathematical tool
for fuzzy modelling. It is however not possible to cascade inference rules without performing a new fuzzification step on the crisp
output values.
The implementation of Takagi-Sugeno rules is simplified because no defuzzification stage is needed. Output values are simply
calculated from input ones, using a set of weighted linear functions. One needs however to be able to collect a certain amount of data
(sometimes over a rather long period for time-varying processes) and then to find good coefficients with the least squares method.
Measurements and data analysis often require a great deal of time, and it is furthermore not always possible to find satisfactory
coefficients. The Takagi-Sugeno method actually becomes difficult or even impossible to apply when a control has strongly non-linear
characteristics (for example a large hysteresis loop), all the more so when they change with time.
A method such as simulated annealing could be used to shorten the iterative sequence of tests needed for coefficient and structure
identification. The approximation of a measured characteristic, or of a calculated linear regression, could also be achieved efficiently
using a neural network. The structure identification would then result from a supervised learning phase instead of a long and unwieldy
iterative search.
Takagi-Sugeno rules are mainly used for process modelling. Fuzzy controllers are realised by regarding an operator's control action as a
process to model, and thus by collecting a lot of data while the operator is at work. A typical application of the Takagi-Sugeno method is
described in [Sug85], pp. 125-138. Sugeno tested this method with good results on the automated parking of a car into a garage. Fuzzy
control rules are derived from the observation and modelling of a person's parking action. The method is well suited to this situation,
which is easier to observe than to express in linguistic terms: arm and leg actions are indeed easier to measure than to verbalise, like
most gestures that have been acquired through heuristic experience or are instinctive rather than issued from precise reasoning.
Sugeno's method is a special case of the Takagi-Sugeno method; its particularity lies in the fact that all the output membership functions
are precise values. The consequence functions are then replaced by constants, which are weighted by the condition truth degrees to give
the final output values. The fuzzy sets forming these precise values are generally called singletons, or sets of Sugeno (fig. 2.5). It can
also be regarded as a particular case of Mamdani's method where vertical, zero-width, non-overlapping output sets are used.
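Schematically (with invented values), the whole consequence evaluation then collapses to a weighted average of constants:

    # Sugeno (singleton) inference: the consequences are constants s_i.
    def singleton_infer(weights, singletons):
        total = sum(weights)
        return sum(w * s for w, s in zip(weights, singletons)) / total if total else None

    singleton_infer([0.2, 0.5, 0.3], [1.0, 2.0, 4.0])   # -> 2.4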
2.9 Characteristics
The main advantage of using singletons is the great simplicity of implementing the consequence part. They can be used with Mamdani's
method to simplify considerably the defuzzification stage, whose task is reduced to the calculation of a weighted average over a
restricted set of crisp values. Another defuzzification type, called the maxima method, is sometimes used (cf. [Mam74bis],[Dri93]). In
this case each output crisp set corresponds to a specific action, and only the one with the maximum weight is selected (or an action
midway between two maximum values).
The use of singletons has no adverse consequence on the output variable domain, which can be the same as with triangular or trapezoidal
output sets when using the centre of gravity method. The nonlinearity of a controller characteristic can be modulated by the distribution
of these sets (but no longer by their shapes). Describing some processes can however become harder than with the more complex sets of
Mamdani's method or the linear functions of the Takagi-Sugeno method. Compared with the latter, the restriction to constant functions
in the consequence part no longer allows complicated relationships between input and output variables to be described in a rather simple
format. This method then sometimes leads to a worse minimisation of the output error between the real process and the fuzzy model,
and the structure identification (fuzzy partition of the input space) has to be much further refined to obtain satisfactory results in
complicated cases.
Recent advances in fuzzy logic theory have brought rule-based algorithms to a growing field of applications. Several implementation
approaches have been proposed during the last ten years, especially in Japan, where fuzzy systems have proliferated. Fuzzy control has
proved that it can lead to good performance with short design times in a wide variety of low-complexity real-world applications. Many
specific application programs have been developed to implement fuzzy operations on standard digital computers, and most processor
manufacturers provide software environments to develop and simulate fuzzy applications on their microcontrollers.
The design of fuzzy dedicated integrated circuits is however of great interest, because of the increasing number of fuzzy applications
requiring highly parallel and high-speed fuzzy processing. They attempt to give concrete expression to the idea of "fuzzy computers"
(sometimes called computers of the sixth generation), which deal with analog values or some digital representation of them. Fuzzy
processors are designed to optimise fuzzy logic functions (cf. §3.2) with respect to implementation size and execution speed. Since
practical systems often require a great number of rule evaluations per second, a huge amount of real-time data processing is necessary
and the speed of fuzzy circuits is of prime importance. Their architectures are generally suited to the structure of approximate reasoning
and decision-making algorithms and generally include three distinct parts, performing fuzzification, inference and defuzzification
respectively. Fuzzy controllers are thus made on one hand of a knowledge base, which contains rules and membership function outlines,
as well as other configuration parameters of the system; on the other hand stands an inference mechanism (based on interpolative
reasoning) interfacing a real process in a feedback loop configuration.
Figure-3.1: General structure of fuzzy logic dedicated processors for approximate reasoning
The structure of fuzzy dedicated circuits is characterized by the number and shape range of input and output variables, the number of
rules they can evaluate simultaneously, the type(s) of inference (size of premises, operators, consequences, ...), the type(s) of
defuzzification method, and so on. Their performance is evaluated according to their processing speed (the number of fuzzy logic
inferences per second, FLIPS) as well as their precision (error and noise generation in analog circuits, number of bits representing fuzzy
values in digital ones). Fast response is required for non-linear functions such as MIN and MAX, whose output signals can be subject to
sudden discontinuities. These functions are however piecewise linear, and accuracy is also quite important.
Fuzzy chips are mainly used in expert systems as well as in the control and command field, to achieve real-time performance. They are
however less efficient for applications dealing with a large amount of data, because of their limited I/O. It can thus be interesting to
design several compatible circuits, identical or different, dedicated for example to inference rules and to defuzzification respectively. In
this manner it is possible to connect them together in several different ways, using some of them, for example, to perform parallel
fuzzifications and inferences and connecting them to a defuzzification circuit [Hir93]. With a large number of input variables, it may be
useful to split the control system into several cascaded or superposed units, which simplifies the inference engine [Büh94].
Fuzzy software is useful when an application can be modelled, to simulate and calculate in advance a multidimensional response
characteristic. It gives the parameters of an optimal characteristic, used to design or program a fuzzy dedicated circuit; some of these
parameters (fuzzy sets and inference rules) then only have to be adjusted in the real implementation. In some simple cases, the numeric
response characteristic can be stored as a reference rule table in a multidimensional associative memory, which then provides response
values in real time without any further inference calculation. The stored values are provided by fuzzy software (when a model exists), by
measuring an operator's actions, or by an adaptive system with a retroactive learning scheme (which comes close to the principle of
fuzzy dedicated neural networks) [Chan93].
When no model exists, a fuzzy processor has to be programmable, because its parameters are established by iterative testing on the real
implementation, not by simulation. This process requires many tests, is time consuming, and may furthermore not reach an optimal
solution. Since no simulation of the system's dynamic behaviour is performed, the danger of unstable working states arises in control
applications. The configuration flexibility of an integrated fuzzy system requires a great storage capacity (digital circuits) or tunable
components (analog circuits) to deal with variable numbers and shapes of membership functions as well as variable numbers and sizes
of rules. Since the shape of the membership functions is generally less important than their degree of overlap, the horizontal position of
the sets should always be tunable, whereas complicated shapes are often not essential.
The fuzzy operators presented below have been introduced to deal with fuzzy values and operate on specific membership grades µ(x).
A fuzzy operator applied to two fuzzy sets A and B defines a new membership function. The real variable (x) is omitted in the
expressions below.
Conventional operators such as the algebraic sum (+), algebraic difference (-) and algebraic product (.) are commonly used in fuzzy
reasoning systems. The first nine fuzzy operators are the most commonly used in hardware system design to implement fuzzy
information processing. They moreover have the particularity that they can all be expressed using only the bounded difference (⊖) and
algebraic sum (+) functions [Yam86],[Zhij90],[Sas90]. There have been several approaches using these two functions to design fuzzy
units, which provides attractive perspectives for CAD automation and semicustom circuits. Other operators also allow any fuzzy
formula to be expressed when combined together, for example the bounded sum (⊕) and the bounded product (⊗) [Lem94]. These are
associative and commutative, but not distributive over each other. The non-distributivity of the bounded operators leads to long and
complicated manipulations when they are used to solve fuzzy formulae, and it is far from obvious how to substitute or eliminate some
terms. Fuzzy formulae unfortunately cannot be reduced as much, or as simply, as boolean formulae.
Some other fuzzy operators, such as the symmetric difference, drastic difference, drastic sum or drastic product, are sometimes used to
resolve fuzzy equations.
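The decomposition in terms of the bounded difference alone can be checked numerically. The identities below are the standard ones (stated here for illustration, not tied to any particular reference implementation):

    # Bounded difference as a universal primitive (grades in [0, 1]).
    bdiff = lambda a, b: max(0.0, a - b)                 # bounded difference
    f_min = lambda a, b: a - bdiff(a, b)                 # MIN(a, b)
    f_max = lambda a, b: b + bdiff(a, b)                 # MAX(a, b)
    f_not = lambda a: bdiff(1.0, a)                      # complement 1 - a
    bsum  = lambda a, b: 1.0 - bdiff(bdiff(1.0, a), b)   # bounded sum min(1, a + b)
    bprod = lambda a, b: bdiff(a, 1.0 - b)               # bounded product max(0, a + b - 1)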
The VLSI digital implementation of fuzzy logic systems offers several advantages stemming from the sound knowledge of digital
circuit design and technology. Several mature CAD tools allow relatively easy design automation (synthesis and simulation),
consequently reducing development time and cost. The automatic regeneration of logic levels gives high noise immunity and low
sensitivity to the variance of transistor characteristics. This provides accurate and reliable data and signal processing. Binary data can
easily be stored, which makes programmable and multistage fuzzy processing possible. Complex representations of fuzzy vectors and
parallel structures are however required to obtain accurate and fast processing. Digital implementations of common fuzzy operations
unfortunately lead rapidly to complicated, enormous VLSI circuits. The density and speed of these circuits are nevertheless continually
increasing with technological advances, so that they will become more and more efficient for implementing fuzzy logic systems.
Digital fuzzy processors are generally designed for multipurpose applications in order to interest a maximum of potential customers.
They should thus implement a large and varied set of fuzzy operators, membership functions and inference rules. This makes them
rather efficient for a large range of applications, provided that appropriate programming is possible (which supposes an appropriate
internal or external memory). Combined with an appropriate object-oriented programming environment, linguistic rules derived from a
human expert can be directly translated into an implementation on a chip.
The first hardware approaches to implementing a fuzzy logic inference engine were the digital circuits designed by Togai and Watanabe
[Tog86],[Tog87],[Wat90]. ASICs implementing specific architectures and specialized instructions (MIN and MAX) exhibit much
higher processing speed than regular microprocessors.
Two architectural options exist: the fuzzy microcontroller (an embedded system with A/D and D/A converters, a fuzzy arithmetic logic
unit (FALU), specific memories, ...) and the fuzzy coprocessor. Sequential processing speed can be enhanced by parallel, pipelined
[Nak93], systolic, RISC or SIMD architectures. For applications where the restricted number of pins of a chip is inappropriate (such as
fuzzy databases, i.e. ambiguous data or fuzzy relations between crisp data), or when other conventional digital processing is required
(flexible systems), a fuzzy coprocessor is used rather than a stand-alone chip.
Analog fuzzy values must be converted into strictly binary signals before being processed by standard digital circuits: on one hand the
analog input signals must be quantized through A/D converters, and on the other hand the membership functions must be quantized to
obtain their digital representations. Fuzzy sets are then storable, but in the guise of staircase functions. The combination of these two
round-off effects can deteriorate fuzzy processing if the fuzzy values are not represented with a sufficient number of bits. There is
however a trade-off between precision and size (or speed), since the latter grows with the number of bits.
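The trade-off can be made concrete: with n bits the grade resolution is 1/(2^n − 1). A toy quantizer (illustrative only):

    # Quantize a membership grade mu in [0, 1] onto n bits.
    def quantize(mu, n_bits=8):
        levels = (1 << n_bits) - 1           # 2^n - 1 steps, e.g. 255 for 8 bits
        return round(mu * levels) / levels   # worst-case error: 1 / (2 * levels)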
The inputs are furthermore sampled at the clock frequency of the digital circuits, so the resulting control is non-continuous (or
pseudo-continuous) ([Büh94], §7.3).
Representative digital implementations include:
● Togai's FC110: fuzzy accelerator, used as a coprocessor
● Watanabe: RISC approach
● THOMSON's WARP (Weight Associative Rule Processor): cell-to-cell mapping approach, automatic synthesis [Pag93]
● SIEMENS' coprocessor (SAE81C99): optimized memory organisation [Eich92]
● Sasaki: SIMD and logic-in-memory structure [Sas93]
● OMRON's FP-3000, FP-1000 & FP-5000
● NeuraLogix' controllers
Typical application fields are control, expert systems, robots, image recognition, diagnosis, databases (interfaced with digital circuits),
information retrieval, ...
Some less common hardware implementations, such as pulse-width modulation [Ung93] or superconducting processors, have also been
realized, and are not excluded from future developments according to technological evolution.
Analog circuits present several advantages over digital ones, especially regarding processing speed, power dissipation and functional
density. They can moreover perform continuous-time processing and have the particularity of being well matched to sensors, actuators
and all other analog signals. They are therefore obviously indicated for dealing with fuzzy values, which are analog by nature. Some
continuous representations of symbolic membership functions and some non-linear fuzzy operations can easily be synthesised by
exploiting transistor characteristics. There is no need for A/D or D/A converters when they are embedded in a real system, provided that
no specific digital signal processing is required. Analog circuits can then supplant digital controllers in applications requiring low-cost,
low-consumption, compact and high-speed stand-alone chips. They suffer nevertheless from the lack of reliable memory cells; they are
consequently not well suited to pipeline structures and have very restricted programmability. Fortunately, the nature of fuzzy variable
systems calls for extensive parallelism, which makes analog circuits well suited to performing numerous high-speed inferences and also
limits the problem of error accumulation.
Analog controllers can achieve real-time fuzzy reasoning with a large number of fuzzy implications, especially when no high accuracy is
required. They are well suited to vague and imprecise models, for example tasks that interface with human senses (eye, tactile nerves,
ear, ...) or replace human reasoning (pattern recognition, reflex processes, ...). Accuracy in fuzzy systems is actually not always so
important, since there is no thorough mathematical background available to define precise and exhaustive fuzzy methods. Imprecise but
adjustable analog devices are consequently suitable in a great number of cases (tunable membership functions are then needed to
optimize performance). They are nevertheless much less flexible and adaptable than digital circuits, which are programmable, and they
must be designed according to the structure of a specific application. Basic programmability is afforded when some external analog
parameters can be adjusted or when some binary inputs control internal switches.
The choice of the common fuzzy operators (§3.2) is not exhaustive, and other methods could just as well be implemented, since there is
no mathematical proof that some of them are really optimal. Thus other operators can be chosen instead of the common MIN and MAX
ones, in order to be performed efficiently with analog circuits. Some dense, fast and accurate hardware operators can give very good
practical results, even if they are theoretically not optimal. A novel way of evaluating the condition part of fuzzy rules has thus been
introduced in [Land93]: the degree of membership of an input vector in some fuzzy subspace is defined using a measure of the distance
between this vector and a central point of the subspace. MIN and MAX operators are no longer implemented, and very dense,
high-speed analog hardware may be realised (in current mode).
Analog signals are represented either by voltages (voltage mode, §5.2) or by currents (current mode, §5.4). Stable and low-noise analog
technologies (n-well CMOS, BiCMOS) must be used in order to design analog circuits having sufficient accuracy over a wide frequency
range. Reliable CAD tools for automatic design, as well as fast verification and simulation tools, are required for effective design.
5.2 Voltage-mode
Voltage mode is attractive because it makes it easy to distribute a signal to various parts of a circuit. Non-linear operators such as MIN,
MAX and truncation are quite easy to implement in voltage mode. Multiple-input MIN and MAX circuits constructed with bipolar
transistors are represented in figures 5.1 and 5.2 and are called emitter-coupled fuzzy logic gates. These basic non-linear gates present
good characteristics and robustness [Yam93]. Such circuits are impractical with MOS transistors, which cause an unacceptable error
associated with the transition regions in which multiple devices are active. CMOS multiple-input MIN and MAX circuits using
gain-enhanced voltage followers based on differential amplifiers are presented in [Fatt93] & [Tou94]; they are more complicated but
offer high-frequency and accurate performance according to the authors.
Yamakawa has designed a BiCMOS monolithic rule chip (TG004MC [Yam88], TG005MC [Hir93]) whose architecture is shown in
figure 5.3. It is constructed with about 600 transistors and 800 resistors. The response time of a fuzzy inference is about 1 µs (1 Mega-
FLIPS). This chip implements one fuzzy inference including three variables in the antecedent and one variable in the consequent. The
antecedent part is made of membership function circuits (fig. 5.4) providing, for each variable, seven possible fuzzy linguistic values
(one of which is selected according to a voltage label VLAB). These values can have four possible shapes assigned by an external
binary signal, and the slopes can be changed by external resistors. A special label "not assigned", corresponding to a constant high level
(+5V), can stand for constant membership functions whose value is 1.
A truncation circuit is no more than a certain number n of two-input MIN circuits connected to an n-input MAX circuit. A voltage level
acting as the truncation factor is applied to one input of each MIN circuit, and the membership grades of the output fuzzy sets are
applied to the second inputs. The output fuzzy sets are sampled in the consequent part by a membership function generator (MFG) in
order to form fuzzy words of 25 elements which can be processed by the truncation circuit. These words are realised by a switch array
controlled by a switch-setting circuit [Gup88],[Yam93], and represent the consequent membership functions as analog voltages on 25
lines.
Yamakawa also designed an analog defuzzifier chip (TB005PL [Yam88], TB010PL [Hir93]) which implements the centre of gravity
calculation for 25 values (fig. 5.5). Its architecture consists of an ordinary addition circuit in parallel with a weighted addition circuit,
followed by an analog divider. It is constructed with resistor arrays, op-amps, an analog divider and capacitors in a hybrid structure. The
response time of the defuzzification is about 5 µs and is almost entirely determined by that of the divider. The sum calculation is
performed by a simple network of 25 identical resistors connected to the same node, producing a current proportional to the desired
sum. The weighted addition is performed in the same way, but with 25 different resistors: 0, R, R/2, R/3, ..., R/24. For the emitter
junction of a bipolar transistor, the base-to-emitter voltage is proportional to the logarithm of the emitter current (over the range where
the effects of series and shunt resistors and of the reverse saturation current are negligible). Thus the division of the two currents
(proportional to the sum and the weighted sum respectively) is implemented by the subtraction of two base-to-emitter voltages. Finally,
the divider is followed by a level shifter with current-voltage conversion [Yam88bis].
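For reference (a textbook relation stated here to make the mechanism explicit, not taken from the original figure): in the forward-active region

    V_BE ≈ V_T · ln(I_C / I_S),   hence   V_BE1 − V_BE2 = V_T · ln(I_1 / I_2),

so subtracting two base-to-emitter voltages yields a voltage proportional to the logarithm of the quotient of the two currents.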
Several rule chips can be connected to such a defuzzifier chip, which calculates a deterministic value from the resulting maximal
membership function. Several defuzzifier chips can also be used to realise systems involving several conditions and several conclusions.
The main weak point of such systems is that they generally lead to non-optimal, cumbersome implementations.
OMRON has developed the above analog fuzzy chips, as well as a rudimentary analog fuzzy computer based on these two circuits: the
FZ-5000. The latter is a multi-board system in which each board includes four inference chips and one defuzzifier chip. Inference time
is about 15 µs, including defuzzification. Programming is done using switches and jumper pins, or by a specific software tool (FT-6000)
running on a personal computer.
Voltage-mode fuzzy circuits imply a large energy stored in the node parasitic capacitances (CV²/2), and speed is limited by the charging
delays of the various capacitors. They are moreover penalized by a certain lack of precision, because the signals are sensitive to changes
in the supply voltages. This is especially significant when the voltage range is restricted in order to limit transistor operation to a small
part of its characteristic, or when the electrical consumption must be limited. The problems mainly lie in the sizing of some components.
Several functions are very difficult to build in voltage mode, even basic ones such as the algebraic sum. The approach described above
needs resistors to perform additions and to convert voltages into currents; integrated resistors are unfortunately inaccurate, cumbersome,
and involve significant parasitic capacitances. The truncation of the consequent and the defuzzification pose an important problem as
regards the parallelism of the inference engine (especially when the number and size of the output sets are large). This approach implies
high power dissipation and large chip area, and leads to high-cost implementations.
5.3 Mixed-mode
5.3.1 Transconductance
A hybrid mode can be realized by implementing both current-mode and voltage-mode operations on the same chip. This makes it
possible to avoid some problems inherent to voltage mode while taking advantage of some benefits of current mode. For example, the
sum of analog voltages is rather complex to realize, while in current mode it corresponds to simple wire connections; this approach was
used in the defuzzifier chip of §5.2.1. The difficulty is to obtain linear and accurate conversions to swap between voltage and current
modes without losing too much precision. Efficient transconductance elements (fig. 5.6) should exhibit at the same time good linearity
and good frequency response, in order to deal with fast and non-linearly varying membership values. They should moreover occupy
little area and consume little power.
Operational transconductance amplifiers (OTA) can be used as basic building blocks to design analog CMOS circuits for fuzzy logic.
OTAs, diodes and OTA-simulated resistors are sufficient to realize every useful fuzzy circuit. Fig. 5.7 represents an OTA of the same
type as the ones used in [Inou91]. The fuzzy grade interval [0,1] is represented by [0V,1V] in order to ensure linear operation of the
OTAs. A more effective OTA, based on the transconductance element of fig. 5.6, is also presented in [Park86]; it is highly linear and
efficient over a wide range of frequencies, but at the expense of larger consumption and chip area.
A high-speed bounded difference operation can easily be implemented with OTAs. For the circuit represented in fig. 5.8, the
relationship between the input voltages (VIN1 & VIN2) and the current of the diode satisfies the definition of the bounded difference.
As the first OTA supplies a current proportional to the difference of the input voltages, the bounded difference is realized for µ1 greater
than or equal to µ2 when the input voltages represent the membership functions µ1 and µ2. It is also realized for µ1 < µ2, since the diode
can only pass the unidirectional output current of the OTA. The second OTA finally converts this current into a proportional output
voltage (VOUT), in order to obtain a voltage-mode bounded difference characteristic: its output current is indeed identical to the diode
current, and also proportional to the output voltage, which is connected back to an input of the second OTA.
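In symbols, and under the assumption of ideal, equal-transconductance OTAs (a sketch, not an exact device model):

    I_D = max(0, g_m · (V_IN1 − V_IN2)),   V_OUT = I_D / g_m = max(0, V_IN1 − V_IN2),

which is exactly the bounded difference on the voltage-coded membership grades.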
This OTA-based bounded difference operator can be used effectively to synthesise all other fuzzy functions according to the
relationships described in §3.2. The algebraic sum used in these relationships is realised by simple wire connections before finally
converting the currents back into voltages with OTAs. The fuzzy complement operation is obtained when the positive input voltage
(VIN1) is connected to the high level (µ1 = 1) in the circuit of fig. 5.8 (VOUT is then the complement of VIN2). The synthesis of
two-input MAX and MIN operators is represented in fig. 5.9. Since µ1 ≥ (µ1 ⊖ µ2) in every case, MIN(µ1, µ2) = µ1 ⊖ (µ1 ⊖ µ2) =
µ1 − (µ1 ⊖ µ2). The algebraic difference leads to a faster MIN operator synthesis than two cascaded bounded differences. A current
equal to the negative value −(µ1 ⊖ µ2) is obtained by permuting the two input voltages (VIN1 as µ1 and VIN2 as µ2) in a common
bounded difference circuit whose diode (D1) has been connected in the reverse sense. The diode D2 is only needed if the output voltage
has to be limited to positive values.
Multiple-input MAX and MIN circuits are also described and analysed in [Inou91], where a fuzzy membership function circuit is also
presented. They are made of two-stage OTA structures which provide high-speed signal processing. OTA-based realisations of other
fuzzy functions are however often quite complicated and require more OTA stages, which radically deteriorates accuracy and processing
speed. Several OTAs are actually necessary to synthesise some of the relationships of §3.2, since many voltage-current or
current-voltage conversions are needed before performing bounded differences (which deal with voltages) and algebraic sums (which
deal with currents). It is moreover quite difficult to perform simplifications when these different functions are combined, and the
principal optimisation consists of increasing the parallelism of the OTA stages. This OTA-based approach unfortunately leads rapidly to
very big and not so accurate circuits when implementing the fuzzy processing of complex inference systems.
Multiple-input MIN and MAX circuits can be realized with current mirrors composed of a standard MOS transistor and a lateral MOS
transistor in bipolar operation [Tou94]. They are suitable for current-mode CMOS circuits and are based on the same principle as the
voltage-mode circuits of figs. 5.1 & 5.2. Input currents are converted into gate-source voltages which are applied to the base terminals of
bipolar transistors connected as voltage followers. The voltages are processed according to the MIN or MAX operation before being
converted into an output current. Precision depends on the symmetry of the structures, based on MOS-bipolar mirrors followed by the
inverse bipolar-MOS mirror (whose bipolar transistor also compensates the DC level shift of approximately 0.7 V).
5.4 Current-mode
Current-mode circuits do not need resistors and can achieve summation and subtraction in the simplest way, just by wire connections.
This leads to simple and intuitive configurations exhibiting high speed and great functional density. They are used more and more,
especially for systems requiring a high level of interconnectivity (neural networks, for example). High speed is obtained when the
capacitive nodes are not subject to large voltage fluctuations. Current-mode circuits can also exhibit advantages such as low power
dissipation and low supply voltage, as well as good insensitivity to fluctuations of the latter. Since they have a single fan-out, current
repeatability is of prime importance and the distribution of signals requires multiple-output current mirrors.
A basic realisation of a multiple-output CMOS current mirror is shown in fig. 5.11. This circuit is however not suitable for synthesising
accurate functions, since each output current is slightly modulated by the output voltage through the Early conductance. The output
current should be independent of the output voltage, which is obtained by reducing the output conductance, as in the three common
mirrors shown in fig. 5.12. The drain voltage of the transistor which imposes the current (drain voltage of T2 in fig. 5.12) is then
independent of the output voltage of the circuit. Multiple-output cascode mirrors are often used, but Wilson ones are preferable for
low-power applications because they require a single polarization voltage (VG(T1) in fig. 5.12.a) instead of the two superposed voltages
(VG(T1) & VG(T3) in fig. 5.12.b). The Mod-Wilson mirror is obtained by adding a transistor to the Wilson mirror (T4 in fig. 5.12.c) to
improve its symmetry. This mirror provides good accuracy, and the input current is well reproduced when the identical transistors are
perfectly matched. The precision of all these mirrors depends on their output resistance (which must be as high as possible) and on the
matching of their transistors. The quality of current reproduction is very important in cascade configurations, to limit error
accumulation. Dynamic mirrors can be used to obtain greater accuracy, but at the expense of a clocking scheme which considerably
increases their size and complexity.
Figure-5.11: Basic n-output CMOS current mirror and symbolic representation
All the above-mentioned mirrors can be constructed in standard bipolar technology as well as in MOS.
Multiple-output current mirrors can be realized as compact bipolar circuits using multicollector transistors, whose accuracy is however
poorer than that of structures with single-collector transistors. Bipolar transistors produce two types of significant errors, due to the base
current and to the reverse-mode current. The latter causes the saturation of one collector to affect the other collectors, and the circuits
should be designed so that no collector in the multiple-output current mirrors can saturate. These errors do not appear in multiple-output
MOS current mirrors, since their input-output paths are separated and their drain currents are independent of each other. The design of
cascaded MOS structures is thus much easier than that of bipolar structures, whose stages are interdependent. It is generally preferred,
all the more since it is compatible with standard CMOS fabrication processes and efficient design tools.
The mismatch between two identical transistors is however smaller for bipolar transistors than for MOS ones, since it does not depend
on the collector current level and since they are much less influenced by surface effects. All MOS transistors must work in saturation
mode in order to reduce the mismatch effects. Bipolar current mirrors are thus more appropriate for working at low voltages, and are
more precise and fast at low currents (the speed of MOS mirrors decreases at low currents because of their intrinsic capacitances).
Current mirrors can be used as building blocks to synthesise fuzzy logic operations and the relevant processing.
In this way, the nine basic fuzzy functions (cf. §3.2) can easily be implemented on monolithic ICs in standard CMOS
[Zhij90],[Yam86] or bipolar technologies [Yam87]. These current-mode basic logic cells exhibit a good linearity which cannot easily be
achieved in voltage mode, and lead to fuzzy integrated systems that are globally smaller than in voltage mode.
Figure-5.13: CMOS and bipolar implementation of the bounded difference operation [Yam86],[Yam87]
A bounded difference circuit can be obtained by combining a current mirror and a diode (fig. 5.13). The diode can easily be realized in
the CMOS circuit either by a single FET whose gate and drain are connected together (fig. 5.13) or by a current mirror (fig. 5.14). The
first solution involves inevitable voltage drops, due to the channel resistance, which can influence the normal logic function of the
circuits. Nevertheless, the diode can be omitted in cascade connections of such circuits, because the input current mirror of the following
stage also performs its task.
Figure-5.14: Two different implementations of the bounded difference operation [Zhij90]
Current mirrors are subdivided into two complementary types according to whether their transistors are n- or p-channel MOSFETs
(NPN or PNP in bipolar technology). The directions of the input and output currents depend on the types of the respective components
(input mirror and output diode in the bounded difference circuit). There are thus four different configurations of input and output current
directions (two of which are shown in fig. 5.14 and are suitable for cascade connections). To each configuration corresponds a
complementary one, obtained by substituting p-channel current mirrors for n-channel ones and vice versa. This is convenient for
designing circuits using such fuzzy logic units as basic bricks, without worrying about the specific current directions between
neighbouring bricks.
The circuits of figs. 5.13 and 5.14 realize the bounded difference operation on two membership grades µx & µy represented by the two
currents I1 and I2 respectively. They also realize the complement operation on µy (represented by I2) when I1 has the maximum value
(representing a membership grade equal to 1). The bounded product is realized when I1 represents the sum µx + µy and I2 represents 1.
As the bounded difference and the algebraic sum are sufficient to realize all other fuzzy functions (according to the relations described
in §3.2), fuzzy circuits can be designed simply by specifying the connections between bounded difference subcircuits. Multiple-output
current mirrors are also required to distribute currents to several logic units (fig. 5.15). Such basic cells are attractive prospects for
developing CAD tools and semicustom ICs (arrays of logic cells adaptable to various specifications according to the wiring
configuration). This leads however to solutions which are generally not optimal as regards the number of transistors.
Multiple-input MAX and MIN circuits are proposed in [Sas90]; they aim at avoiding error accumulation and increasing speed relative to
binary-tree realizations based on two-input MIN and MAX subcircuits. The operation of these circuits can be formulated using
simultaneous bounded difference equations. A simpler multiple-input MAX circuit is described in [Batu94].
Three "primitive" operators have been introduced to obtain more elementary basic cells than the bounded difference
[Lem93],[Lem93bis],[Lem94]. These operators are defined in the following way :
All fuzzy functions can be formulated as an additive combination of these primitive operators since the bounded difference can be
expressed in the following way:
These operators are used to reduce the complexity of electrical realization of fuzzy functions, since they lead to simple relationship
between transistor-level circuits and symbolic representation of fuzzy formulae. They are actually realisable in a most simple way by
using current mirrors (fig.5.16) and exhibit special properties that help to obtain reduced forms of compositional expressions. Some of
these properties are as following:
Several relations are rather complicated when expressed from bounded difference equations and should be reduced with the help of the
above properties. This also decreases the number of cascaded current-mirror stages necessary to implement them. It is also important to
favour parallel operations in rule evaluations, so that error accumulation is reduced and processing speed increased.
As an example of such a process, the MIN operation can be implemented with two cascaded current mirrors (fig. 5.17), after a certain
number of simplifications of its mathematical formulation:
Figure-5.17: Synthesis of the MIN function based on primitive operators [Lem94]
There are various solutions for realizing non-linear analog membership function generators in current mode, but they are directly influenced by their physical features and often exhibit poor temperature behaviour. Consequently it is quite attractive to design circuitry achieving piecewise-linear membership functions [Kett93], [Yam88-3]. The representation of complicated membership functions can also be subdivided into an appropriate composition of elementary logic subcircuits. The fuzzification of a crisp variable x through membership functions with triangular shapes can be expressed, for example, in terms of the primitive operators P and N as follows [Lem94]:
The input value i determines the horizontal position of the triangular functions and designates in this manner one of the linguistic labels (negative high, negative low, zero, ...).
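As an illustration of such piecewise-linear fuzzification (the centres, half-width and labels below are assumed values, not those of [Lem94]):

```python
# Triangular (piecewise-linear) membership functions for fuzzification.

def triangle(x, centre, half_width):
    """mu(x) = max(0, 1 - |x - centre| / half_width)."""
    return max(0.0, 1.0 - abs(x - centre) / half_width)

# One triangle per linguistic label; the centre index shifts the
# horizontal position, as the input value i does in the text.
LABELS = ["negative high", "negative low", "zero",
          "positive low", "positive high"]
CENTRES = [-1.0, -0.5, 0.0, 0.5, 1.0]

def fuzzify(x, half_width=0.5):
    """Membership grade of the crisp input x for every label."""
    return {lab: triangle(x, c, half_width)
            for lab, c in zip(LABELS, CENTRES)}

print(fuzzify(-0.3))   # 'negative low' and 'zero' overlap, others are 0
```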
As current-mode circuits are restricted to a single fan-out, multiple current mirrors are required to share signals among several operational blocks. Voltage-mode inputs are thus preferable for fuzzy hardware systems, since they must be distributed to the membership function circuits of many rule blocks. Current-mode signals are appreciated afterwards, because of the advantages provided by current-mode processing. Tunable voltage-input, current-mode membership function circuits are consequently useful building blocks for performing fuzzification with current-mode analog hardware [Chen92], [Sas92], [Ishi92]. They can also be used with the OTA-based approach described above. Such circuits can be realized with an OTA including variable source resistors, which consist of integrated voltage-controlled resistors and which can change the OTA transconductance characteristic [Chen92]. This achieves triangular membership functions with variable heights, widths, slopes and horizontal positions. Such membership function circuits are suitable for realizing high-speed and small-size systems. Nevertheless their sizing becomes difficult as their complexity increases, and their characteristics are affected by variations of physical parameters (mismatches, temperature influence, ...).
When achieved with the centre-of-gravity method, defuzzification consists of a weighted average requiring a sum, a multiplication (weighting) and a division. The sum operation consists of wired connections, while the multiplication can be realized by scaling currents by means of asymmetrical current mirrors. The analog division remains the most complex operation and requires a rather large chip area and processing time. The defuzzification operation is consequently often simplified in hardware implementations.
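A minimal behavioural sketch of this weighted average, assuming singleton consequences c_i activated to grades w_i:

```python
# Centre-of-gravity defuzzification with singleton consequences:
# u = sum(w_i * c_i) / sum(w_i). In current mode the sum is a wired
# addition and the weighting an asymmetrical mirror; the division is
# the costly operation discussed in the text.

def centre_of_gravity(grades, singletons):
    num = sum(w * c for w, c in zip(grades, singletons))
    den = sum(grades)
    return num / den if den > 0 else 0.0

print(centre_of_gravity([0.2, 0.7, 0.1], [0.2, 0.5, 0.9]))  # 0.48
```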
However, the required division can be performed easily and rapidly by current-mode normalization circuits [Ishi92]. For the circuit of figure 5.18 the following relations hold:
The normalization circuit can be directly extended to more than three inputs. The sum of all output currents is normalized to I0, so each output current (representing a membership grade) is divided by this sum (the sum of all membership grades). There is no further need to divide the weighted sum of these currents. The weight of each current can be realized in the normalization circuit by the W/L ratio of the respective output transistor. Good precision is obtained when such a circuit is implemented by means of bipolar transistors or bipolar-operated MOS. When using saturated MOS transistors operating in weak inversion with VS=0, the circuit is very inaccurate and temperature-dependent, due to the large variation of VT0 from device to device.
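A behavioural model of such a normalization stage, under the idealizing assumption that the circuit scales each input so that the outputs sum to I0 (the weight values below are illustrative):

```python
# Behavioural model of a current-mode normalization circuit [Ishi92]:
# each output current is the corresponding input scaled so that the
# outputs sum to I0. With per-output weights (set by W/L ratios), the
# summed output directly gives the weighted average times I0, so no
# divider stage is needed afterwards.

def normalize(currents, I0=1.0, weights=None):
    weights = weights or [1.0] * len(currents)
    total = sum(currents)
    return [I0 * w * i / total for i, w in zip(currents, weights)]

grades = [0.2, 0.7, 0.1]
print(sum(normalize(grades)))                           # = I0
print(sum(normalize(grades, weights=[0.2, 0.5, 0.9])))  # 0.48, as above
```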
A main disadvantage of analog circuits concerns the defuzzification stage (and especially the analog divider) in terms of size, accuracy and processing speed. The quantity and complexity of the calculations are, however, greatly decreased by the use of singleton consequences. These are more appropriate for hardware implementations than complicated output sets, which would have to consist of a limited number of discrete analog values. It can moreover be observed that singleton consequences facilitate a linear interpolation between the inference results of different control rules [Yam89], [Yam93].
Yamakawa has sought to eliminate the analog division in a voltage-mode singleton-consequent controller by using grade-controllable membership function circuits (obtained by modifying ordinary membership function circuits [Yam89], [Yam93]). These can be tuned in such a way that the sum of all membership grades is equal to unity (or constant), so that the weighted sum no longer needs any division. The regulation consists of shifting each membership function characteristic down and up according to the variations of the sum of membership grades, through a negative feedback loop.
The current-mode rule block with voltage interface (and OTA-based membership function circuits) described in [Sas92] also includes a normalization locked loop which eliminates the division operation of the weighted average in a similar way (fig.5.19). The sum of membership grades is regulated by using its fluctuations as the modulation factor (Vm) of the membership function circuits through a negative feedback loop. This regulation is attractive since it is faster than the analog division in fuzzy hardware implementations. Nevertheless, the implementation of such circuits requires complicated units and connections, and leads to difficult sizing and testing. The normalization solution described in §5.4.4 is in the end far simpler for current- and mixed-mode circuits.
The main weak point of analog circuits is the lack of reliable analog memory modules. Since there is no accurate and lasting way to store analog fuzzy values, analog fuzzy computers exhibit poor programmability and poor multistage sequential processing. Temporary memory elements have however been proposed to keep signals stable within a sampling period. This allows fuzzy inference engines to be designed with pipelined structures and consequently enhanced speed [Zhij90bis]. Such basic memory cells are however not suitable for implementing complicated sequential algebra in the way that digital circuits do thanks to binary flip-flop and register circuits. An analog value can however be stored in a more lasting way when represented by a voltage, by means of a capacitor. This component is the core of sample-and-hold circuits, which are the basic cells of analog memory elements.
A voltage-mode fuzzy flip-flop has been proposed as an extension of the binary J-K flip-flop [Gup88]. It is based on fuzzy negation, a T-norm and an S-norm, which are respectively restricted to complementation, MIN and MAX operations in order to ease its hardware implementation. The characteristics of fuzzy flip-flops based on other operations, such as algebraic product and sum, bounded product and sum, and drastic product and sum, are reported in [Hir89]. Its structure is described by the following set and reset equations, which generate the same state values as a digital J-K flip-flop in the case of boolean input and state values.
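As an illustrative assumption (the exact set and reset equations of [Gup88] may differ), the sketch below takes the binary J-K characteristic equation Q+ = (J AND NOT Q) OR (NOT K AND Q) and substitutes complement, MIN and MAX for the boolean operators; boolean corner inputs then reproduce the digital J-K behaviour, as the text requires:

```python
# Fuzzy J-K next-state function: boolean AND/OR/NOT replaced by
# MIN (T-norm), MAX (S-norm) and complement. Equation is an assumed
# illustration, not reproduced from [Gup88].

def fuzzy_jk_next(J, K, Q):
    return max(min(J, 1.0 - Q), min(1.0 - K, Q))

# Boolean corners give hold / set / reset / toggle, as in digital logic:
for J, K, Q in [(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 1)]:
    print(J, K, Q, "->", fuzzy_jk_next(J, K, Q))

print(fuzzy_jk_next(0.8, 0.3, 0.4))   # a genuinely fuzzy state value
```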
The fuzzy flip-flop can be implemented with a combinatorial part synthesising these equations, and two sample-and-hold circuits driven by two control pulse circuits with opposite phases (fig.5.20). The present output Q(t) is memorized by the two sample-and-hold circuits since this information is needed for the next state.
Figure-5.20: Fuzzy J-K flip-flop: circuit block diagram [Gup88]
Current-mode continuous-valued memory elements can also be realized in the same way by using voltage-controlled current sources in both sample-and-hold circuits to store a control voltage representing the sampled current [Balt93]. The capacitor of a sample-and-hold circuit is thus charged according to the control voltage of a first current source. In the "hold" mode (after half a clock cycle), this stored voltage sets the output current through a second current source in the image of the input one.
Storage capacitors are designed according to a trade-off between high speed and small silicon area on the one hand, and sufficient accuracy on the other (their capacitance should be high relative to the parasitic ones). High accuracy is however difficult to achieve, since it is affected by the imprecision of integrated capacitors, the mismatch in pairs of current sources, and above all the charge injection into parasitic capacitances. This last source of sampling error is reduced by the master-slave configuration of the sample-and-hold circuits, since the injected charges are opposite in sign (first-order sampling errors are thus cancelled).
It can actually be attractive to exploit the complementarity between digital and analog features and to merge them into a single mixed chip, in order to improve on the weak points of both [Fatt93], [Zhij90bis]. A fuzzy knowledge base can be programmed in a digital memory which consists of dedicated locations and stores a variable number of parameters characterizing, notably, membership function shapes and inference rules. Highly parallel, non-sequential analog processing is then afforded, provided that D/A converters are used. A/D converters can also be used when a digital computation of the centre-of-gravity is desired.
The VLSI Design group of SGS-THOMSON has also undertaken the design of a hybrid controller implemented in a mixed analog/digital technology [Pag93bis], [Pag93-3]. It consists of a digital storage and distribution unit followed by an analog inference core. The membership grades are converted into analog values by an internal D/A converter. This system does not need expensive A/D and D/A converters, in contrast with digital controllers. Its high speed should suit very demanding real-time requirements with a limited number of rules.
● defining specific cells to implement fuzzy operations: a standard-cell approach usable with the usual CAD environments
● THOMSON's approach [Pag93]
Design automation of analog fuzzy blocks provides a standard-cell approach allowing a fast and reliable design strategy to be built (very similar to a digital one). The use of p- and n-channel static current mirrors as building blocks is well suited to creating a design automation framework for generating the layout of fuzzy units. Such a development environment for current-mode CMOS fuzzy logic circuits has been created from a standard graphical tool and a specific silicon compiler [Lem93], [Lem94]. The graphical tool provides a logical simulation of fuzzy algorithms and helps in the design of fuzzy system architectures. The silicon compiler generates, from the mathematical expression of a given fuzzy algorithm, its corresponding layout. The system is based on a three-level hierarchy, consisting of current mirror circuits as generic cells for building elementary logical blocks (MIN, MAX, ...), which can finally be assembled into sophisticated fuzzy units. The use of the three primitive operators (cf. §5.4.3) leads to analytical expressions which are suitable for describing a fuzzy unit at all levels, from the current mirrors up to high-level fuzzy algorithms. The aim of this methodology is to eliminate the redundant fuzzy function elements which exist in the trivial implementation of fuzzy functions. The silicon compiler then replaces each term of the mathematical equations by its electrical counterpart. It then tries to reduce and place each electrical block in such an optimized way that wire lengths and IC area are minimised. The interconnections of these wires determine the vertical current flow between p- and n-mirrors from one stage to another. Certain configurations which cause functional problems must be forbidden, and trivial connections of p- and n-mirrors involving static consumption should be avoided (the case where serial p- and n-mirrors of a branch drive a static current between the high and low levels). The automatic cell generator then produces the final layout as the physical representation of the fuzzy algorithm. As an application of this environment, the prototype fuzzy controller FLC#001 has been designed in a standard 1.2 µm CMOS technology [Lem94]. This low-power, small-size circuit achieves 10 MegaFLIPS and is quite efficient for real-time control applications.
As the designs resulting from such a strategy remain close to the fuzzy algorithms, testing of the resulting circuits is made easier. Internal currents cannot, however, be measured without adding supplementary output current mirrors, which increase the circuit size. Gate voltages can be measured, but they give only an imprecise estimate of the transistor currents.
The fusion of neural networks and fuzzy logic does not only concern their similarities, but also the mutual compensation of their different features. Thus neural networks can offer fuzzy systems a solution to the problems of structure identification (number and size of the rules making up a suitable fuzzy partition of the feature space) and parameter identification (number and characteristics of the membership functions). They are in fact able, in some cases, to replace the tedious and sometimes hazardous identification scheme by supervised learning. Knowledge acquisition by self-organizing networks can be used to realize adaptive fuzzy systems, which are quite attractive when linguistic rules cannot easily be derived or when a great number of rules are required. However, there is no guarantee of convergence in the learning scheme. The necessary number of neurons is moreover unpredictable, and the acquisition of new knowledge is fairly difficult without starting a new learning phase.
There are several ways to combine neural networks and fuzzy logic, which differ from author to author. The first idea consists of using the high flexibility of neural networks, produced by learning, to provide automatic design and fine tuning of the membership functions used in fuzzy control. Such an adaptive neuro-fuzzy controller adjusts its performance automatically by accumulating experience in a learning phase. National Semiconductor Corp. has developed the NeuFuz embedded system, which provides a neural-network front end for fuzzy logic technology. A first layer performs fuzzification, a second creates the rule base and a third, a single neuron, does rule evaluation and defuzzification. Neural networks involve a great deal of computation and hardware investment, which is prohibitive for many real-time applications. In this case they therefore emulate and optimise a fuzzy logic system rather than implement the final application directly. The National Semiconductor solution accordingly includes the capability to generate assembly code for a strictly fuzzy logic solution. Recently Neuralogix has put on the market the NLX230 fuzzy microcontroller, which consists of a VLSI digital fuzzy logic engine based on the min-max method. It includes a neural network implementing a high-speed minimum comparator block connected with 16 parallel fuzzifiers on one side and a maximum comparator on the other side. Another approach aims at solving the problems of consequence identification and of defuzzification by using the Takagi-Sugeno method, which lends itself well to an implementation framework based on adaptive neural networks [Jan92].
Instead of using neurons to implement parts of fuzzy systems, the structure of approximate reasoning can be applied to neural networks. The aim of this approach is to improve neural network frameworks by bringing in some advantages of fuzzy logic. The latter allows an explicit representation of the knowledge and has a logic structure that makes high-order processing easier to handle. In pattern recognition, for example, the learning scheme is efficient for acquiring knowledge about reference objects. As for the structure of approximate reasoning rules, it can give information on the knowledge distribution inside the whole network. It is then easier to find the internal parts that cause poor performance by analysing the error according to the rule structure [Tak90bis].
References
Fuzzy Sets and Systems, Theory and Applications
● [Dub80] D. Dubois & H. Prade, Fuzzy Sets and Systems, Theory and Applications, Academic Press, New York, 1980
● [God94] J. Godjevac, Introduction à la logique floue, Cours de perfectionnement, Institut suisse de pédagogie pour la formation
professionelle, EPFL, Lausanne, 1994
● [Ngu93] H.T. Nguyen, V. Kreinovich & D. Tolbert, On Robustness of Fuzzy Logics, IEEE Int. Conf. on Fuzzy Systems, Vol.1,
pp.543-547, San Francisco, Ca, USA, 1993
● [Ter92] T. Terano, K. Asai & M. Sugeno, Fuzzy Systems Theory and its Applications, 1st Ed., Academic Press, San Diego, 1992
Fuzzy Control
● [Büh94] H. Bühler, Réglage par logique floue, Presses Polytechniques Romandes, Lausanne, 1994
● [Dri93] D. Driankov, H. Hellendoorn & M. Reinfrank, An Introduction to Fuzzy Control, Springer-Verlag, Berlin Heidelberg,
1993
● [Gee92] H.P. Geering, Introduction to Fuzzy Control, Institut für Mess- und Regeltechnik, ETH, IMRT-Bericht Nr.24, Zürich,
1992
● [Mam74] E.H. Mamdani & S. Assilian, A Case Study on the Application of Fuzzy Set Theory to Automatic Control, pp.643-649,
● [Chan93] C.-H. Chang & J.-Y. Cheung, The Dimensions Effect of FAM Rule Table in Fuzzy PID Logic Control Systems, Second
IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.441-446, San Francisco, CA, USA, 1993
● [Gup88] Fuzzy Computing: Theory, Hardware, and Applications, M.M. Gupta & T. Yamakawa Eds., Elsevier Science Publishers
B.V. (North-Holland), 1988
● [Hir93] Industrial Applications of Fuzzy Technology, K. Hirota Ed., Tokyo, 1993
Digital Implementations
● [Eich92] H. Eichfeld, M. Löhner & M. Müller, Architecture of a CMOS Fuzzy Logic Controller with Optimized Memory
Organisation and Operator Design, Second IEEE Int. Conf. on Fuzzy Systems, pp.1317-1323, San Diego, CA, USA, March 1992
● [Eich93] H. Eichfeld, T. Künemund & M. Klimke, An 8b Fuzzy Coprocessor for Fuzzy Control, Proc. of the Int. Solid State
Circuits Conf.'93 Conference, San Francisco, CA, USA, Feb. 1993
● [Nak93] K. Nakamura, N. Sakeshita, Y. Nitta, K. Shimanura, T. Ohono, K. Egushi & T. Tokuda, Twelve Bit Resolution 200
kFLIPS Fuzzy Inference Processor, Proc. of the Int. Solid State Circuits Conf'93 Conference, San Francisco, CA, USA, Feb. 1993
● [Pag93] A. Pagni, R. Poluzzi & G. Rizzotto, Automatic Synthesis, Analysis and Implementation of a Fuzzy Controller, IEEE Int.
Conf. on Fuzzy Systems, Vol.1, pp.105-110, San Francisco, CA, USA, 1993
● [Pag93bis] A. Pagni, R. Poluzzi & G. Rizzotto, Integrated Development Environment for Fuzzy Logic Applications, Applications
of Fuzzy Logic Technology, B. Bosacchi & J.C. Bezdek Ed., pp.66-77, Boston, Massachusetts, USA, 1993
● [Pag93-3] A. Pagni, R. Poluzzi & G. Rizzotto, Fuzzy Logic Program at SGS-THOMSON, Applications of Fuzzy Logic
Technology, B. Bosacchi & J.C. Bezdek Ed., pp.80-90, Boston, Massachusetts, USA, 1993
● [Sas93] M. Sasaki, F. Ueno & T. Inoue, 7.5MFLIPS Fuzzy Microprocessor Using SIMD and Logic-in-Memory Structure, IEEE
Int. Conf. on Fuzzy Systems, Vol.1, pp.527-534, San Francisco, CA, USA, 8-10 Sept. 1993
● [Tog86] M. Togai & H. Watanabe, A VLSI Implementation of Fuzzy Inference Engine: toward an Expert System on a Chip,
Information Sciences, Vol.38, pp.147-163, 1986
● [Tog87] M. Togai & S. Chiu, A Fuzzy Chip and a Fuzzy Inference Accelerator for Real-Time Approximate Reasoning, Proc.
17th IEEE Int. Symp. Multiple-Valued Logic, pp.25-29, May 1987
● [Wat90] H. Watanabe, W.D. Dettloff & K.E. Yount, A VLSI Fuzzy Logic Controller with Reconfigurable Cascadable
Architecture, IEEE JSSC, Vol.25, No.2, pp.376-382, April 1990
● [Wat92] H. Watanabe, RISC Approach to Design of Fuzzy Processor Architecture, Second IEEE Int. Conf. on Fuzzy Systems,
pp.431-440, San Diego, CA, USA, March 1992
● [Wat93] H. Watanabe & D. Chen, Evaluation of Fuzzy Instructions in a RISC Processor, IEEE Int. Conf. on Fuzzy Systems,
Vol.1, pp.521-526, San Francisco, CA, USA, 1993
● [Hir89] K. Hirota & K. Ozawa, The Concept of Fuzzy Flip-Flop, IEEE Trans., SMC-19, No.5, pp.980-997, 1989
● [Tou94] C. Toumazou & T. Lande, Building Blocks for Fuzzy Processors, Voltage- and Current-Mode Min-Max circuits in
CMOS can operate from 3.3V Supply, IEEE Circuits & Devices, Vol.10, No.4, pp.48-50, July 1994
● [Yam88] T. Yamakawa, Fuzzy Microprocessors - Rule Chip and Defuzzifier Chip -&- How It Works -, Proc. Int. Workshop on
Fuzzy Syst. Appl., pp.51-52 & 79-87, Iizuka, Japan, Aug. 1988
● [Yam88bis] T. Yamakawa, High-Speed Fuzzy Controller Hardware System: The Mega Fips Machine, Information Sciences,
Vol.45, pp.113-128, 1988
● [Yam89] T. Yamakawa, An Application of a Grade-Controllable Membership Function Circuit to a Singleton-Consequent Fuzzy
Logic Controller, Proc. of the Third IFSA Congress, pp.296-302, Seattle, WA, USA, Aug. 6-11, 1989
● [Yam93] T. Yamakawa, A Fuzzy Inference Engine in Nonlinear Analog Mode and Its Application to a Fuzzy Logic Control,
IEEE Trans. on Neural Networks, Vol.4, No.3, May 1993
● [Chen92] J.-J. Chen, C.-C. Chen & H.-W. Tsao, Tunable Membership Function Circuit for Fuzzy Control Systems using CMOS
Technology, Electronics Letters, Vol.28, No.22, pp.2101-2103, Oct. 1992
● [Inou91] T. Inoue, F. Ueno, T. Motomura, R. Matsuo & O. Setoguchi, Analysis and Design of Analog CMOS Building Blocks
For Integrated Fuzzy Inference Circuits, Proc. IEEE International Symp. on Circuits and Systems, Vol.4, pp.2024-2027, 1991
● [Park86] C.-S. Park & R. Schaumann, A High-Frequency CMOS Linear Transconductance Element, IEEE Trans. on Circuits and
Systems, Vol.CAS- 33, No.11, Nov. 1986
● [Batu94] I. Baturone, J.L. Huertas, A. Barriga & S. Sanchez-Solano, Current-Mode Multiple-Input MAX Circuit, Electronics
Letters, Vol.30, No.9, pp.678-680, April 1994
● [Balt93] F. Balteanu, I. Opris & G. Kovacs, Current-Mode Fuzzy Memory Element, Electronics Letters, Vol.29, No.2, pp.236-
237, Jan. 1993
● [Ishi92] O. Ishizuka, K. Tanno, Z. Tang & H. Matsumoto, Design of a Fuzzy Controller with Normalization Circuits, Second
IEEE Int. Conf. on Fuzzy Systems, pp.1303-1308, San Diego, CA, USA, March 1992
● [Kett93] T. Kettner, C. Heite & K. Schumacher, Analog CMOS Realisation of Fuzzy Logic Membership Functions, IEEE Journal
of Solid-State Circuits, Vol.29, No.7, pp.857-861, July 1993
● [Land93] O. Landolt, Efficient Analog CMOS Implementation of Fuzzy Rules by Direct Synthesis of Multidimensional Fuzzy
Subspaces, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.453-458, San Francisco, CA, USA, 1993
● [Lem93] L. Lemaître, M.J. Patyra & D. Mlynek, Synthesis and Design Automation of Analog Fuzzy Logic VLSI Circuits, IEEE
Symp. on Multiple-Valued Logic, pp.74-79, Sacramento, CA, USA, May 1993
● [Lem93bis] L. Lemaître, M.J. Patyra & D. Mlynek, Fuzzy Logic Functions Synthesis: A CMOS Current Mirror Based Solution,
IEEE ISCAS93, Part 3 (of 4), pp.2015-2018, Chicago, IL, USA, 1993
● [Lem94] L. Lemaître, Theoretical Aspects of the VLSI Implementation of Fuzzy Algorithms, rapport de thèse nº1226,
Département d'Electricité, EPFL, Lausanne, 1994
● [Sas90] M. Sasaki, T. Inoue, Y. Shirai & F. Ueno, Fuzzy Multiple-Input Maximum and Minimum Circuits in Current Mode and
Their Analyses Using Bounded-Difference Equations, IEEE Trans. on Computers, Vol.39, No.6, pp.768-774, June 1990
● [Sas92] M. Sasaki, N. Ishikawa, F. Ueno & T. Inoue, Current-Mode Analog Fuzzy Hardware with Voltage Input Interface and
Normalization Locked Loop, Second IEEE Int. Conf. on Fuzzy Systems, pp.451-457, San Diego, CA, USA, March 1992
● [Yam86] T. Yamakawa & T. Miki, The Current Mode Fuzzy Logic Integrated Circuits Fabricated by the Standard CMOS
Process, Reprint of IEEE Trans. on Computers, Vol. C-35, No.2, pp.161-167, Feb. 1986
● [Yam87] T. Yamakawa, Fuzzy Logic Circuits in Current Mode, Analysis of Fuzzy Information, James C. Bezdek Ed., Vol.1,
pp.241-262, CRC Press, Boca Raton, Florida, 1987
● [Yam88-3] T. Yamakawa, H. Kabuo, A Programmable Fuzzifier Integrated Circuit - Synthesis, Design, and Fabrication,
Information Sciences, Vol.45, pp.75-112, 1988
● [Zhij90] L. Zhijian & J. Hong, CMOS Fuzzy Logic Circuits in Current-Mode toward Large Scale Integration, Proc. Int. Conf. on
Fuzzy Logic & Neural Networks, pp.155-158, Iizuka, Japan, July 20-24, 1990
● [Zhij90bis] L. Zhijian & J. Hong, A CMOS Current-Mode, High Speed Fuzzy Logic Microprocessor for a Real-Time Expert
System, IEEE Proc. of the 20th Int. Symp. on MVL, Charlotte, NC, USA, May 23-25, 1990
● [Fatt93] J. Fattarasu, S.S. Mahant-Shetti & J.B. Barton, A Fuzzy Inference Processor, 1993 Symposium on VLSI Circuits, Digest
of Technical Papers, pp.33-34, Kyoto, Japan, May 19-21, 1993
Implementation on FPGA
● [Manz92] M.A. Manzoul & D. Jayabharathi, Fuzzy Controller on FPGA Chip, IEEE Int. Conf. on Fuzzy Systems, pp.1309-1316,
San Diego, Ca, USA, Mar. 8-12, 1992
● [Ung93] A.P. Ungering, K. Thuener & K. Goser, Architecture of a PDM VLSI Fuzzy Logic Controller with Pipelining and
Optimized Chip Area, IEEE Int. Conf. on Fuzzy Systems, Vol.1, pp.447-452, San Francisco, Ca, USA, 1993
● [Yam92] T. Yamakawa, A Fuzzy Programmable Logic Array (Fuzzy PLA), IEEE Int. Conf. on Fuzzy Systems, pp.459-465, San
Diego, Ca, USA, Mar. 8-12, 1992
Technological Aspects
● [Pat90] M.J. Patyra, Fuzzy Properties of IC Subcircuits, Int. Conf. on Fuzzy Logic and Neural Networks, Iizuka, Japan, 1990
● [Oeh92] J. Oehm & K. Schumacher, Accuracy Optimization of Analog Fuzzy Circuitry in Network Analysis Environment,
ESSCIRC, 1992
● [Ber90] H. Berenji, Neural Networks and Fuzzy Logic in Intelligent Control, IEEE, 1990
● [God93] J. Godjevac, State of the Art in the Neuro Fuzzy Field, Technical Report No.93.25, Laboratoire de Microinformatique,
DI, EPFL, Lausanne, 1993
● [Jan92] J.-S. Roger Jang, ANFIS, Adaptive-Network-Based Fuzzy Inference Systems, IEEE Trans. on Systems, Man and
Cybernetics, 1992
● [Kos92] B. Kosko, Neural Networks and Fuzzy Systems, Prentice-Hall International, Inc., Englewood Cliffs N.J., 1992
● [Tak90] H. Takagi, Fusion Technology of Fuzzy Theory and Neural Networks, Survey and Future Directions, Proc. of the Int.
Conf. on Fuzzy Logic & Neural Networks, Vol.1, pp.13-26, Iizuka, Japan, July 1990
● [Tak90bis] H. Takagi, T. Kouda & Y. Kojima, Neural Network Designed on Approximate Reasoning Architecture and Its
Application to Pattern Recognition, Proc. of the Int. Conf. on Fuzzy Logic & Neural Networks, Vol.2, pp.671-674, Iizuka,
Japan, July 1990
Chapter 10
VLSI FOR MULTIMEDIA APPLICATIONS
Case Study: Digital TV
I. Introduction
II. Digitization of "TV Functions"
III. Points of Concern for the Design Methodology
IV. Conclusion
Today there is a race to design interoperable video systems for basic digital computer functions, involving multimedia applications in areas such as media information, education, medicine and entertainment, to name but a few. This chapter provides an overview of the current status of digitized television in industry, including the techniques used and their limitations, technological concerns and the design methodologies needed to achieve the goals of highly integrated systems. Digital TV functions can be optimized for encoding and decoding and implemented in silicon in a more dedicated way, using a kind of automated custom design approach that still allows enough flexibility.
I. Introduction
When, at the 1981 Berlin Radio and TV Exhibition, the ITT Intermetall company exhibited to the public for the first time a digital
television VLSI concept [1], [2], opinions among experts were by no means unanimously favourable. Some were enthusiastic, while
others doubted the technological and economic feasibility. Today, after 13 years, more than 30 million TV sets worldwide have already
been equipped with this system. Today, the intensive use of VLSI chips needs no particular justification, the main reasons being increased reliability (notably the long-term stability of the color reproduction brought about by digital systems) and medium- and long-term cost advantages in manufacturing, which are essential for ensuring international competitiveness.
Digital signal processing permits solutions that guarantee a high degree of compatibility with future developments, whether in terms of
quality improvements or new features like intermediate picture storage or adaptive comb filtering for example. In addition to these
benefits, a digital system offers a number of advantages with regard to the production of TV sets:
- Digital circuits are tolerance-free and are not subject to drift or aging phenomena. These well-known properties of digital technology
considerably simplify factory tuning of the sets and even permit fully automated, computer-controlled tuning.
- Digital components can be programmable. This means that the level of user convenience and the features offered by the set can be
tailored to the manufacturer's individual requirements via the software.
- A digital system is inherently modular with a standard circuit architecture. All the chips in a given system are compatible with each
other so that TV models of various specifications, from the low-cost basic model to the multi-standard satellite receiver, can be built
with a host of additional quality and performance features.
- Modular construction means that set assembly can be fully automated as well. Together with automatic tuning, the production process
can be greatly simplified and accelerated.
Macro-function Processing
The modular design of digital TV systems is reflected in its subdivision into largely independent functional blocks, with the possibility of having special data-bus structures. It is useful to divide the structure into a data-oriented flow and a control-oriented flow, so that we have four main groups of components:
1.- The control unit and peripherals, based on well-known microprocessor structures, with a central communication bus for flexibility and ease of use. An arrangement around a central bus makes it possible to expand the system easily and thereby add further quality-enhancing and special functions for picture, text and/or sound processing at no great expense. A non-volatile storage element, in which the factory settings are stored, is associated with this control processor.
2.- The video functions, mainly the video signal processing plus some additional features such as deflection; a detailed description follows below. The key point for VLSI implementations is a well-organized definition of the macro-blocks. This serves to facilitate the interconnection of circuit components and minimizes power consumption, which can be considerable at the processor speeds needed.
3.- The digital concept facilitates the decoding of today's new digital sound broadcasting standards as well as the input of external signal sources, such as Digital Audio Tape (DAT) and Compact Disc (CD). Programmability permits mono, stereo and multilingual broadcasts; compatibility with the other functions in the TV system is resolved through the common communication bus. This leads us to part two, which is dedicated to the description of this problem.
4.- With a digital system, it is possible to add special or quality-enhancing functions simply by incorporating a single additional macro-function or chip. Therefore, standards are no longer so important, due to the high level of adaptability of digital solutions. For example, adaptation to a 16:9 picture tube is easy.
Figure 1 shows the computation power needed for typical consumer goods applications. Notice from the figure that when the data
changes at a frequency x, a digital treatment of that data must be an order of magnitude faster [3].
In this chapter we first discuss the digitization of TV functions by analyzing general concepts based on existing systems. The second section deals with silicon technologies and, in particular, design methodology concerns. The intensive use of submicron technologies, associated with fast on-chip clock frequencies and huge numbers of transistors on the same substrate, affects traditional methods of designing chips. As this chapter only outlines a general approach to VLSI integration techniques for digital TV, interested readers will find more detailed descriptions of VLSI design methodologies and realizations in [9], [13], [15], [24], [26], [27], [28].
The idea of digitizing TV functions is not new. At the time some companies started working on it, silicon technology was not really adequate for the required computing power, so the most effective solutions were full-custom designs. This forced the block-oriented architecture, in which each digital function introduced was a one-to-one replacement of an existing analog function. Figure 2 gives a simplified representation of the general concept.
The natural separation of video and audio resulted in some incompatibilities and duplication of primary functions. The transmission principle is not changed, and redundancy is a big handicap: while a SECAM channel is running, for example, the PAL functions are not in operation. New generations of digital TV systems should re-think the whole concept top-down before VLSI system partitioning.
In today's state-of-the-art solution one can recognize all the basic functions of the analog TV set, with, however, a modularity in the concept that permits additional features. Some special digital possibilities are exploited, e.g. storage and filtering techniques, to improve signal reproduction (adaptive filtering, 100 Hz technology), to integrate special functions (picture-in-picture, zoom, still picture) or to receive digital broadcasting standards (MAC, NICAM). Figure 3 shows the ITT Semiconductors solution, which was the first on the market in 1983 [4].
By its very nature, computer technology is digital, while consumer electronics are geared to the analog world. Only recently have starts been made to digitize TV and radio broadcasts at the transmitter end (in the form of DAB, DSR, D2-MAC, NICAM, etc.). The most difficult technical tasks involved in the integration of different media are interface matching and data compression [5].
After this second step in the integration of multimedia signals, an attempt was made towards standardization, namely the integration of 16 identical high-speed processors with communication and programmability concepts comprised in the architecture (see Figure 4, chip photograph courtesy of ITT Semiconductors).
Many solutions proposed today (mainly for MPEG-1) are derived from microprocessor architectures or DSPs, but there is a gap between today's circuits and the functions needed for a true, fully HDTV system. The AT&T hybrid codec [29], for instance, introduces a new way to design multimedia chips by optimizing the cost of the equipment considering both processing and memory requirements. Pirsch [6] gives a detailed description of today's digital principles and circuit integration. Other component manufacturers also provide different solutions for VLSI system integration [35][36][37][38][39][40]. In part IV of this paper a full HDTV system based on wavelet transforms is described. The concept is to provide generic architectures that can be applied to a wide variety of systems, taking into account that certain functions have to be optimized and that some other complex algorithms have to be ported to generic processors.
Compression methods take advantage of both data redundancy and the non-linearity of human vision. They exploit correlation in space
for still images and in both space and time for video signals. Compression in space is known as intra-frame compression, while
compression in time is called inter-frame compression. Generally, methods that achieve high compression ratios (10:1 to 50:1 for still
images and 50:1 to 200:1 for video) use data approximations which lead to a reconstructed image not identical to the original.
Methods that cause no loss of data do exist, but their compression ratios are lower (no better than 3:1). Such techniques are used only in
sensitive applications such as medical imaging. For example, artifacts introduced by a lossy algorithm into an X-ray radiograph may
cause an incorrect interpretation and alter the diagnosis of a medical condition. Conversely, for commercial, industrial and consumer
applications, lossy algorithms are preferred because they save storage and communication bandwidth.
Lossy algorithms also generally exploit aspects of the human visual system. For instance, the eye is much more receptive to fine detail in the luminance (or brightness) signal than in the chrominance (or color) signals. Consequently, the luminance signal is usually sampled at a higher spatial resolution. Second, the encoded representation of the luminance signal is assigned more bits (a higher dynamic range) than the chrominance signals. The eye is also less sensitive to energy with high spatial frequency than with low spatial frequency [7]. Indeed, if the image on a personal computer monitor were formed by an alternating spatial signal of black and white, the human viewer would see a uniform gray instead of the alternating checkerboard pattern. This deficiency is exploited by coding the high-frequency coefficients with fewer bits and the low-frequency coefficients with more bits.
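A rough bit budget makes the first of these points concrete: sampling chrominance at quarter resolution (standard 4:2:0 subsampling) halves the raw frame size before any coding. The frame size and bit depth below are illustrative assumptions:

```python
# Raw bit budget for one frame: full-resolution luminance (Y) plus
# either full-resolution (4:4:4) or quarter-resolution (4:2:0)
# chrominance (Cb, Cr). A CCIR-601-like frame is assumed.

W, H, BITS = 720, 576, 8          # samples per line, lines, bits/sample

full = 3 * W * H * BITS                           # 4:4:4
sub = (W * H + 2 * (W // 2) * (H // 2)) * BITS    # 4:2:0

print(full / 8e6, "MB vs", sub / 8e6, "MB")       # a 2:1 saving for free
```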
All these techniques add up to powerful compression algorithms. In many subjective tests, reconstructed images that were encoded with
a 20:1 compression ratio are hard to distinguish from the original. Video data, even after compression at ratios of 100:1, can be
decompressed with close to analog videotape quality.
A lack of open standards could slow the growth of this technology and its applications. That is why several digital video standards have been proposed:
● H.261, at p times 64 kbit/s, was proposed by the CCITT (Consultative Committee on International Telephony and Telegraphy) for teleconferencing
● MPEG-1 (Motion Picture Expert Group), up to 1.5 Mbit/s, was proposed for full-motion compression on digital storage media
● MPEG-2 was proposed for digital TV compression; the bandwidth depends on the chosen level and profile [33].
Another standard, MPEG-4, for very low bit-rate coding (4 kbit/s up to 64 kbit/s), is currently being debated.
For more detail concerning the different standards and their definitions, please see the paper included in these Proceedings: "Digital Video Coding Standards and their Role in Video Communication", by R. Schäfer and T. Siroka.
As stated above, the main idea is to think system-wise through the whole process of development; in doing so, we had to select a suitable architecture as a demonstrator for this coherent design methodology. It makes no sense to reproduce existing concepts or VLSI chips; we therefore focused our demonstrator on the subband coding principle, of which the DCT is only a particular case. Following this line, there is no point in focusing on blocks only, considering just the motion problem to solve; rather, the entire screen should be considered in a first global approach. This gives us the ability to define macro-functions which are not restricted in their design limits; the only restrictions will come from practical parameters, such as block area or processing speed, which depend on the technology selected for developing the chips but not on the architecture or the specific functionality.
Before going into the details of the system architecture, we would like to discuss in this section the main design-related and technology-dependent factors which will influence the hardware design process and the use of some CAD tools. We propose a list of the major concerns one should consider when setting out to integrate digital TV functions. The purpose is to give a feeling for the decision process in the management of such a project. We first discuss RLC effects, down scaling, power management, requirements for the process technology, and design effects such as parallelism and logic styles, and we conclude the section with some criteria for the proposed methodology.
R,L,C effects
In computer systems today, the clocks that drive ICs are increasingly fast; 100 MHz is a "standard" clocking frequency, and several chip and system manufacturers are already working on microprocessors with GHz clocks. By the end of this decade, maybe earlier, Digital Equipment will provide a 1-2 GHz version of the Alpha AXP chip, Intel promises faster Pentiums, and Fujitsu will have a 1-GHz Sparc processor.
When working frequencies are so high, the wires that connect circuit boards and modules, and even the wires inside integrated circuits, start behaving like transmission lines. New analysis tools become necessary to circumvent and master these high-speed effects.
As long as the electrical connections are short and clock rates low, the wires can be modeled as RC circuits. In such cases, the designer has to ensure that rise and fall times are sufficiently short with respect to the internal clock frequency. This method is still used in fast clock-rate designs by deriving clock trees to maintain good signal synchronization on big chips or boards. However, when wire lengths increase, their inductance starts to play a major role. This is what transmission line theory deals with. Transmission line effects include reflections, overshoot, undershoot, and crosstalk. RLC effects have to be analyzed in a first step, but it might be necessary to use another circuit analysis to gain better insight into circuit behaviour. The interconnect will behave like a distributed transmission line and coupled lines (electromagnetic characteristics also have to be taken into account). Transmission line effects can also appear in low clock-rate systems. Take for example a 1 MHz system with a rise time of 1 ns. Capacitive loading will be dictated by the timings, but during the transition time, reflections and ringing will occur, causing false triggering of other circuits. As a rule of thumb, high-speed design techniques should be applied when the propagation delay in the interconnect reaches 20-25% of the signal rise and fall time [30], [34].
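Applied to the 1 ns edge of the example above, this rule of thumb gives a surprisingly short critical wire length. The propagation velocity assumed below (an FR-4-like dielectric) is an illustration, not a value from the text:

```python
# Rule-of-thumb check: treat a wire as a transmission line once its
# propagation delay exceeds ~20-25% of the signal rise time.

C0 = 3e8                      # free-space speed of light, m/s
ER_EFF = 4.0                  # assumed effective dielectric constant
v = C0 / ER_EFF ** 0.5        # propagation velocity, ~1.5e8 m/s

def critical_length(t_rise, fraction=0.2):
    """Longest wire still safely treatable as a lumped RC load."""
    return fraction * t_rise * v

print(critical_length(1e-9))  # ~0.03 m: 3 cm, even in a 1 MHz system
```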
The main causes of such high-speed problems are the following:
1. A short transition time compared to the total clock period. The effect was described above.
2. Inaccurate semiconductor models. It is important to take into account the physical, electrical and electromagnetic
characteristics of the connections and the transistors. Important electrical parameters are metal-line resistivity and skin depth,
dielectric constant and dielectric loss factor.
3. Inappropriate geometry of the interconnect. Width, spacing and thickness of the lines and the thickness of the dielectric are of
real importance.
4. Lack of a good ground, often a key problem. Inductance often exists between virtual and real ground due, for instance, to
interconnect and lead inductance.
A solution to these problems could be a higher degree of integration, reducing the number of long wires. MCM (Multi-Chip Module) technology is an example of this alternative. MCM simplifies the component, improves the yield and shrinks the footprint. Nevertheless, the word alternative is not entirely correct, since MCM eliminates a certain type of problem by replacing it with another. The narrowness of the wires introduced with MCMs tends to engender significant losses due to resistance and the skin effect. Furthermore, due to the via structure, the effect of crossover and via parasitics is stronger than in traditional board design. Finally, ground plane effects need special study, since severe ground bounce and an effective shift in the threshold voltage can result from the combination of high current densities with the thinner metallizations used for power distribution.
How then does one get a reliable high-speed design? The best way is to analyze the circuit as deeply as possible. The problem here is that circuit analysis is usually very costly in CPU time. Circuit analysis can be carried out in steps: first EM and EMI analysis; then, according to the component models available in the database, electrical analysis can be performed using two general approaches: one relies on versions of Spice, the other on direct-solution methods using fixed time increments.
EM (electromagnetic field, or Maxwell's-equations solver) and EMI (electromagnetic interference) analyzers scan the layout database for unique interconnect and coupling structures and discontinuities. EM field solvers then use the layout information to solve Maxwell's equations by numerical methods. Inputs to the solver include physical data about the printed circuit or multichip module, such as the layer stack-up dielectrics and their thicknesses, the placement of power and ground layers, and the interconnect metal width and spacing. The output is a mathematical representation of these electrical properties. In this way, field solvers analyze the structure in two, two-and-a-half or three dimensions. In choosing among the variety of field solvers, the complexity of the structure and the accuracy of the computation must be weighed against the performance and computational cost.
Electrical models are important for analysis tools, and they can be generated automatically from measurements in the time domain or from mathematical equations.
Finally, the time complexity of solving the system matrices that represent a large electrical network is an order of magnitude larger for an electrical simulator like Spice than for a digital simulator.
Down Scaling
As CMOS devices scale down into the submicron region, the intrinsic speed of the transistors increases (frequencies between 60 and
100 MHz are common). The transients are also reduced, so that the increase in output switching speed increases the rate of change of
switching current (di/dt). Due to parallelization, the simultaneous switching of the I/O's creates a so called simultaneous switching noise
(SSN), also known as Delta-I noise or ground bounce [8]. It is important that SSN be limited within a maximum allowable level to avoid
spurious errors like false triggering, double clocking or missing clock pulses. The output driver design is now no longer trivial, and
techniques like current controlled circuits or controlled slew rate driver designs are used to minimize the effect of the switching noise
[8]. An in-depth analysis of the chip-package interface is required to ensure the functionality of high-speed chips (Figure 5).
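A first-order estimate of this ground bounce uses the usual V = L·di/dt model; all package and driver numbers below are illustrative assumptions:

```python
# Simultaneous-switching-noise estimate: V = n * L * di/dt for n
# output drivers sharing one ground lead. All values are assumptions.

L_LEAD = 5e-9        # package lead + bond-wire inductance, H
N_DRIVERS = 16       # outputs switching at the same time
DI = 20e-3           # current swing per driver, A
DT = 1e-9            # current transition time, s

v_bounce = N_DRIVERS * L_LEAD * DI / DT
print(v_bounce, "V")   # 1.6 V of ground bounce: enough to false-trigger
```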
The lesson is that some important parts of each digital submicron chip have to be considered as working in analog rather than digital mode. This applies not only to the I/Os but also to the timing and clocking blocks in a system. The entire floor plan has to be analyzed in the context of noise immunity, parasitics, and also propagation and reflection in the buses and main communication lines. Our idea was to reduce the number of primitive cells and to structure the layout in such a way as to be able to use common software tools for the electrical optimization of the interconnections (abutment optimization). Down scaling of the silicon technology is a common way today of obtaining a new technology in a short time and remaining competitive in the digital market, but this shrinking is only linear in x and y (with some differences and restrictions, vias for example). The third dimension is not shrunk linearly, for technical and physical reasons. The designer has to make sure that the models describing the devices and the parasitics are valid for the technology considered.
Power Management
As chips grow in size and speed, power consumption increases drastically. The current demand for portable consumer products implies that power consumption must be controlled at the same time as complex user interfaces and multimedia applications are driving up computational complexity. But there are limits to how much power can be slashed for analog and digital circuits. In analog circuits, a desired signal-to-noise ratio must be maintained, while for digital ICs the lower power limit is set by cycle time, operating voltage and circuit capacitance [9].
A smaller supply voltage is not the only way to reduce power in digital circuits. Minimizing the number of device transitions needed to perform a given function, local suppression of the clock, reduction of the clock frequency, and elimination of the system clock in favor of self-timed modules are other means of reducing power. This means that for cell-based design technology there is a crucial need to design the cell library to minimize energy consumption. There are various ways to reduce the energy of each transition, which is proportional to the capacitance and to the square of the supply voltage (E = CV²). Capacitances are being reduced along with the feature size in scaled-down processes, but this reduction is not linear. With the appropriate use of design techniques, like minimization of the interconnections or use of abutment or optimum placement, it is possible to reduce the capacitances in a more effective way. So what are the main techniques used to decrease the power consumption? Decreasing the frequency, the size and the power supply. Technology has evolved to 3.3 V processes in production, and most current processors take advantage of this progress. Nevertheless, reducing the number of transistors and the operating frequency cannot be done in so brutal a manner, so trade-offs have to be found. Let us bring some insight into power management by looking at different approaches found in actual products; a wise combination of these approaches may eventually lead to new methods.
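The quadratic voltage dependence can be made concrete with the usual average switching-power expression P = a·C·V²·f; the capacitance, frequency and activity factor below are assumed, illustrative values:

```python
# Dynamic energy per transition E = C * V^2, hence average switching
# power P = a * C * V^2 * f, with a the switching activity factor.

def switching_power(C, V, f, activity=0.5):
    return activity * C * V**2 * f

C, f = 50e-12, 100e6                 # 50 pF switched, 100 MHz clock
print(switching_power(C, 5.0, f))    # 5 V supply: 62.5 mW
print(switching_power(C, 3.3, f))    # 3.3 V supply: ~27 mW, a ~2.3x gain
```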
The MicroSparcII uses a 3.3 V power supply and fully static logic. It can cut power to the caches by 75% when they are not being accessed, and in standby mode it can stop the clock to all logic blocks. At 85 MHz it is expected to consume about 5 W.
Motorola and IBM had the goal of providing high performance while consuming little power when they produced the PowerPC603. Using a 3.3 V supply, a 0.5-micron CMOS process with four metal levels, and static design technology, the chip is smaller than its predecessor (85.5 mm² in 0.5 micron for 1.6 million transistors instead of 132 mm² in 0.6 micron for 2.8 million transistors). The new load/store unit and the SRU (System Register Unit) are used to implement dynamic power management, with a maximum of 3 W power consumption at 80 MHz. But a lot more can be expected from a reduction of the voltage swing, on buses for example, or on memory bit lines. To achieve a reasonable operating power for a VLSI chip it is necessary to decrease drastically the power consumption of the internal bus drivers; a circuit technology with a reduced voltage swing for the internal buses is a good solution. Nakagome [10] proposed a new internal bus architecture that reduces operating power by suppressing the bus signal swing to less than 1 V; this architecture can achieve a low power dissipation while maintaining high speed and low standby current. This principle is shown in Figure 6.
An electrothermal analysis of the IC will show the non-homogeneous local power dissipation. This helps to avoid hot-spots in the chip itself (or in a multichip solution) and to secure good yield, since the failure rate of microelectronic devices doubles for every 10 °C increase in temperature. To optimize both long-term reliability and performance, it has become essential to perform both thermal and electrothermal simulations during chip design. For example, undesirable thermal feedback due to temperature gradients across a chip degrades the performance of electrical circuits such as reduced-swing bus drivers or mixed analog-digital components, where matching and symmetry are important parameters [11].
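Taken at face value, this doubling rule gives a quick feel for what a hot-spot costs in reliability:

```python
# Rule of thumb from the text: failure rate doubles every 10 deg C.

def failure_rate_factor(delta_t_celsius):
    return 2.0 ** (delta_t_celsius / 10.0)

print(failure_rate_factor(25))   # a 25 deg C hot-spot: ~5.7x failure rate
```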
Reducing chip power consumption is not the only issue. When targeting low system cost and power consumption, it becomes interesting to include a PLL (phase-locked loop), allowing the processor to run at higher frequencies than the main system clock. By multiplying the system clock by 1, 2, 3 or 4, the PowerPC603 operates properly even when a slower system clock is used, and three software-controllable modes can be selected to reduce power consumption: most of the processor can be switched off, only the bus snooping is disabled, or the time-base register is switched off. It naturally takes some clock cycles to bring the processor back into full-power mode. Dynamic power management is also used to switch off only certain processor subsystems, and even the cache protocol has been reduced from four to three states, while remaining compatible with the previous one.
Silicon Graphics' goal has been to maintain RISC performance at a reduced price. Being neither superscalar nor super-pipelined (only 5 pipeline stages), its processor combines the integer and floating-point units into a single unit. The result is degraded performance but a big saving in the number of transistors. It can also power down unused execution units; this is perhaps even more necessary since dynamic logic is used. The chip should typically draw about 1.5 W. Table I lists four RISC chips competing with the top end of the 80x86 microprocessor line. What is drawn from these CPU considerations is also applicable to television systems. The requirements of compression algorithms and their pre- and post-processing lead to system sizes very similar to those of computer workstations. Our methodology was to reduce the size of the primitive cells of the library by using optimization software developed in-house.
Table I Four RISC chips competing with the top-end of the 80x86 line.
It is important to note that today's process technologies are not adapted to the new tasks in the consumer field: low power, high speed, and huge amounts of data. Historically, most progress was made on memory processes, because of the potential business and the real need for storage ever since the microprocessor architecture has existed. More or less all the so-called ASIC process technologies have been extrapolated from the original DRAM technologies, with some drastic simplifications because of yield sensitivity. Now the new algorithms for multimedia applications require parallel architectures and, because of the computation needs, local memorization, which means a drastic increase in interconnections. New ways to improve the efficiency of designs lie in improving the technologies, not only by shrinking the features linearly or decreasing the supply voltage, but also by providing the possibility of storing the information at the place where it is needed, avoiding the interconnection. This could be done by using floating-gate or, better, ferroelectric memories. This material allows memorization on a small capacitance placed on top of the flip-flop which generates the data to be memorized; in addition, the information is not destroyed and the material is not sensitive to radiation. Another way is the use of SOI (Silicon On Insulator). In this case the parasitic capacitances of the active devices are reduced to nearly zero, so that it is possible to work with very small feature sizes (0.1 µm to 0.3 µm) and to achieve very high speed at very low power consumption [12].
Another concern is the multilayer interconnect. Due to the ASIC-oriented methodology it was useful to have more than one metal interconnection layer; this permits the so-called prediffused wafer techniques (like gate arrays or sea-of-gates). Software tools developed for this methodology enabled users to use an automatic router. The bad news for high-speed circuits is that the wires are routed in a non-predictive way, so that their lengths are often not compatible with the speed requirements. It was demonstrated long ago that two interconnection layers are sufficient to solve any routing problem for digital circuits; one of these could also be in polysilicon or, better, salicide material, so that only one metallization is really needed for high-speed digital circuits, with perhaps another for the power supply and the main clock system. If the designer uses power minimization techniques for the basic layout cells and takes the poly layer into account for cell-to-cell interconnections, the reduction in power consumption will be significant, mainly due to the reduction of the cell size.
Effects of parallelism
From the power-management point of view, it is interesting to notice that for a CMOS gate the delay is approximately inversely
proportional to the supply voltage. Therefore, to maintain the same operational frequency, a reduction of supply voltage (for power saving) must be
compensated for by computing n functions in parallel, each of them operating n times slower. This parallelism is inherent in some tasks
like image processing. Bit-parallelism is the most immediate approach; pipelining and systolic arrays are others. The good news is that
they do not need much overhead for communication and control. The bad news is that they are not applicable when increasing the
latency is not acceptable, for example when loops are required in an algorithm. If multiprocessors are used for more general
computation, the circuit overhead for control and communication tasks grows more than linearly, and the overall speed of the
chip slows down by several orders of magnitude; this is, by the way, the handicap of standard DSPs applied to multimedia tasks.
The heavy use of parallelism also implies a need for on-chip storage or, if memory blocks are off-chip, an increase in wires, which
means more power and less speed.
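As a rough numeric illustration of this trade-off, here is a minimal sketch with normalized, invented values, using the first-order relations P ≈ C·Vdd²·f for dynamic power and delay ≈ 1/Vdd:

# Normalized sketch: n parallel units, each at Vdd/n and f/n, keep the
# throughput of one unit at (Vdd, f) while cutting power by about n^2.
C = 1.0                 # switched capacitance of one unit (normalized)
Vdd, f = 1.0, 1.0       # supply voltage and clock frequency (normalized)

P_ref = C * Vdd**2 * f                  # single unit, full voltage
n = 2                                   # degree of parallelism
P_par = n * C * (Vdd / n)**2 * (f / n)  # n slower units at reduced supply

print(P_ref, P_par)     # 1.0 vs 0.25: same throughput, ~4x less power
# The estimate ignores the control/communication overhead discussed in the text.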
Logic styles
The architectures of the developed systems are usually described at a high level to ensure correct functionality. These descriptions cannot
generally take into account low-level considerations such as logic design. Usually, C descriptions are converted into
VHDL code. Such code is then remodeled into more structured VHDL descriptions, known as RTL (Register Transfer Level)
descriptions. This new model of the given circuit or system is then mapped onto standard-cell CMOS libraries. Finally, the layout
is generated. In such a process, the designer faces successive operations where the only real decisions are made at the highest level of
abstraction. Even with the freedom to change or influence the RTL model, the lowest level of abstraction, i.e. the logic style, will
not be influenced. It is fixed by the use of a standard library, and is rarely different from the pure CMOS style of design.
Nevertheless, the fact that the logic style is frozen can lead to some aberrations, or at least to certain difficulties, when designing a VLSI
system. Clock loads may be too high, pipelines may not be easy to manage, and special tricks have to be used to meet the gate delays.
Particular circuitry has to be developed for system clock management. One way to circumvent clock-generation units is to use only
one clock. The TSPC (True Single Phase Clock) technique [13] operates with only one clock. It is particularly suited to fast pipelined
circuits when correctly sized, at a non-prohibitive cell area.
Other enhancements
Among the whole plethora of logic families (pure CMOS, Pseudo-NMOS, Dynamic CMOS, Cascade Voltage Switch Logic, Pass Transistor,
etc.), it is not possible to obtain excellent speed performance with minimal gate size; there is always a critical path to optimize. Piguet
[14] introduces a technique to minimize the critical path of basic cells. All cells are made exclusively of branches that contain one or
several transistors in series connected between a power line and a logical node. Piguet demonstrates that any logical function can be
implemented with branches alone. The result is that, in general, the number of transistors is greater than for conventional schematics.
However, it shows that by decomposing complex cells into several simple cells, the best schematics can be found in terms of speed,
power consumption and chip area. This concept of minimizing the number of transistors between a power node and a logical node
is used in our approach.
Asynchronous design also tends to speed up systems. The global clocking strategy is abandoned in favour of local handshakes, which let a
wave of signals propagate as soon as they are available. The general drawbacks of this technique are the area overhead and the
special care required to avoid races and hazards [15], [16]. It is also necessary
to carry the clocking information with the data, which increases the number of communication lines. Finally, detecting state transitions
requires an additional delay, even if it can be kept to a minimum.
Redundancy of information enables interesting realizations [16] in asynchronous and synchronous designs. This technique consists of
creating additional information that allows the number representation used in the calculation to be chosen so as to minimize
delay. Avizienis [17] introduced this field and research has continued in this subject [18], [19]. It is not difficult to convert the
binary representation into the redundant one, though the reverse is more complex: while there is no carry propagation in such a
representation, the conversion from redundant binary back into a binary number must "absorb" the carry propagation [20]. Enhancement
can also be obtained by changing the technology. Implementation in BiCMOS or GaAs [21] will also allow better performance than
pure CMOS, but the trade-off of price versus performance has to be studied carefully before making any such decision. Three-dimensional
design of physical transistors could also be a way to enhance a system: the capacitive load could be decreased and the density
increased, but such methods are not yet reliable [22].
Design Methodology
The aim of the proposed methodology is to show a simple and powerful way for very high speed ASIC implementations in the
framework of algorithms for image compression as well as pre- and post-processing.
• Generality: The approach used should be as general as possible without making the implementation methodology too complex. The
range of applications covered by the strategy should be as extensive as possible, concerning both speed and types of algorithms
implemented.
• Flexibility: Here we consider the flexibility for a given task. For instance, if a processing element is designed, it is essential that
different solutions can be generated, all with slightly different parameters: speed, word length, accuracy, etc.
• Efficiency: This is an indispensable goal and implies not only efficiency of the algorithm used, but also efficiency in performing the
design task itself. Efficiency is commonly measured as the performance of the chip compared to its relative cost.
• Simplicity: This point goes hand in hand with the previous one. By using simple procedures, or simple macro blocks, the design task is
simplified as well. Restrictions will occur, but if the strategy itself is well structured, it will also be simple to use.
• CAD portability: The methodology must be fully supported by CAD tools. A design and implementation approach that is
not supported by a CAD environment cannot claim to conform to the points given above. The methodology must be defined such that it
is feasible and simple to introduce its elements into these tools. It is therefore important that existing CAD tools and systems can adopt and
incorporate the concepts developed earlier.
ASICs are desirable for their high potential performance, their reliability and their suitability for high-volume production. On the other
hand, considering the complexity of development and design, micro- or DSP-processor-based implementations usually represent
cheaper solutions. However, the performance of the system is the decisive factor here. For consumer applications, cost is generally
defined as a measure of the required chip area; this is the most common and important factor. Other cost measures take into
account the design time and the testing and verification of the chip: complex chips cannot be redesigned several times. Avoiding redesigns
reduces the time-to-market and gives the opportunity to adapt to evolving tools and follow the technology. Some physical constraints can also be
imposed on the architecture, such as power dissipation, reliability under radiation, etc. Modularity and regularity are two additional
factors that improve the cost and flexibility of a design (it is also much simpler to link such architectures with CAD tools).
The different points developed above were intended to show, in a general way, the complexity of designing modern systems. For
this reason we focused on several sensitive problems of VLSI development. Today the design methodology is driven too much by
the final product. This is usually justified by historical reasons and by the parallel development of CAD tools and technology processes.
The complexity of the tools inhibits the methodology needed for modern system requirements. To prove the feasibility of concurrent
engineering with the present CAD tools, the natural approach is a reuse policy: to reduce the development time, one
reuses already existing designs and architectures that are not necessarily adapted to the needs of future systems. This behaviour is led only by
the commercial constraint of selling the already possessed product slightly modified.
In contrast, the EPFL solution presents a global approach to a complex system (from low bit-rate to HDTV) using a design
methodology that takes into account the requirements mentioned above. It shows that architectural bottlenecks are removed if
powerful macrocells and macrofunctions are developed. Several functions have been hardwired, but libraries of powerful macrocells
are not enough; the remaining problem is the complex control of these functions and the data bandwidth. That is why a certain balance
between hard and soft functions must be found. System analysis and optimization tools are needed to achieve this goal. We have
developed software tools enabling fast and easy system analysis by giving the optimal configuration of the architecture for a given
function. This tool takes into account functionality, power consumption and area. Access to the hardwired functions needs to be
controlled by dedicated but embedded microcontroller cores. The way these cores are designed has to be generic, since each
microcontroller will be dedicated to certain subtasks of the algorithm; on the other hand, the same core will be used to perform tasks at
higher levels.
Because it is very expensive to build dedicated processors and dedicated macrofunctions for each application, it is necessary to provide
these functions with enough genericity to allow their use in a large spectrum of applications and, at the same time, with an amount of
customization that allows optimal system performance. This was achieved by using in-house hierarchical analysis tools adapted to the
subsystem, giving a figure for the "flexibility" of the considered structure in the global system context.
IV. Conclusion
Digitalization of the fundamental TV functions has been of great interest for more than 10 years. Several million TV sets containing
digital systems have been produced. However, the real and fully digital system is still to come. A lot of work is being done in this field today,
and the considerations are more technical than economical, which is a normal situation for an emerging technology. The success of this new
multimedia technology will be determined by the applications that run on these techniques.
The necessary technologies and methodologies were discussed to emphasize the main parameters influencing the design of VLSI chips for
digital TV applications, such as parallelization, electrical constraints, power management, scalability and so on.
REFERENCES
[1] Fischer T., "Digital VLSI breeds next generation TV-Receiver", Electronics, H.16 (11.8.1981).
[2] Fischer T., "What is the impact of digital TV?", Transactions of 1982 IEEE ICCE.
[3] "Economy and Intelligence of Microchips", extract of "Funkschau" 12/31, May 1991.
[5] Heberle K., "Multimedia and digital signal processing", Elektronik Industrie, No. 11, 1993.
[6] Pirsch P., Demassieux N., Gehrke W., "VLSI Architectures for Video Compression - A Survey", Special Issue Proceedings of the IEEE, Advances in Image and Video
Compression, early 1995.
[7] Kunt M., Ikonomopoulos A., Kocher M., "Second-Generation Image-Coding Techniques", Proceedings of the IEEE, Vol. 73, No. 4, pp. 549-574, April 1985.
[8] Senthinathan R., Prince J. L., "Application Specific CMOS Output Driver Circuit Design Techniques to Reduce Simultaneous Switching Noise", IEEE Journal of Solid-State
Circuits, Vol. 28, No. 12, pp. 1383-1388, December 1993.
[9] Vittoz E., "Low Power Design: ways to approach the limits", Digest of Technical Papers, ISSCC'94, pp. 14-18, 1994.
[10] Nakagome Y. et al., "Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSIs", IEEE Journal of Solid-State Circuits, Vol. 28, No. 4, pp. 414-419, April 1993.
[11] Lee S. S., Allstot D., "Electrothermal Simulation of Integrated Circuits", IEEE Journal of Solid-State Circuits, Vol. 28, No. 12, pp. 1283-1293, December 1993.
[12] Fujishima M. et al., "Low-Power 1/2 Frequency Dividers Using 0.1-µm CMOS Circuits Built with Ultrathin SIMOX Substrates", IEEE Journal of Solid-State Circuits, Vol. 28,
No. 4, pp. 510-512, April 1993.
[13] Kowalczuk J., "On the Design and Implementation of Algorithms for Multi-Media Systems", PhD Thesis, EPFL, December 1993.
[14] Masgonty J.-M., Mosch P., Piguet C., "Branch-Based Digital Cell Libraries", Internal Report, CSEM, 1990.
[15] Yuan J., Svensson C., "High-Speed CMOS Circuit Technique", IEEE Journal of Solid-State Circuits, Vol. 24, No. 1, pp. 62-70, February 1989.
[16] McAuley A. J., "Dynamic Asynchronous Logic for High-Speed CMOS Systems", IEEE Journal of Solid-State Circuits, Vol. 27, No. 3, pp. 382-388, March 1992.
[17] Avizienis A., "Signed-Digit Number Representations for Fast Parallel Arithmetic", IRE Trans. Electron. Comput., Vol. EC-10, pp. 389-400, 1961.
[18] Takagi N. et al., "A High-Speed Multiplier Using a Redundant Binary Adder Tree", IEEE Journal of Solid-State Circuits, Vol. 22, No. 1, pp. 28-34, February 1987.
[19] McAuley A. J., "Four State Asynchronous Architectures", IEEE Transactions on Computers, Vol. 41, No. 2, pp. 129-142, February 1992.
[20] Ercegovac M. D., Lang T., "On-line Arithmetic: A Design Methodology and Applications in Digital Signal Processing", Journal of VLSI Signal Processing, Vol. III, pp. 252-163, 1988.
[21] Hoe D. H. K., Salama C. A. T., "Dynamic GaAs Capacitively Coupled Domino Logic (CCDL)", IEEE Journal of Solid-State Circuits, Vol. 26, No. 1, pp. 844-849, June 1991.
[22] Roos G., Hoefflinger B., Zingg R., "Complex 3D-CMOS Circuits Based on a Triple-Decker Cell", CICC 1991, pp. 125-128, 1991.
[23] Ebrahimi T. et al., "EPFL Proposal for MPEG-2", Kurihama, Japan, November 1991.
[24] Hervigo R., Kowalczuk J., Mlynek D., "A Multiprocessor Architecture for a HDTV Motion Estimation System", IEEE Transactions on Consumer Electronics, Vol. 38, No. 3,
pp. 690-697, August 1992.
[25] Langdon G. G., "An Introduction to Arithmetic Coding", IBM Journal of Research and Development, Vol. 28, No. 2, pp. 135-149, March 1984.
[26] Duc P., Nicoulaz D., Mlynek D., "A RISC Controller with Customization Facility for Flexible System Integration", ISCAS '94, Edinburgh, June 1994.
[27] Hervigo R., Kowalczuk J., Ebrahimi T., Mlynek D., Kunt M., "A VLSI Architecture for Digital HDTV Codecs", ISCAS '92, Vol. 3, pp. 1077-1080, 1992.
[28] Kowalczuk J., Mlynek D., "Implementation of Multipurpose VLSI Filters for HDTV Codecs", IEEE Transactions on Consumer Electronics, Vol. 38, No. 3, pp. 546-551, August
1992.
[29] Duardo O. et al., "Architecture and Implementation of ICs for a DSC-HDTV Video Decoder System", IEEE Micro, pp. 22-27, October 1992.
[30] Goyal R., "Managing Signal Integrity", IEEE Spectrum, pp. 54-58, March 1994.
[31] Rissanen J. J., Langdon G. G., "Arithmetic Coding", IBM Journal of Research and Development, Vol. 23, pp. 149-162, 1979.
[32] Daugman J. G., "Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression", IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol. 36, No. 7, pp. 1169-1179, July 1988.
[34] Forstner P., "Timing Measurement with Fast Logic Circuits", TI Technical Journal - Engineering Technology, pp. 29-39, May-June 1993.
[35] Rijns H., "Analog CMOS Teletext Data Slicer", Digest of Technical Papers, ISSCC'94, pp. 70-71, February 1994.
[36] Demura T., "A Single Chip MPEG2 Video Decoder LSI", Digest of Technical Papers, ISSCC'94, pp. 72-73, February 1994.
[37] Toyokura M., "A Video DSP with Macroblock-Level-Pipeline and a SIMD Type Vector Pipeline Architecture for MPEG2 CODEC", Digest of Technical Papers, ISSCC'94,
pp. 74-75, February 1994.
[38] SGS-Thomson Microelectronics, "MPEG2/CCIR 601 Video Decoder - STi3500", Preliminary Data Sheet, January 1994.
[39] Array Microsystems, "Image Compression Coprocessor (ICC) - a77100", Advance Information Data Sheet, Rev. 1.1, July 1993.
[40] Array Microsystems, "Motion Estimation Coprocessor (MEC) - a77300", Product Preview Data Sheet, Rev. 0.1, April 1993.
Chapter 11
VLSI FOR TELECOMMUNICATION
SYSTEMS
● Introduction
● Telecommunication Fundamentals
● General Telecommunication Network Taxonomy
● Comparison Between Different Switching Techniques
● ATM Networks
● Case Study: ATM Switch
● Case study: ATM Transmission of Multiplexed-MPEG Streams
● Conclusions
● Bibliography
11.1. Introduction
This chapter is organised as follows: a review of telecommunication fundamentals and a network taxonomy is given
in sections 11.2 and 11.3. In section 11.4 switching techniques are explained as an introduction to section 11.5, in which
ATM network concepts are visited. Sections 11.6 and 11.7 present two case studies showing the main elements that will
be found in a telecommunication system-on-a-chip. The former is an ATM switch, the latter a system for transmitting
MPEG streams over ATM networks.
Figure 11.1 shows a switching network. Lines are the media links; ovals are the network nodes. Media links simply
carry data from one point to another. Nodes take the incoming data and route them to an output port.
If two different communication paths intersect in this network, they have to share some resources: two paths can
share a media link or a network node. The next sections describe these sharing techniques.
Media sharing occurs when two communication channels use the same media link. This section presents how several
communication channels can use the same media link, independently of architectural considerations. There are three main techniques.
The first, simple method consists of multiplexing data in time: TDMA (Time Division Multiple Access). Each user transmits, for a fraction
of time equal to 1/(number of possible channels), over the full bandwidth W. This sharing mode can be synchronous or asynchronous.
Figure 11.3 shows a synchronous TDMA system. Each channel uses one time slot every T periods, so selecting a time slot
identifies one channel. The classical wired phone network uses this technique.
In synchronous TDMA, if an established channel stops transferring data without freeing its assigned time slot, the
unused bandwidth is lost and other channels cannot take advantage of it. The technique has evolved into
asynchronous TDMA to avoid this problem.
Figure 11.4 shows an asynchronous TDMA system. Each channel uses a time slot when the user needs to transfer data
and a slot is free. A header in each time slot carries the channel identification. ATM networks use this technique.
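The contrast between the two TDMA flavours can be made concrete with a toy model (the traffic pattern and the list-based slot bookkeeping below are invented for illustration):

# Synchronous TDMA: slot i of every frame belongs to channel i, used or not.
# Asynchronous TDMA: any free slot may carry any channel; a header labels it.

def sync_tdma(frames, traffic):
    """traffic[ch][t] is True when channel ch has data to send in frame t."""
    n = len(traffic)
    slots = []
    for t in range(frames):
        for ch in range(n):
            slots.append(ch if traffic[ch][t] else None)  # None = wasted slot
    return slots

def async_tdma(frames, traffic):
    n = len(traffic)
    # each entry models a slot carrying (channel-id header, frame); none wasted
    return [(ch, t) for t in range(frames) for ch in range(n) if traffic[ch][t]]

traffic = {0: [True, True, True], 1: [True, False, False]}
print(sync_tdma(3, traffic))   # [0, 1, 0, None, 0, None]: two slots lost
print(async_tdma(3, traffic))  # [(0, 0), (1, 0), (0, 1), (0, 2)]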
These two techniques are used to connect users. Providing broadcast channels in TDMA is not easy; the
Frequency Division Multiple Access (FDMA) technique avoids this problem. The next section presents this sharing mode.
This sharing method consists of giving each channel a piece of the available bandwidth. Each user transmits over a constant
bandwidth equal to W/(number of possible channels). Filtering the whole W spectrum with a bandwidth
W' = W/(number of possible channels) selects one channel. TV and radio broadcasters use this media sharing technique.
Figure 11.5 shows an FDMA spectrum diagram.
Another method has been developed beyond the frequency dimension. This method, called Code Division Multiple
Access (CDMA), uses an encoding-decoding scheme initially developed for military communications; today consumer-market
applications also use it. Each user transmits using the full bandwidth. Demodulating the whole W band with a given
identification code selects one channel out of the others. Next-generation mobile phone standards (IS-95, W-CDMA) use this
media sharing technique. Figure 11.6 shows a CDMA spectrum diagram.
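A toy direct-sequence CDMA example may help; the 4-chip Walsh codes and data bits below are invented, and real standards such as IS-95 use far longer codes:

import numpy as np

code_a = np.array([+1, +1, +1, +1])
code_b = np.array([+1, -1, +1, -1])    # orthogonal to code_a: dot product is 0

bits_a, bits_b = [+1, -1], [-1, -1]    # one data bit spread over one code period
sig_a = np.concatenate([b * code_a for b in bits_a])
sig_b = np.concatenate([b * code_b for b in bits_b])
tx = sig_a + sig_b                     # both users transmit on the full band

# Correlating the received sum with one user's code recovers that channel.
rx_a = [int(np.sign(np.dot(tx[i:i+4], code_a))) for i in range(0, len(tx), 4)]
print(rx_a)                            # [1, -1] == bits_a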
These techniques can also be combined. For example, the GSM phone standard (Global System for Mobile communications,
known in Switzerland as Natel D) uses a mixed FDMA-TDMA technique.
After this description of media sharing, the next section presents how a network node routes data from an input port to a given
output port.
Node sharing occurs when two communication channels use the same network node. The question is how several
communication channels can use the same node in a cell switching network, i.e. an ATM network.
Before answering this question, we have to specify the switching function. The next section presents this concept.
As shown in figure 11.8, a switch has N input ports and N output ports. Data come in on the lines attached to the input
ports. After their destination has been identified, data are routed through the switch to the appropriate output port. After this
stage, data can be sent on the communication line attached to the output port.
We could implement this canonical switch directly in hardware. However, this technological solution poses
throughput problems. Section 11.6.1.2.1 (describing crossbar switch architectures) shows why, and
section 11.6.1.2.2 (describing the Batcher-Banyan network) shows how the throughput problems can be solved.
Furthermore, the incoming data sequence can pose routing problems. The remainder of this section shows these critical
scenarios.
Figure 11.9 shows some switching scenarios. Scenario 1 shows two cells from two different input ports going through
the switch to two different output ports; these two cells can be routed simultaneously. Scenario 2 shows two cells from
the same input port going through the switch to two different output ports; both cells are routed to their output
destinations.
Scenario 3 shows two cells from two different input ports going through the switch to the same output port. There are
five possible strategies to solve this problem:
● Drop one cell and route the other. This solution involves data loss, so it is not a good approach.
● Route both cells simultaneously and store, in the output port, the cell that has not yet been sent on the attached
line. This technique is called output buffering.
● Store the incoming cells in the input ports and route them later. This technique is called input buffering.
● The two other solutions consist of storing the extra cells during the routing task. These techniques are
derived from input buffering.
Section 11.6.1 considers why output buffering is better than input buffering.
Telecommunication networks can be classified into two main groups according to who decides which nodes are not
going to receive the transmitted information. When the network takes responsibility for this decision, we have a
switching network. When the decision is left to the end-nodes, we have a broadcast network, which can be divided into
packet radio networks, satellite networks and local area networks.
Switching networks use one of the following switching techniques: circuit, message or packet switching, the last one
implemented as either virtual circuit or datagram. Let us compare these techniques.
We can begin with two rough classifications. If a connection (path) between the origin-node and the end-node
is established at the beginning of a session, we are talking about circuit or packet (virtual circuit) switching;
if not, we are talking about message or packet (datagram) switching. On the other hand, considering how a
message is transmitted: if the whole message is divided into pieces we have packet switching (based either on
virtual circuits or datagrams), and if it is not, we have circuit or message switching.
In the following paragraphs we get into the details of the different switching techniques.
Figure 11.11 shows the most important events in the life of a connection in a four-node circuit switching
network (see figure 11.10). When a connection is established, the origin-node identifies the first intermediate
node (node A) on the path to the end-node and sends it a communication request signal. After the first
intermediate node receives this signal, the process is repeated as many times as needed to reach the end-node.
Afterwards, the end-node sends a communication acknowledge signal to the origin-node through all the
intermediate nodes that were used in the communication request. Then a full-duplex transmission line, which
will be kept for the whole communication, is set up between the origin-node and the end-node. To release the
communication, the origin-node sends a communication end signal to the end-node.
Figure-11.10:
Figure-11.11:
Figure 11.12 shows the connection life events for a message switching network. When a connection is
established, the origin-node identifies the first intermediate node on the path to the end-node and sends it
the whole message. After receiving and storing this message, the first intermediate node (node A)
identifies the second one (node B) and, when the transmission line is not busy, forwards the
whole message (store-and-forward philosophy). This process is repeated up to the end-node. As can be
seen in figure 11.12, no communication establishment or release is needed.
Figure-11.12:
Figure 11.13 shows the same events for a virtual circuit (packet) switching network. When a connection
is established, the origin-node identifies the first intermediate node (node A) on the path to the end-node
and sends it a communication request packet. This process is repeated as many times as needed to reach the
end-node. Then the end-node sends a communication acknowledge packet to the origin-node through the
intermediate nodes (A, B, C and D) that were traversed in the communication request. The virtual
circuit established in this way will be kept for the whole communication. Once a virtual circuit has been
established, the origin-node begins to send packets (each of them carrying a virtual circuit identifier) to the
first intermediate node. The first intermediate node (node A) then begins to send packets to the following
node in the virtual circuit without waiting to store all the message packets received from the origin-node.
This process is repeated until all message packets arrive at the end-node. In the communication release,
when the origin-node sends the end-node a communication end packet, the latter answers with an
acknowledge packet. There are two possibilities when releasing a connection:
● No trace of the virtual circuit information is left, so every communication is set up as if it were the
first one.
● The virtual circuit information is kept for future connections.
Figure-11.13:
The most important events in the life of a communication in a datagram switching network are shown in
figure 11.14. The origin-node identifies the first intermediate node on the path and begins to send packets.
Each packet carries an origin-node and an end-node identifier. The first intermediate node (node A) begins
to send packets to the following intermediate node without storing the whole message. This process is
repeated up to the end-node. As there is neither connection establishment nor connection release, the
path followed by each packet from the origin-node to the end-node can be different; therefore, as a
consequence of different propagation delays, packets can arrive out of order.
Figure-11.14:
Before describing the fundamentals of ATM networks, we will define a few concepts such as transfer
mode and multiplexing needed to understand the main ATM points.
The concept of transfer mode summarizes two ideas related to information transmission in
telecommunication networks: how information is multiplexed, i.e. how different messages share the same
communication circuit, and how information is switched, i.e. how the messages are routed to the
destination-node.
The concept of multiplexing is related to the way in which several communications can share the same
transmission medium. As seen earlier in this chapter, the techniques used are time-division multiplexing (TDM)
and frequency-division multiplexing (FDM). The former can be synchronous or asynchronous.
In STD (synchronous time-division) multiplexing, a periodic structure divided into time intervals, called a
frame, is defined, and each time interval is assigned to a communication channel. As the number of time
intervals in each frame is fixed, each channel has a fixed capacity. The information delay is just a function
of the distance and the access time, because there is no conflict over access to the resources (time intervals).
In ATD (asynchronous time-division) multiplexing, the time intervals used by a communication channel
are neither inside a frame nor previously assigned: every time interval can be assigned to any channel.
The channel assigned to each information unit is identified by a label. With this scheme,
any source may transmit information at any time, provided there are enough free resources in the
network.
The switching concept refers to the routing of information from an origin-node to an end-node.
We have already discussed the different switching techniques in sections 11.4.1-11.4.4.
ATM networks use ATD (asynchronous time-division) as their multiplexing technique and cell switching as
their switching technique.
With ATD multiplexing, variable bit-rate sources can be connected to the network because time intervals
are assigned to channels dynamically.
Circuit switching is not a suitable technique when variable bit-rate sources are to be used, because after
the connection establishment the bit rate with this switching technique must remain constant. This fixed
assignment is not just an inefficient use of the available resources but a contradiction of the main goal of B-
ISDN (broadband integrated services digital network), where each service has different requirements.
ATM networks will be a key element in the development of B-ISDN, as stated in ITU (International
Telecommunication Union) recommendation I.121.
General packet switching is not a suitable solution for ATM networks either, because of the difficulty of
integrating real-time services. However, as it has the advantage of efficient resource usage for bursty
sources, the switching technique adopted in ATM networks is a variant of it: cell switching.
Cell switching works similarly to packet switching. The differences between the two are the following:
● All information - data, voice, video - is transported from the origin-node to the end-node in small,
constant-size packets of 53 octets, called cells (in traditional packet switching the packet size is variable).
● Only lightweight protocols are used, to allow fast switching in the nodes. As a drawback, the protocols
are less efficient.
● Signaling is completely separated from the information flow, in contrast to packet switching, in which
information and signaling are mixed.
● Traffic flows of arbitrary bit rate can be integrated in the same network.
The size of the ATM cell header is 5 octets (approximately 10% of the total cell size). This small header
allows fast processing in the network. The size of the cell payload is 48 octets. This small payload allows
low store-and-forward delays in the network switching nodes (see figure 11.15).
The decision about the payload size was a trade-off between different proposals. While in conventional
data communication longer payloads are preferred, to reduce the information overhead, in video
communication, which is more sensitive to delays, smaller ones are desired. The choice of the current payload
size was a Solomonic compromise: in Europe the preferred payload size was 32 octets, but in the USA and Japan
the preferred size was 64 octets. Finally, in a meeting held in Geneva in June 1989, it was agreed to take as
payload size the average of those two proposals: 48 octets.
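Some quick arithmetic on these figures (the 155.52 Mb/s line rate is an assumption added for scale, being a common SDH rate used with ATM, not a number from the text):

HEADER, PAYLOAD = 5, 48
CELL = HEADER + PAYLOAD                 # 53 octets

overhead = HEADER / CELL                # ~0.094, i.e. "approx. 10 %"
cell_time = CELL * 8 / 155.52e6         # ~2.7 us to serialize one cell
print(f"{overhead:.1%}, {cell_time * 1e6:.2f} us")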
In ATM networks, the interface between the network user (either an end-node or a gateway to another
network) and the network is called the UNI (User-Network Interface). The UNI specifies the possible physical
media, the cell format, the mechanisms to identify different connections established through the same interface,
the total access rate and the mechanisms to define the parameters that determine the quality of service.
The interface between a pair of network nodes is called the NNI (Network-Node Interface). This interface is
mainly dedicated to routing and switching between nodes. Besides, it is designed to allow interoperability
between switching fabrics of different vendors.
The header format depends on whether a cell is at the UNI or the NNI. The functions of the cell
header fields are the following (figure 11.15):
● GFC, Generic Flow Control. This field appears only at the UNI and is responsible for medium
access control, as more than one end-user may be connected to the same UNI.
● VPI, VCI - Virtual Path Identifier, Virtual Channel Identifier. A connection in an ATM network is
uniquely defined by the combination of these two fields, which allows routing and addressing at
two levels. The network routing function treats them as labels that can be changed in each
node.
● PT, Payload Type. The main purpose of this field is to distinguish between user information and
OAM (Operation & Maintenance) information.
● CLP, Cell Loss Priority. This field allows two priorities (high or low) to be assigned to cells. For
example, if a user does not meet the set-up requirements, the network can mark a cell as low
priority. Low-priority cells will be the first to be dropped when a congestion state is detected
in any of the ATM network node queues.
● HEC, Header Error Control. This field allows error detection and correction of the header information.
Figure-11.15:
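A sketch of the standard UNI header bit layout makes the field boundaries concrete (the example header value is made up; at the NNI the four GFC bits extend the VPI to 12 bits and the rest is identical):

def parse_uni_header(h: bytes) -> dict:
    assert len(h) == 5                  # 5-octet ATM cell header
    return {
        "GFC": h[0] >> 4,                                          # 4 bits, UNI only
        "VPI": ((h[0] & 0x0F) << 4) | (h[1] >> 4),                 # 8 bits at the UNI
        "VCI": ((h[1] & 0x0F) << 12) | (h[2] << 4) | (h[3] >> 4),  # 16 bits
        "PT":  (h[3] >> 1) & 0x07,                                 # 3 bits
        "CLP": h[3] & 0x01,                                        # 1 bit
        "HEC": h[4],                                               # 8 bits
    }

print(parse_uni_header(bytes([0x00, 0x10, 0x00, 0x50, 0x55])))
# {'GFC': 0, 'VPI': 1, 'VCI': 5, 'PT': 0, 'CLP': 0, 'HEC': 85}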
Several special cell types are defined:
● Unassigned cells: cells with no useful information. They pass transparently through the physical layer
and arrive at the remote ATM layer without modification.
● Empty cells: also cells with no useful information. When information sources have no cell to send,
the physical layer inserts these cells to match the cell flow to the maximum transmission capacity.
They never arrive at the remote ATM layer, because the physical layer filters them out.
● Metasignaling cells: cells used to negotiate the establishment of a virtual circuit between the network
and the end-user. Once the virtual circuit has been established, all set-up and release operations
use this circuit.
● Broadcast cells: cells whose end-node is every node connected to the same interface.
● Physical layer cells: cells carrying OAM (Operations & Maintenance) information for the network
physical layer.
The protocol stack architecture used in ATM networks considers three different planes: the user plane, the
control plane and the management plane. We will now describe the functions of the different layers in the
user plane of the protocol stack.
The physical layer is responsible for information transport. It is divided into two sublayers: the PMD
(Physical Medium Dependent) sublayer and the TC (Transmission Convergence) sublayer. The TC sublayer
adapts the cells received from the ATM layer to the specific format used in the transmission.
The ATM layer provides a connection-oriented service, independently of the transmission medium used. Its
main functions are the following:
● Cell multiplexing and demultiplexing from several connections into a unique cell flow. A pair of
identifiers, VCI/VPI, characterizes each connection.
● Cell switching. This function consists of changing the input VCI/VPI pair for a different output
pair.
● Cell header generation/extraction (except the HEC field, whose generation/checking is the
responsibility of the physical layer).
● Flow control and medium access control for those UNIs shared by more than one terminal.
The AAL (ATM Adaptation Layer) adapts, on the transmitter side, the information coming from higher
layers to the ATM layer and, on the receiver side, the ATM services to higher-level requirements. It is
divided into three sublayers.
Being cell switching networks, ATM networks require a connection establishment. It is at this moment
that all the communication requirements are specified: bandwidth, delay, information priority and so
on. These parameters are defined for each connection and, independently of what is happening at other
network points, they determine the connection's quality of service (QoS). A connection is established if and
only if the network can guarantee the quality demanded by the user without disturbing the quality of
already existing connections.
In ATM networks it is possible to distinguish two levels in each virtual connection, each of them defined
by an identifier.
Virtual paths are associated with the highest level of the virtual connection hierarchy. A virtual path is a
bundle of virtual channels connecting ATM switches to ATM switches or ATM switches to end-nodes.
Virtual channels are associated with the lowest level of the virtual connection hierarchy. A virtual channel
provides a unidirectional communication between end-nodes, between gateways and end-nodes, and between
LANs (Local Area Networks) and ATM networks. As the provided communication is unidirectional, each full-
duplex communication consists of two virtual channels (each of them following the same path through the
network).
Virtual channels and paths can be established dynamically, by signaling protocols, or permanently.
Usually, paths are permanent connections while channels are dynamic ones. In an ATM virtual
connection, the input cell sequence is always preserved at the output.
In ATM networks, cell routing is achieved thanks to the VPI/VCI pair. This information is
not an explicit address but a label: cells do not carry the end-node address in their headers, but
identifiers that change from switch to switch on the way to the end-node. Switching in a node begins
by reading the VPI/VCI fields of the input cell header (empty cells are managed in a special way: once
identified, they are simply dropped at the switch input). This pair of identifiers is used to access the
routing table in the switch and obtain, as a result, the output port and a newly assigned VPI/VCI pair. The
next switch on the path will use this new pair of identifiers in the same way, and the procedure is repeated.
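A minimal sketch of this label-swapping lookup (the table contents are invented; a real switch fills the table at connection establishment):

routing_table = {
    # (in_port, in_VPI, in_VCI) -> (out_port, out_VPI, out_VCI)
    (0, 1, 5): (3, 7, 42),
    (1, 1, 5): (2, 9, 11),
}

def switch_cell(in_port, vpi, vci):
    if (vpi, vci) == (0, 0):            # empty cell (all-zero label here, for
        return None                     # illustration): identified and dropped
    out_port, new_vpi, new_vci = routing_table[(in_port, vpi, vci)]
    return out_port, new_vpi, new_vci   # the same cell leaves with a new label

print(switch_cell(0, 1, 5))             # (3, 7, 42)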
There are two kinds of switches:
● VP switches: they analyze just the VPI to route cells. As a virtual path (VP) groups several
virtual channels (VC), if the VCIs are not considered, all VCs associated with a VP are switched
together.
● VC switches: both identifiers, VPI and VCI, are analyzed to route cells.
In an ATM network it is possible to negotiate different levels or qualities of service, to adapt the network
to many applications and to offer users a flexible way of accessing the resources.
If we study the main service characteristics, we can establish a service classification and define different
adaptation levels for each service. Four different service classes are defined for ATM networks (Table 11.1).
Table-11.1:
Once the different services have been characterized, it is possible to define the different adaptation layers.
There are four adaptation layers in ATM networks.
The main objective of the traffic control function in ATM networks is to guarantee optimal network
performance.
Basically, network traffic control in ATM networks is a preventive approach: it avoids congestion states,
whose immediate effects are excessive cell dropping and unacceptable end-to-end delays.
Traffic control can be applied from two different sides. On the network side, it incorporates two main
functions: Call Acceptance Control (CAC) and Usage Parameter Control (UPC). On the user side, it
mainly takes the form of either source rate control or layered source coding (prioritization) to conform to
the service contract specification.
CAC (call acceptance control) is applied during call setup to ensure that admitting the call
will not disturb the existing connections and that enough network resources are available for it.
It is also referred to as call admission control. The CAC results in a service contract.
UPC (usage parameter control) is performed during the lifetime of a connection. It checks whether the
source traffic characteristics respect the service contract specification. If excessive traffic is detected, it
can be either immediately discarded or tagged for selective discarding if congestion is encountered in the
network. UPC is also referred to as traffic monitoring, traffic shaping, bandwidth enforcement or cell
admission control. The Leaky Bucket (LB) scheme is a widely accepted implementation of a UPC
function.
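A minimal leaky-bucket sketch under assumed parameters (the leak rate and bucket depth are invented; a real UPC function derives them from the service contract, and may discard rather than tag):

class LeakyBucket:
    def __init__(self, rate, depth):
        self.rate, self.depth = rate, depth   # leak rate (cells/s), bucket size
        self.level, self.last = 0.0, 0.0

    def arrive(self, t):
        # the bucket leaks continuously; each conforming cell adds one unit
        self.level = max(0.0, self.level - (t - self.last) * self.rate)
        self.last = t
        if self.level + 1 <= self.depth:
            self.level += 1
            return "conform"
        return "tag"                          # excess traffic: set CLP = 1

lb = LeakyBucket(rate=10.0, depth=3)
print([lb.arrive(t) for t in (0.0, 0.01, 0.02, 0.03, 0.04, 0.5)])
# ['conform', 'conform', 'conform', 'tag', 'tag', 'conform']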
This section shows the architecture of the critical routing part of an ATM switch. Before discussing an
existing ATM chip, we present the technological constraints that drive the design. A switch implements two
main functions:
● A routing function to carry data from an input port to an output port.
● A queuing function to temporarily store incoming data when blocking occurs.
This section shows why output buffering is a better solution to the blocking problem (section 11.2.2.1
shows the blocking scenario).
Consider a simple 2X2 switch (2 input ports and 2 output ports; see figure 11.16). Each number
represents the destination port address. Queued cells are shown in yellow and routed cells in blue.
With an input buffering technique, four cycles are needed to route all cells:
● The first cycle shows the queuing of one cell and the routing of the other.
● The second cycle shows the routing of the previously queued cell and the queuing of two incoming
cells.
● The third cycle shows the routing of the two previously queued cells and the queuing of the incoming
cell.
● The last cycle shows the routing of the last cell.
With an output buffering technique, three cycles are enough:
● The first cycle shows the routing of all incoming cells: one is queued, the other is sent on the attached
output line #2.
● The second cycle shows the routing of the second pair of incoming cells: one is queued in queue #2,
the previously queued cell is sent on output line #2, and the last one is sent directly on output
line #1.
● The last cycle shows the sending of the queued cell on line #2 and the routing and sending of the
cell to line #1.
In certain cases, output buffering thus allows a smaller cell latency; therefore, a lower memory capacity is
needed in the switch. To solve the blocking problem, the output buffering technique has been
chosen.
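The cycle counting can be reproduced with a toy model of output buffering (the arrival pattern is invented; the fabric is assumed fast enough to move every arriving cell to its output queue within one cycle):

from collections import deque

def output_buffered_cycles(arrivals, n_ports):
    """arrivals[t] lists the destination ports of the cells arriving at cycle t."""
    queues = [deque() for _ in range(n_ports)]
    t = 0
    while t < len(arrivals) or any(queues):
        for dest in (arrivals[t] if t < len(arrivals) else []):
            queues[dest].append(dest)   # route straight to the output queue
        for q in queues:
            if q:
                q.popleft()             # each output line sends one cell
        t += 1
    return t                            # cycles needed to drain everything

print(output_buffered_cycles([[1, 1], [0, 1]], n_ports=2))  # 3 cycles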
Given this choice, we need to know how the routing function can be implemented. The next section presents
the currently used techniques.
The simplest way to implement the routing function is to link all inputs to all outputs. By
programming this array of connections, data can be routed from any input port to any output port. This
function can be implemented with a crossbar architecture.
A crossbar is an array of buses and transmission gates implementing paths from any input port to any
output port. This section describes the technique. To understand its limitations, we first describe the
transmission gate.
Figure 11.17 shows an electrical view of a transmission gate and figure 11.18 a schematic view of the
same gate. Two complementary transistors transmit the input signal without degradation (the NMOS
transmits VSS and the PMOS transmits VDD). The command input enables or disables the transmission
function: when the command is active, the input is copied to the output; when it is inactive, the input and
output are isolated. Cin represents the parasitic load on the input line and Cout the parasitic load on the
output line.
If we wire an array of transmission gates as shown in figure 11.19, we obtain a programmable system
capable of routing any incoming data to any output port. A 4X4 switch can be implemented by repeating
this 2X2 structure (see figure 11.20).
We can repeat this structure N times to obtain the required number of input and output ports. This
approach causes a bus load problem: the larger the number of input and output ports, the larger the load
and length of each bus. For example, in figure 11.20 the load on input bus #1 is four times the input load
of one transmission gate plus the parasitic capacitance of the wire. The routing delay from an input to an
output is therefore long, and this technique cannot be used to implement high-throughput switches with
a large number of ports.
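A rough scaling sketch with invented capacitance values shows why the bus load rules out a plain crossbar at large port counts:

C_gate = 10e-15    # assumed input load of one transmission gate (10 fF)
C_wire = 5e-15     # assumed wire capacitance per crosspoint pitch (5 fF)

def bus_load(n_ports):
    return n_ports * (C_gate + C_wire)  # grows linearly with the port count

for n in (4, 64, 1024):
    print(n, f"{bus_load(n) * 1e15:.0f} fF")
# 4 -> 60 fF, 64 -> 960 fF, 1024 -> 15360 fF: the RC delay grows with N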
To solve this problem, a switch based on a network of 2X2 switches has been developed. The next section
shows how these switches are implemented.
Figure 11.21 shows the 2X2-switch module. This switch is composed of one 2X2 crossbar implementing
the routing function and four FIFO memories implementing the output buffer function. The delay to
carry data from an input to an output is lower than that of the large crossbar switch because the buses are
short and are loaded by only two transmission gates.
Figure 11.22 shows an 8X8 Banyan switch. Input ports are connected to output ports by a three-stage
routing network. There is exactly one path from any input to any output port. Each 2X2-switch module
simply routes one input to one of its two outputs.
A blocking scenario in a Banyan switch is shown in figure 11.23. In this figure, red paths show successfully
routed cells and blue ones show blocked cells. The numbers at the inputs represent cell destination
output port numbers. All the incoming cells have different output destinations, but only two cells are
routed: internal collisions cause this problem.
A solution to this problem is to make sure that this internal collision scenario never appears. This can be
achieved if incoming cells are sorted before entering the Banyan routing network. The sorter must order the
incoming cells according to bitonic sequence rules; a Batcher sorter built from a network of 2X2 comparators
implements this function (a sketch of such a sorter follows the list below). A bitonic sequence is one of the
following:
● An ascending order sequence, {0, 1, 2, 3, 4, 5, 6, 7}, like in the first scenario of figure 11.24.
● A descending order sequence.
● An ascending order sequence followed by a descending order sequence.
● A descending order sequence followed by an ascending order sequence {7, 5, 2, 1, 0, 3, 4, 6}, like
in the second scenario of figure 11.24.
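A sketch of the sorting step, assuming power-of-two input sizes (this is the classical recursive formulation of Batcher's bitonic sorter; the hardware version unrolls the same compare-exchange pattern into columns of 2X2 comparators):

def bitonic_sort(seq, ascending=True):
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    # ascending first half + descending second half = a bitonic sequence
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(seq, ascending):
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    seq = list(seq)
    for i in range(half):               # one column of 2X2 comparators
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return (bitonic_merge(seq[:half], ascending)
            + bitonic_merge(seq[half:], ascending))

print(bitonic_sort([7, 5, 2, 1, 0, 3, 4, 6]))   # [0, 1, 2, 3, 4, 5, 6, 7]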
This well-known architecture is currently used to implement the switching function. The next section
comments on an existing switching chip using this technique.
Figure-11.25: (block labels include cell synchronization (PL), transmission adaptation and synchronization (PM))
Figure 11.25 shows the high-level architecture of a switch. Each block implements some of the functions
described in Table 1. An explanation of the general functionality of each layer can be found in section 11.5.4.
The management block drives and synchronizes the other layers; for instance, it drives the control check and
the administrative functions. High data transfer rates can be reached (up to some gigabits per second).
One of the critical blocks of this architecture is the switching module (outlined in bold in figure 11.25).
The previous section discussed one of the most commonly used techniques to implement this function. In the
next section we comment on an existing chip designed with the previously described techniques.
Figure 11.26, from Yam[97], shows the mapping between the chip architecture and the functional architecture.
The switching network module is mainly composed of the following blocks: a Batcher-Banyan network,
an input multiplexer bank and an output demultiplexer bank. The Batcher-Banyan network implements
the switching function. The multiplexer-demultiplexer banks are used to reduce the internal Batcher-
Banyan network bus width (from 8 bits to 2 bits and vice versa).
This means that to switch one incoming 8-bit word in one cycle, four internal Batcher-Banyan network
cycles are needed. The drawback of the bus-width reduction is a fourfold increase in the internal switch
frequency. The chip designers therefore had to choose a faster technology to keep a high-throughput
switching function; in this case they chose GaAs technology, usually used for high-frequency
systems.
Available ATM network throughputs, on the order of Gb/s, allow broadband applications to interconnect using ATM infrastructures. As a case study, to give some intuition
about the main elements that will be found in a telecommunication system-on-a-chip, we will consider the architectural design of an ATM ASIC. The architecture is conceived to serve
applications in which multimedia information must be multiplexed and transported to an end-node through an ATM network. Interactive multimedia and mobile multimedia are examples of
applications that will use such a system.
Interactive multimedia (INM) relates to the network delivery of rich digital content, including audio and video, to client devices (e.g. desktop computer, TV and set-top box), typically as
part of an application having user-controlled interactions. It includes interactive movies, where viewers can explore different subplots; interactive games, where players take different paths
based on previous event outcomes; training-on-demand, in which training content is tuned to each student's existing knowledge, experience and rate of information absorption; interactive
marketing and shopping; digital libraries; video-on-demand and so on.
Mobile multimedia applies in general to every scenario in which remote delivery of expertise to mobile agents is needed. It includes applications in computer supported cooperative
work (CSCW), where mobile workers with difficult problems receive advice to enhance the efficiency and quality of their tasks, and emergency-response applications (ambulance services,
police, fire brigades).
A system offering this service of multiplexing and transport through ATM networks should meet the following requirements if it is to cover applications such as those described above:
● The system should easily scale the number of streams and the bandwidth associated with each of them, to accommodate future increases in service demand.
● The system should fairly share the available multiplex bandwidth between the different sources. This feature makes it possible either to increase the number of streams multiplexed
when the available bandwidth is fixed, or to reduce the bandwidth necessary to multiplex a fixed number of them.
● The system should guarantee a bandwidth reservation if sources with heterogeneous traffic patterns are to be served simultaneously.
● The system should be able to give service to mobile/portable sources connected by either wireless or infrared links.
● The system should be able to control the quality of service (QoS) offered, because if no control is applied to keep it constant, image quality degradation will depend
sharply on transient congestion conditions in the network, where information is dropped randomly.
● Last, but not least, the system should be integrated on a single chip.
Distributing the multiplexing function among the different sources allows the requirements of mobility/portability and stream scalability to be met efficiently.
Figure-11.28:
This distribution can be achieved with a basic unit that applies the multiplexing function locally to each source, as can be seen in figure 11.28. This basic unit is repeated for each stream
that we want to multiplex. Figure 11.29 shows how the basic unit works: there is a queue, where cells carrying information from the source wait until the MAC (Medium Access Control)
unit gives them permission to be inserted. When an empty cell is found and the MAC unit allows insertion, the empty cell disappears from the flow and a new cell is inserted.
Figure 11.30 shows the details of this basic unit. There are four main blocks:
● Cell multiplexing unit: where empty cells are substituted by source cells when MAC makes the decision.
● MAC: decides when the information coming from the video source is introduced into the high-speed flow.
● QoS control: manages video information in order to produce a smooth quality of service degradation when network suffers from congestion.
● Protocol processing & DMA blocks: they, respectively, adapt information coming from the source for ATM transmission and communicate with the software running in the host
processor.
Figure-11.29:
The path followed by a cell from the source to the output module when it is multiplexed is also shown in figure 11.30.
Figure-11.30:
In what follows, we will get into the details of the QoS block, the MAC block and the protocol processing & DMA block, leaving the cell multiplexing unit block to the end in order to
explain the main design features of telecommunication ASICs.
One potential problem in ATM networks, caused by the bursty nature of the traffic, is cell loss. When several sources transmit at their peak rates simultaneously, the buffers available at some
switches may overflow. The subsequent cell drops lead to severe degradation of the service quality (a multiplicative effect) due to the loss of synchronization at the decoder.
Rather than dropping cells randomly during network congestion, we might indicate to the ATM network the relative importance of different cells (prioritization), so that only the less important
ones are dropped. This is possible in ATM networks thanks to the CLP (cell loss priority) cell header bit. Thus, when the network enters a period of congestion, cells are
dropped in an intelligent fashion (non-priority cells first) so that the end-user only perceives a small degradation of the service's QoS.
Figure-11.31:
However, when the network is operating under normal conditions, both high-priority and low-priority data are transmitted successfully, and a high-quality service is available to the end
user. In the worst-case scenario, the end user is guaranteed a predetermined minimum QoS dictated by the high-priority packets.
Figure-11.32:
Figure-11.33:
Figures 11.32 and 11.33 show the effect of cell drops on the quality of the received image. When the priority mechanism is applied (low-frequency image information as high-priority
data and high-frequency image information as low-priority data), an improvement in the quality of the decoded image is observed.
Figure 11.34 shows the effect of non-priority cell drops on the high-frequency portion of the decoded image information.
Figure-11.34:
The basic functionality of the distributed multiplexing algorithm is to merge low-speed ATM sources into a single ATM flow. When two or more sources try to access the common
resource, a conflict can occur.
The MAC algorithm can adopt the DQDB (Distributed Queue Dual Bus) philosophy, taking into account that there is just one information flow (downstream); a dedicated channel is
responsible for sending requests upstream.
The main objective of the DQDB protocol is to create and maintain a global queue of access requests to the shared bus. That queue is distributed among all connected basic units. If a basic
unit wants to send an ATM cell, a request is sent to all its predecessors. Therefore, each basic unit receives, from its neighbor on the right, the access requests coming from every basic unit on
the right; these requests, plus the requests of the current basic unit, are sent to the neighbor on the left. For each request, an empty cell passes through a basic unit without being assigned.
When QoS control is applied, these algorithms are modified so that all HP cells are sent before any LP cell queued at any basic unit. This mechanism ensures that critical information
is sent first when congestion appears.
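A hedged sketch of this distributed-queue bookkeeping (single downstream bus, one priority level; the full DQDB protocol is considerably more involved):

class BasicUnit:
    def __init__(self):
        self.requests = 0       # requests seen from units downstream
        self.countdown = None   # empty cells to let pass before inserting

    def on_request_from_downstream(self):
        self.requests += 1      # one more unit queued behind this one

    def want_to_send(self):
        # join the global queue behind all requests already registered
        self.countdown, self.requests = self.requests, 0

    def on_empty_cell(self):
        if self.countdown is None:      # idle: let the cell serve others
            self.requests = max(0, self.requests - 1)
            return "pass"
        if self.countdown > 0:          # our turn has not yet come
            self.countdown -= 1
            return "pass"
        self.countdown = None
        return "insert"                 # claim this empty cell for our cell

u = BasicUnit()
u.on_request_from_downstream()          # a downstream unit queued first
u.want_to_send()
print([u.on_empty_cell(), u.on_empty_cell()])   # ['pass', 'insert']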
11.7.4. Communication with the host processor: protocol processing & DMA.
Another important issue is the information exchange between the software running on the host processor and the basic unit. The main mechanism used for these transactions is DMA (Direct Memory Access). In this technique, all communication passes through special shared data structures, allocated in the system's main memory, that can be read from or written to by both the processor and the basic unit.
Any time data is read from or written to main memory, it is considered to be "touched". A design should try to minimize data touches because of the large negative impact they can have on performance.
Let us imagine we are running, on a typical monolithic Unix kernel machine, an INM application over an implementation of the AAL/ATM protocol. Figure 11.35 shows all the data touch operations involved in transmitting a cell from host main memory to the basic unit. The sequence of events is as follows:
1. The application generates data to be sent and writes it to its user-space buffer. It then issues a system call to the socket layer to transmit the data.
To copy the data from the user buffer into a set of kernel buffers, both located in main memory, steps 2 and 3 are needed:
Figure-11.35:
4. The AAL layer implementation reads the data in order to segment it and compute the checksum that has to be inserted in the AAL_PDU trailer.
5. The basic unit reads the data from the kernel buffers, adds the ATM cell header, and transmits it.
Figure 11.36 shows what happens in hardware for the events explained above. Some of the lines are dashed to indicate that the corresponding read operation might be satisfied from the cache memory rather than from main memory. In the best case there are three data touches for any given piece of data; in the worst case there are five.
Figure-11.36:
Why is the number of data touches so important? Let us consider a main memory bandwidth of about 1.5 GB/s for sequential writes and 0.7 GB/s for sequential reads. If we assume that on average there are three reads for every two writes (see figure 11.36), the resulting average memory bandwidth is ~1.0 GB/s. If our basic unit requires five data touch operations for every word in every cell, then the average throughput we can expect will be only a fifth of the average memory bandwidth, i.e. 0.2 GB/s. Clearly, every data touch that we can save provides a significant improvement in throughput.
The number of data touches can be reduced if either the kernel buffers, or both the user and kernel buffers, are allocated from extra on-chip memory added to the basic unit.
In figure 11.37, kernel buffers are allocated from memory on the basic unit to reduce data touches from 5 to 2. Programmed I/O is the technique used to move data from the user buffer to these on-chip kernel buffers (data is touched by the processor before it is transferred to the basic unit).
Figure-11.37:
Figure 11.38 shows the same data touch reduction but with DMA used instead of programmed I/O. In this case, since data arriving from main memory at the basic unit is not touched by the processor, the processor cannot compute the checksum needed in the AAL layer; this computation therefore has to be implemented in hardware in the basic unit.
Figure-11.38:
Figure 11.39 shows an alternative that involves no main memory accesses at all (zero data touches). Both user and kernel buffers are allocated from on-chip memory. Although this approach drastically reduces the number of data touches, it has two disadvantages:
● A very large amount of on-chip memory is needed to allocate the user and kernel buffers.
● The API (Application Programming Interface) presented to programmers in this kind of framework is incompatible with the existing socket-based API.
Figure-11.39:
11.7.5. Cell multiplexing unit: main design features of telecommunication ASICs.
There are four modules in the Cell Multiplexing Unit (figure 11.40):
● Input module
● Input FIFO module
● Multiplexing module
● Output module
Figure-11.40:
The Input and Output modules implement the UTOPIA protocol (levels one and two), the ATM-Forum standard communication protocol between an ATM layer and a Physical layer entity. Common design elements used in both modules are registers, finite-state machines, counters, and logic to compare register values, as shown in figures 11.41 and 11.42.
Figure-11.41:
Figure-11.42:
The FIFO module isolates two different clock domains: the input cell clock domain and the output cell clock domain. It also allows cells to be stored (first in, first out) when the UTOPIA protocol stops the cell flow.
Having different clock domains is a characteristic feature of telecommunication systems-on-a-chip that adds a new dimension to the design complexity: unsynchronized clock domains.
The FIFO queue is implemented with a dual-port RAM and two registers to store addresses: the write pointer and the read pointer. Part of this queue is shown in figure 11.43.
Figure-11.43:
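The behavior of such a queue can be sketched in C as follows (sizes and names are illustrative; in the real hardware the two pointers live in different clock domains and would be gray-coded and resynchronized before being compared, a detail omitted here):

    #include <stdint.h>
    #define CELL_WORDS 14   /* one 53-byte cell padded to 14 32-bit words */
    #define DEPTH      32   /* assumed queue depth */

    typedef struct {
        uint32_t ram[DEPTH][CELL_WORDS];  /* models the dual-port RAM */
        unsigned wr, rd;                  /* write and read pointers  */
    } cell_fifo;

    static int fifo_full(const cell_fifo *f)  { return (f->wr + 1) % DEPTH == f->rd; }
    static int fifo_empty(const cell_fifo *f) { return f->wr == f->rd; }

    /* Input clock domain: store an incoming cell unless the queue is full. */
    int fifo_push(cell_fifo *f, const uint32_t cell[CELL_WORDS]) {
        if (fifo_full(f)) return 0;
        for (int i = 0; i < CELL_WORDS; i++) f->ram[f->wr][i] = cell[i];
        f->wr = (f->wr + 1) % DEPTH;
        return 1;
    }

    /* Output clock domain: retrieve the oldest cell, e.g. when UTOPIA resumes. */
    int fifo_pop(cell_fifo *f, uint32_t cell[CELL_WORDS]) {
        if (fifo_empty(f)) return 0;
        for (int i = 0; i < CELL_WORDS; i++) cell[i] = f->ram[f->rd][i];
        f->rd = (f->rd + 1) % DEPTH;
        return 1;
    }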
The Multiplexing module replaces empty cells with assigned ones. The insertion module has two registers to avoid losing parts of a cell when the UTOPIA protocol stops, another two registers to delay the information coming from the network, and one register for pipelining the module (figure 11.44).
Figure-11.44:
11.8. Conclusions
Through these two case studies within the ATM domain, we have shown the main characteristics common to telecommunication ASIC design. Briefly, these features are the following:
● Different clock domains can coexist; therefore, techniques to reduce the probability of metastable behavior have to be applied in the design.
● High-throughput networks imply dealing with high-frequency clock designs (hundreds of megahertz).
● FIFO memories are usually needed, either to separate different clock domains or to store information before accessing a common resource.
● Designs are mainly dominated by the presence of registers.
Chapter 12
Digital Signal Processing Architectures
● Introduction
● History
● Typical DSP applications
● The FIR Example
● General Architectures
● Data Path
● Addressing
● Peripherals
● How is a DSP different from a general-purpose processor
● Superscalar Architectures
12.1 Introduction
Digital signal processing is concerned with the representation of signals in digital form and the transformation or processing of such
signal representation using numerical computation.
Sophisticated signal processing functions can be realized using digital techniques; numerous important signal processing functions are difficult or impossible to implement using analog (continuous-time) methods. Reprogrammability is a strong advantage of digital systems over conventional analog ones. Furthermore, digital systems are inherently more reliable, more compact, and less sensitive to environmental conditions and component aging than analog systems. The digital approach also allows the possibility of time-sharing (or multiplexing) a given DSP microprocessor among a number of different signal processing functions.
12.2 History
Since the invention of the transistor and the integrated circuit, digital signal processing functions have been implemented on many hardware platforms, ranging from special-purpose architectures to general-purpose computers. One of the earliest descriptions of a special-purpose hardware architecture for digital filtering came from Bell Labs in 1968.[1] The problem with such architectures, however, is their lack of flexibility. To realize a complete application, one needs to perform functions that go beyond simple filtering, such as control, adaptive coefficient generation, and non-linear functions such as detection.
The solution is to use an architecture that is more like a general-purpose computer, but which can perform basic signal processing
operations very efficiently. This means satisfying the following criteria:
● The ability to perform a multiply and add operation in parallel in the time of one instruction.[2]
● The ability to perform data moves to and from the arithmetic unit in parallel with arithmetic operations and modification of
address pointers.
● The ability to perform logical operations on data and alter control flow based on the results of these operations.
In the 1960s and 1970s, multiple chips or special-purpose computers were designed for computing DSP algorithms efficiently. These
systems were too costly to be used for anything but research or military radar applications. It was not until all of this functionality
(arithmetic, addressing, control, I/O, data storage, control storage) could be realized on a single chip that DSP could become an
alternative to analog signal processing for the wide span of applications that we see today.
In the late 1970s large-scale integration technology reached the point of maturity that it became practical to consider realizing a single
chip DSP. Several companies developed products along these lines including AMI, Intel, NEC, and Bell Labs.
AMI S2811
AMI announced a "Signal Processing Peripheral" in 1978.[1] The S2811 was designed to operate in conjunction with a microprocessor such as the 6800 and depended upon it for initialization and configuration.[2] With a small, nonexpandable program memory of only 256 words, the S2811 was intended to offload some math-intensive subroutines from the microprocessor. As a peripheral, therefore, it could not "stand alone" as could DSPs from Bell Labs, NEC, and other companies. The part was to be implemented in an exotic process technology called "V-groove." First silicon was not available until after 1979, and the part was never used in any volume product.[3]
Intel 2920
Intel announced an "Analog Signal Processor," the 2920, at the 1979 Institute of Electrical and Electronics Engineers (IEEE) Solid State
Circuits Conference.[4] A unique feature of this device was the on-chip analog/digital and digital/analog converter capability. The
drawback was the lack of a multiplier. Multiplication was performed by a series of instructions involving shifting (scaling) and adding
partial products to an accumulator. Multiplication of two variables was even more involved—requiring conditional instruction
execution.
In addition, the mechanism for addressing memory was limited to direct addressing, and the program could not perform branching.[5] As such, while it could perform some signal processing calculations somewhat more efficiently than a general-purpose microprocessor, it greatly sacrificed flexibility and bears little resemblance to today's single-chip DSPs. Too slow for any complete application, it was used as a component for part of a modem.[6]
NEC µPD7720
NEC announced a digital signal processor, the 7720, at the IEEE Solid State Circuits Conference in February 1980 (the same conference at which Bell Labs disclosed its first single-chip DSP). The 7720 has all of the attributes of a modern single-chip DSP as described above. However, devices and tools were not available in the U.S. until as late as April 1981.[7]
The genesis of Bell Labs' first single-chip DSP was the recommendation of a study group that began to consider the possibility of
developing a multipurpose, large-scale integration circuit for digital signal processing in January 1977.[8] Their report, issued in
October 1977, outlined the basic elements of a minimal DSP architecture which consisted of multiplier/accumulator, addressing unit,
and control. The plan was for the I/O, data, and control memories to be external to the 40-pin DIP until large-scale integration
technology could support their integration. The spec was completed in April 1978 and the design a year later. First samples were tested
in May 1979. By October, devices and tools were distributed to other Bell Labs development groups. It became a key component in
AT&T's first digital switch, the 5ESS, and many other telecommunications products. Devices with this architecture are still in manufacture today.
The first Bell Labs DSP was different from what was in the report. The DSP1 contained all of the functional elements found in today's DSPs, including a multiplier-accumulator (MAC), parallel addressing unit, control, control memory, data memory, and I/O. It fully meets the above criteria for a single-chip DSP.
The DSP1 was first disclosed outside AT&T at the IEEE Solid State Circuits Conference in February 1980.[9] A special issue of the Bell System Technical Journal was published in 1981 describing the architecture, tools, and nine fully developed telecommunications applications for the device.[10]
First generation (1979 - 1985): Harvard architecture, hardwired multiplier. Examples: NEC µPD7720, Intel 2920, Bell Labs DSP1, Texas Instruments TMS320C10. (Examples from later generations include the Philips TriMedia and the Motorola Starcore.)
12.3 Typical DSP applications
Digital signal processing in general, and DSP processors in particular, are used in a wide variety of applications, from military radar systems to consumer electronics. Naturally, no one processor can meet the needs of all applications. Criteria such as performance, cost, integration, ease of development, and power consumption are key points to examine when designing or selecting a particular DSP for a class of applications. The table below summarizes different processor applications.
Hi-fi audio encoding and decoding: consumer audio & video, digital audio broadcast, professional audio, multimedia computers.
12.4 The FIR Example
The Finite Impulse Response (FIR) filter is a convenient way to introduce the features needed in typical DSP systems. The FIR filter is described by the following equation:

    y(n) = Σ h(k)·x(n-k),  k = 0 … N-1

where x is the input sequence, h(0) … h(N-1) are the N filter coefficients (taps), and y is the output. The following diagram shows an FIR filter; it illustrates the basic DSP operations: multiply, accumulate, and delay.
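In C, the filter is a single multiply-accumulate loop (a minimal sketch; real-time I/O, fixed-point scaling and boundary handling are ignored, and the caller must guarantee n >= N-1):

    /* Direct-form FIR: y(n) = sum over k of h(k) * x(n-k).
     * One output sample costs N multiply-accumulate (MAC) operations,
     * which is why single-cycle MACs dominate DSP data paths. */
    float fir(const float h[], const float x[], int n, int N) {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += h[k] * x[n - k];   /* one MAC per filter tap */
        return acc;
    }

Each iteration performs the three basic operations visible in the diagram: a multiplication, an addition into the accumulator, and a step along the delay line.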
12.5 General Architectures
The simplest processor memory structure is a single bank of memory, which the processor accesses through a single set of address and data lines. This structure, common among non-DSP processors, is referred to as the von Neumann architecture. In this implementation, data and instructions are stored in the same bank, and one memory access is performed during each instruction cycle. As seen previously, a typical DSP operation is a MAC executed in one cycle, which requires fetching two data words from memory, multiplying them together, and adding the result to the previous result. With the von Neumann model it is not possible to fetch the instruction and the data in the same cycle. This is one reason why conventional processors do not perform well on DSP applications in general.
The solution to the memory access problem is known as the Harvard architecture, and its variant the modified Harvard architecture. The following diagram shows the Harvard architecture. The processor fetches an instruction from the program memory using the program counter and stores it in the instruction register. In parallel, the Address Calculation Unit fetches one operand from the data memory and feeds it to the execution unit. This simple architecture allows one instruction word and one data word to be fetched in a single cycle. It requires four buses: two address buses and two data buses.
The next picture represents the modified Harvard architecture. Two data words are now fetched from memory in a single cycle. Since it is not possible to access the same memory twice in the same cycle, this implementation requires three memory banks: a program memory bank and two data memory banks, commonly designated X and Y, each with its own set of address and data buses.
12.6 Data Path
The data path of a DSP processor is where the vital arithmetic manipulations of signals take place. DSP data paths are highly specialized to achieve high performance on the types of computation most common in DSP applications, such as multiply-accumulate operations. Registers, adders, multipliers, comparators, logic operators, multiplexers, and buffers represent 95% of a typical DSP data path.
Multiplier
A single-cycle multiplier is the essence of a DSP, since multiplication is an essential operation in all DSP applications. An important distinction between multipliers in DSPs is the size of the product relative to the size of the operands. In general, multiplying two n-bit fixed-point numbers requires 2n bits to represent the correct result. For this reason, DSPs generally have a multiplier whose output is twice the word length of the native operands.
Accumulator Registers
Accumulator registers hold intermediate and final results of multiply-accumulate and other arithmetic operations. Most DSP processors have two or more accumulators. In general, the accumulator is wider than the result of a product; the additional bits are called guard bits. These bits allow values to be accumulated without risk of overflow and without rescaling: n additional bits allow up to 2^n accumulations to be performed without overflow. The guard-bit method is more advantageous than scaling the multiplier product, since it retains maximum precision in the intermediate steps of a computation.
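The effect of guard bits can be sketched with C integer types: if 24-bit operands produce 48-bit products, accumulating them in a wider register (here a 64-bit variable standing in for the 56-bit EXT:MSP:LSP accumulator) needs no intermediate rescaling:

    #include <stdint.h>

    /* Accumulate n products of 24-bit operands (stored in int32_t).
     * Eight guard bits above the 48-bit product allow up to 2^8 = 256
     * accumulations with no risk of overflow. */
    int64_t mac_block(const int32_t *a, const int32_t *b, int n) {
        int64_t acc = 0;                  /* plays the role of EXT:MSP:LSP */
        for (int i = 0; i < n; i++)
            acc += (int64_t)a[i] * b[i];  /* 24x24 -> 48-bit product */
        return acc;                       /* rescale once, at the end */
    }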
ALU
Arithmetic logic units implement basic arithmetic and logical operations. Operations such as addition, subtraction, AND, and OR are performed in the ALU.
Shifter
In fixed-point arithmetic, multiplications and accumulations often induce a growth in the bit width of results. Scaling is then necessary
to pass results from stage to stage and is performed through the use of shifters.
X1, X0, Y1, and Y0 are four 24-bit, general-purpose data registers. They can be treated as four independent, 24-bit registers or as two 48-bit registers called X and Y, formed by concatenating X1:X0 and Y1:Y0, respectively. X1 is the most significant word in X and
Y1 is the most significant word in Y. The registers serve as input buffer registers between the X Data Bus or Y Data Bus and the MAC
unit. They act as Data ALU source operands and allow new operands to be loaded for the next instruction while the current instruction
uses the register contents. The registers may also be read back out to the appropriate data bus to implement memory-delay operations
and save/restore operations for interrupt service routines.
The MAC and logic unit shown in the figure below conduct the main arithmetic processing and perform all calculations on data
operands in the DSP.
For arithmetic instructions, the unit accepts up to three input operands and outputs one 56-bit result in the following form:
extension:most significant product:least significant product (EXT:MSP:LSP). The operation of the MAC unit occurs independently and
in parallel with XDB and YDB activity, and its registers facilitate buffering for Data ALU inputs and outputs. Latches on the MAC unit
input permit writing an input register which is the source for a Data ALU operation in the same instruction. The arithmetic unit contains
a multiplier and two accumulators. The input to the multiplier can only come from the X or Y registers (X1, X0, Y1, Y0). The multiplier
executes 24-bit x 24-bit, parallel, two's-complement fractional multiplies. The 48-bit product is right-justified and added to the 56-bit contents of either the A or B accumulator, and the 56-bit sum is stored back in the same accumulator. An 8-bit adder, which acts as an extension accumulator for the MAC array, accommodates up to 255 overflows and allows the two 56-bit accumulators to be added to and subtracted from each other. The extension adder output is the EXT portion of the MAC unit output. This multiply/accumulate operation is not pipelined, but is a single-cycle operation. If the instruction specifies a multiply without accumulation (MPY), the MAC clears the accumulator and then adds the product to the cleared contents.
In summary, the results of all arithmetic instructions are valid (sign-extended and zero-filled) 56-bit operands in the form EXT:MSP:LSP (A2:A1:A0 or B2:B1:B0). When a 56-bit result is to be stored as a 24-bit operand, the LSP can simply be truncated, or it can be rounded (using convergent rounding) into the MSP. Convergent rounding (round-to-nearest-even) is performed when the instruction (for example, the signed multiply-accumulate and round (MACR) instruction) specifies adding the multiplier's product to the contents of the accumulator. The scaling mode bits in the status register specify which bit in the accumulator shall be rounded.
The logic unit performs the logical operations AND, OR, EOR, and NOT on Data ALU registers. It is 24 bits wide and operates on data in the MSP portion of the accumulator; the LSP and EXT portions are not affected.
The Data ALU features two general-purpose, 56-bit accumulators, A and B. Each consists of three concatenated registers (A2:A1:A0 and B2:B1:B0, respectively). The 8-bit sign extension (EXT) is stored in A2 or B2 and is used when more than 48-bit accuracy is needed; the 24-bit most significant product (MSP) is stored in A1 or B1; the 24-bit least significant product (LSP) is stored in A0 or B0.
Overflow occurs when a source operand requires more bits for accurate representation than are available in the destination. The 8-bit extension registers offer protection against overflow. In the DSP56K chip family, the extreme values that a word operand can assume are -1 and +0.9999998. If the sum of two numbers is less than -1 or greater than +0.9999998, the result (which cannot be represented in a 24-bit word operand) has underflowed or overflowed. The 8-bit extension registers can accurately represent the result of 255 overflows or 255 underflows. Whenever the accumulator extension registers are in use, the V bit in the status register is set.
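The convergent rounding step described above can be sketched in C, with a 64-bit integer standing in for the 56-bit accumulator and the LSP occupying bits 0-23 (a simplified model that ignores the scaling mode bits):

    #include <stdint.h>

    /* Round-to-nearest, ties to even: clear the LSP and round into the
     * MSP; on an exact half, round up only if the MSP's low bit is set. */
    int64_t round_convergent(int64_t acc) {
        const int64_t half = INT64_C(1) << 23;     /* 0.5 LSB of the MSP */
        int64_t lsp = acc & 0xFFFFFF;              /* low 24 bits (LSP)  */
        acc &= ~INT64_C(0xFFFFFF);                 /* truncate the LSP   */
        if (lsp > half || (lsp == half && (acc & (INT64_C(1) << 24))))
            acc += INT64_C(1) << 24;               /* round up one MSP LSB */
        return acc;
    }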
Automatic sign extension occurs when the 56-bit accumulator is written with a smaller operand of 48 or 24 bits. A 24-bit operand is
written to the MSP (A1 or B1) portion of the accumulator, the LSP (A0 or B0) portion is zero filled, and the EXT (A2 or B2) portion is
sign extended from MSP. A 48-bit operand is written into the MSP:LSP portion (A1:A0 or B1:B0) of the accumulator, and the EXT
portion is sign extended from the MSP. No sign extension occurs if an individual 24-bit register is written (A1, A0, B1, or B0). When either A or B is read, it may optionally be scaled one bit left or one bit right for block floating-point arithmetic. Sign extension can also occur when writing A or B from the XDB and/or YDB, or with the results of certain Data ALU operations (such as the transfer conditionally (Tcc) or transfer Data ALU register (TFR) instructions).
Overflow protection occurs when the contents of A or B are transferred over the XDB and YDB by substituting a limiting constant for
the data. Limiting does not affect the content of A or B – only the value transferred over the XDB or YDB is limited. This overflow
protection occurs after the content of the accumulator has been shifted according to the scaling mode. Shifting and limiting occur only
when the entire 56-bit A or B accumulator is specified as the source for a parallel data move over the XDB or YDB. When individual
registers A0, A1, A2, B0, B1, or B2 are specified as the source for a parallel data move, shifting and limiting are not performed.
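Limiting can be modelled in C as follows: if the 56-bit value does not fit in the 48-bit MSP:LSP range, the maximum positive or negative 24-bit constant is put on the bus instead (a sketch; as in the real hardware, the accumulator itself is never modified):

    #include <stdint.h>

    int32_t limit_to_24(int64_t acc) {
        const int64_t max = (INT64_C(1) << 47) - 1;  /* largest in-range value */
        const int64_t min = -(INT64_C(1) << 47);
        if (acc > max) return  0x7FFFFF;   /* most positive 24-bit fraction */
        if (acc < min) return -0x800000;   /* most negative 24-bit fraction */
        return (int32_t)(acc >> 24);       /* MSP of an in-range value      */
    }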
The accumulator shifter is an asynchronous parallel shifter with a 56-bit input and a 56-bit output, implemented immediately before the MAC accumulator input. The source accumulator shifting operations are as follows:
● No shift (unmodified)
● One-bit shift left or right (scaling, as described above)
● Force to zero
12.7. Addressing
The ability to generate new addresses efficiently is a characteristic feature of DSP processors. Most DSP processors include one or more special address generation units (AGUs) dedicated to calculating addresses. An AGU can perform one or more address calculations per instruction cycle without using the processor's main data path, so the calculation of addresses takes place in parallel with arithmetic operations on data, improving processor performance.
One of the main addressing modes is register-indirect addressing: the data is in memory, and the address of the memory location containing the data is held in a register. This gives a natural way to work with arrays of data. Another advantage is efficiency from an instruction-set point of view, since it allows powerful and flexible addressing with relatively few bits in the instruction word.
Whenever an operand is fetched from memory using register-indirect addressing, the address register can be incremented to point to the next needed value in the array. The following table summarizes the most common increment methods in DSPs:
*rP       register indirect        Read the data pointed to by the address in register rP.
*rP++     postincrement            Having read the data, postincrement the address pointer to point to the next value in the array.
*rP--     postdecrement            Having read the data, postdecrement the address pointer to point to the previous value in the array.
*rP++rI   register postincrement   Having read the data, postincrement the address pointer by the amount held in register rI, to point rI values further down the array.
*rP++rIr  bit-reversed (FFT)       Having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit-reversed order.
An additional convenient feature of the AGU is the presence of modulo addressing modes, extensively used for circular buffers. Instead of comparing the address against a calculated value to see whether the end of the buffer has been reached, dedicated registers automatically perform this check and take the necessary action (i.e., reset the register to the start address of the buffer).
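The wrap-around that the AGU performs for free can be sketched in C; the point of the hardware support is that the modulo step below costs no extra instruction in the inner loop:

    #define BUF_LEN 64   /* assumed circular buffer length */

    typedef struct {
        float    buf[BUF_LEN];
        unsigned idx;      /* plays the role of an address register Rn */
    } circ_buffer;

    /* Write the newest sample over the oldest one. */
    void circ_put(circ_buffer *c, float sample) {
        c->buf[c->idx] = sample;
        c->idx = (c->idx + 1) % BUF_LEN;   /* done in hardware by the modifier register */
    }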
The following picture represents the address generation unit of the Motorola 56002.
This AGU uses integer arithmetic to perform the effective address calculations necessary to address data operands in memory, and
contains the registers used to generate the addresses. It implements linear, modulo, and reverse-carry arithmetic, and operates in parallel
with other chip resources to minimize address-generation overhead. The AGU is divided into two identical halves, each of which has an
address arithmetic logic unit (ALU) and four sets of three registers. They are the address registers (R0 - R3 and R4 - R7), offset registers
(N0 - N3 and N4 - N7), and the modifier registers (M0 - M3 and M4 - M7). The eight Rn, Nn, and Mn registers are treated as register
triplets — e.g., only N2 and M2 can be used to update R2. The eight triplets are R0:N0:M0, R1:N1:M1, R2:N2:M2, R3:N3:M3,
R4:N4:M4, R5:N5:M5, R6:N6:M6, and R7:N7:M7.
The two arithmetic units can generate two 16-bit addresses every instruction cycle, one for each of any two of the XAB, YAB, or PAB. The AGU can directly address 65,536 locations on the XAB, 65,536 locations on the YAB, and 65,536 locations on the PAB. The two independent address ALUs work with the two data memories to feed the data ALU two operands in a single cycle. Each operand may be addressed by an Rn, Nn, and Mn triplet.
12.8. Peripherals
Most DSP processors provide on-chip peripherals and interfaces that allow the DSP to be used in an embedded system with a minimum amount of external support hardware.
Serial port
A serial interface transmits and receives data one bit at a time. These ports have a variety of applications: sending and receiving data samples to and from A/D and D/A converters and codecs, exchanging data with other microprocessors or DSPs, and communicating with other hardware. The two main categories are synchronous and asynchronous interfaces. Synchronous serial ports transmit a bit clock signal in addition to the serial bits; the receiver uses this clock to decide when to sample received data. In contrast, asynchronous serial interfaces do not transmit a separate clock signal; they rely on the receiver deducing a clock from the data itself.
A direct extension of serial interfaces leads to parallel ports, where data bits are transmitted in parallel instead of sequentially. Faster communication is obtained at the cost of additional pins.
Host Port
Some DSPs provide a host port for connection to a general-purpose processor or another DSP. Host ports are usually specialized 8 or 16
bit bi-directional parallel ports that can be used to transfer data between the DSP and the host processor.
Communication port
This kind of port is dedicated to multiprocessor operation. It is in general a parallel port intended for communication between DSPs of the same type.
Interrupt controller
An interrupt is an event that causes the processor to stop executing its current program and branch to a special block of code called an interrupt service routine. Typically this code deals with the source of the interrupt and then returns. There are different interrupt sources:
External interrupt lines: dedicated pins on the chip, asserted by external circuitry.
Software interrupts: also called exceptions or traps, these interrupts are generated under software control or occur, for example, on floating-point exceptions (division by zero, overflow, and so on).
DSPs associate interrupts with different memory locations, called interrupt vectors, which contain the addresses of the interrupt service routines. When an interrupt occurs, the processor typically saves its context, fetches the corresponding vector, and branches to the routine.
Priority levels can be assigned to the different interrupts through dedicated registers. An interrupt is acknowledged when its priority level is strictly higher than the current priority level.
Timers
Programmable timers are often used as a source of periodic interrupts. They are completely software-controlled and can activate specific tasks at chosen times. A timer is generally a counter that is preloaded with a desired value and decremented every clock cycle; when zero is reached, an interrupt is issued.
DMA
Direct Memory Access is a technique whereby data can be transferred to or from the processor's memory without the involvement of the processor itself. DMA is commonly used to improve performance with input/output devices. Rather than have the processor read data from an I/O device and copy it into memory, or vice versa, a separate DMA controller handles such transfers in parallel. Typically, the processor loads the DMA controller with control information including the starting address for the transfer, the number of words to be transferred, the source, and the destination. The DMA controller uses the bus request pin to notify the DSP core that it is ready to make a transfer to or from external memory. The DSP core completes its current instruction, releases control of the external memory, and signals the DMA controller via the bus grant pin that the DMA transfer can proceed. The DMA controller then transfers the specified number of data words and optionally signals completion through an interrupt. Some processors also have multiple DMA channels, managing several DMA transfers in parallel.
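A typical programming sequence can be sketched as follows; the register map and names below are purely illustrative, not those of any particular chip:

    #include <stdint.h>

    /* Hypothetical memory-mapped DMA channel. */
    typedef volatile struct {
        uint32_t src;       /* starting source address          */
        uint32_t dst;       /* starting destination address     */
        uint32_t count;     /* number of words to transfer      */
        uint32_t control;   /* start bit, interrupt enable, ... */
    } dma_channel;

    #define DMA_START      (1u << 0)
    #define DMA_IRQ_ENABLE (1u << 1)

    void dma_copy(dma_channel *ch, uint32_t src, uint32_t dst, uint32_t n) {
        ch->src     = src;
        ch->dst     = dst;
        ch->count   = n;
        ch->control = DMA_START | DMA_IRQ_ENABLE;  /* fire and forget */
        /* the controller raises an interrupt when count reaches zero */
    }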
12.9 How is a DSP different from a general-purpose processor
● DSPs are intended for real-time embedded control/signal processing applications, not general-purpose computing.
● DSPs are strictly non-user-programmable (typically no memory management, no operating system, no cache, no shared variables; single-process oriented).
● DSPs usually employ some form of "Harvard architecture" to allow simultaneous code and data fetches.
● A salient characteristic of all DSPs is devoting significant chip real estate to the "multiply-accumulate" (MAC) operation; most DSPs perform a MAC operation in a single clock.
● DSP programs are often resident in fast on-chip ROM and/or RAM (although off-chip bus expansion is usually possible).
● Most DSPs have at least two multi-ported on-chip RAMs for storing operand data.
● DSP interrupt handling is simple, fast, and efficient (minimum context switch overhead).
● Many DSP applications are assembly-coded, due to real-time processing constraints (although C compilers exist for most DSPs).
● DSP address bus widths are typically smaller than those of general-purpose processors (code size tends to be small and "tight-loop" oriented).
● Fixed-point DSPs utilize saturation arithmetic (rather than allowing 2's complement overflow to occur).
● DSP addressing modes are geared toward signal processing applications (direct support for circular buffers, "butterfly" access patterns).
● DSPs often provide direct hardware support for the implementation of "do" loops.
● Many DSPs employ an on-chip hardware stack to facilitate subroutine linkage.
● Most "lower end" DSPs have integrated program ROM and scratchpad RAM to facilitate single-chip solutions.
● Most DSPs do not have integrated ADCs and DACs; these interfaces (if desired) are usually implemented externally.
● Benchmark suites used to compare DSPs are totally different from those used to compare general-purpose (RISC/CISC) processors:
• FIR/IIR filters
• FFTs
• convolution
• dot product
12.10 Superscalar Architectures
The term "superscalar" commonly designates architectures that enable more than one instruction to be executed per clock cycle. Nowadays multimedia architectures, supported by the continuous improvement of process technologies, are rapidly moving towards highly parallel solutions such as SIMD and VLIW machines. What do these acronyms mean?
SIMD stands for Single Instruction on Multiple Data. Simply put, the architecture has a single program control unit that fetches and decodes the program instructions and dispatches them to multiple execution units, i.e., multiple sets of datapaths, registers, and data memories. Of course, a SIMD architecture can be realized by a multiprocessor configuration, but the exploitation of deep-submicron technologies has made it possible to integrate such architectures in a single chip. It is easy at this point to imagine each execution unit being driven by its own program control unit, permitting different instructions of the same program, or different programs, to execute in parallel; in this case the resulting architecture is called Multiple Instructions on Multiple Data (MIMD). Again, a MIMD machine can be implemented as a multiprocessor structure or integrated in a single chip.
Historically, the first examples of so-called multiple-issue machines appeared in the early '80s, and they were called VLIW machines (for Very Long Instruction Word). These machines exploit an instruction word consisting of several (up to 8) instruction fragments. Each fragment controls a specific execution unit; the register set must therefore be multiported to support simultaneous access, because the multiple instructions may need to share the same variables. In order to accommodate the multiple instruction fragments, the instruction word is often over 100 bits long. [12]
The reasons that push towards these parallel approaches are essentially two. First, many scientific and processing algorithms, whether for calculus or, more recently, for communication and multimedia applications, contain a high degree of parallelism. Second, a parallel architecture is a cost-effective way to compute (when the program is parallelizable), since internal and on-chip communications are much faster and much more efficient than external communication channels.
On the other hand, parallel architectures bring with them a number of problems and new challenges that are not present in simple processors. First of all, while it is true that many programs are parallelizable, extensive research has shown that the level of parallelism that can be achieved is often theoretically no greater than 3; this means that on actual architectures the speedup factor is no greater than 2. Based upon this, it would seem that in the absence of significant compiler breakthroughs the available speedup is limited. A second problem concerns memories and registers: highly parallel routines require a high memory access rate, and hence a very difficult optimization of the register set, cache memory, and data buses in order to feed the necessary amount of data into the execution units.
Finally, such complex architectures, with heavily optimized datapaths and data transfers, are very difficult to program. DSP programmers have traditionally developed applications directly in assembly language, which in some respects matches the natural sequential way humans think and lends itself to smart optimizations. Machines like the MIMD and VLIW ones are no longer practical to program in assembly, so processor designers have to spend a great amount of resources (often more than the time needed to develop the chip itself) to provide Software Development Kits able to exploit the full potential of the processor, covering everything from powerful optimization techniques to understandable user interfaces.
More recent attempts at multiple-issue processors have been directed at rather lower amounts of concurrency than the first VLIW architectures (4-5 parallel execution blocks). Three examples of this new generation of superscalar machines will be briefly discussed in the next subsections, highlighting architectural aspects and specific solutions to the problems of parallelization.
The Pentium processor has explicitly supported multimedia since the introduction of the so-called MMX (MultiMedia eXtension) family. The well-known key enhancement of this technology consists of exploiting the 64-bit floating-point registers of the processor to "pack" 8-, 16-, or 32-bit data that can be processed in parallel by a SIMD operating unit. Fifty-seven new instructions are implemented in the processor to exploit these new functionalities, among them "multiply and add", the basic operation of digital convolutions (FIR filters) and FFT algorithms. [13] Two considerations can be made about this processor. First, the packed data are fixed point, so using these extensions for a DSP-oriented task limits the use of floating-point arithmetic; conversely, full use of floating-point operations does not allow any boost in performance compared with the common Pentium family.
Moreover, MMX technology was conceived to specifically support multimedia algorithms while completely preserving code compatibility with previous processors; as a consequence, the increased potential in fixed-point processing power is not supported by the necessary memory and bus redesign, and it is often not possible to "feed" the registers with the correct data. Extensive tests conducted after the disclosure of MMX technology have shown that for typical video applications it is often hard to achieve even a 50% speedup.
Figure 1. How the Pentium MMX exploits the 64-bit floating-point registers to "pack" data in parallel and send them to a SIMD
execution unit
Another multimedia processor of growing interest besides the Intel MMX is the TriMedia by Philips Electronics. This chip is not designed as a completely general-purpose CPU, but with the double functionality of CPU and DSP on the same chip, and its core processing master unit has a VLIW architecture.
● A very powerful, general-purpose VLIW processor core (the DSPCPU) that coordinates all on-chip activities. In addition to implementing the non-trivial parts of multimedia algorithms, this processor runs a small real-time operating system driven by interrupts from the other units.
● DMA-driven multimedia input/output units that operate independently and that properly format data to make software media
processing efficient.
● DMA-driven multimedia coprocessors that operate independently and in parallel with the DSPCPU to perform operations
specific to important multimedia algorithms.
● A high-performance bus and memory system that provides communication between TM1000’s processing units.
The real DSP processing must be implemented in the master CPU/DSP, which is also responsible for directing the whole algorithm. This unit is a 32-bit floating-point, 133 MHz general-purpose unit whose VLIW instruction word can issue up to five operations per cycle to its 27 functional units (integer and floating point, including 2 multipliers and 5 ALUs).
The DSPCPU is provided with a 32-Kbyte instruction cache and a dual-port 16-Kbyte data cache. [14]
TriMedia also provides a set of multimedia instructions, mostly targeted at MPEG-2 video decoding.
Some of the programming challenges of parallel architectures are addressed in the DSPCPU through the concept of guarded conditional operations. Schematically, such an instruction takes the form

    IF Rg  imul Rsrc1 Rsrc2 -> Rdest

where the integer multiplication of the two source registers is written into the destination register only under the condition contained in the "guard" register Rg. This allows better control over the optimization strategies of the parallel compiler, since, for instance, the problem of branches is relaxed and the result is accepted or discarded only at the last execution stage of the pipeline.
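The semantics of a guarded operation can be expressed in C: the operation itself is issued unconditionally, and only the commit of the result depends on the guard, so no branch appears in the instruction stream:

    /* Returns the new value of the destination register. */
    static inline int guarded_imul(int guard, int dest, int src1, int src2) {
        return guard ? src1 * src2 : dest;   /* commit, or keep the old value */
    }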
As mentioned above, complex processors/DSPs like the TriMedia need a large amount of development tools and software support. For this reason the TriMedia comes with a comprehensive set of tools covering the real-time kernel, DSPCPU programming, and exploitation of the complete system.
The TriMedia Software Development Environment provides a comprehensive suite of system software tools to compile and debug multimedia applications, analyse and optimize performance, and simulate execution on the TriMedia processor. Among the supplied application examples are:
● MPEG-1 decoder
● V.34 modem
A very interesting solution recently developed for advanced audio applications is the PUMA (Parallel Universal Music Architecture) DSP by Studer Professional Audio. This chip was conceived and realised in collaboration with the Integrated Systems Center (C3I) of the EPFL.
The integrated circuit is designed and optimized for digital mixing consoles. It provides 4 independent channel processors, and thus four 33-MHz, 24-bit fixed-point multipliers and adders fully dedicated to data processing (another multiplier is provided in the master DSP, which is in charge of the final data processing and directs the whole chip's functionality and I/O units). The important feature of this chip lies in the multiple processing units that can work in parallel on similar blocks of data: each channel processor has its own internal data memory (256 24-bit words per processor), and the Master DSP and the Array DSP have independent program memories and program control units. The design of the I/O units received great care: digital audio input and output are supported by 20 serial lines each, and interprocessor communication is supported through fully independent units (the Cascade Data Input and Cascade Data Output) providing 64 channels on 8 lines at full processor speed. A general-purpose DRAM/SRAM External Memory Interface and the External Host Interface permit memory extension and flexible programmability via an external host processor. The following figure shows the top-level architecture of the PUMA DSP.
The following figure shows the internal datapath of each channel processor; three units can work in parallel in a single clock cycle: a
24x24-bit multiplier, a comparator and the general purpose ALU (adder, shifter, logical operations).
To conclude, it is interesting to spend a few words on the PUMA design flow, to understand how a modern and complex architecture, comprising several million transistors, can be practically realised.
First, the functional specification of the processor is developed, defining functionality, basic blocks and instruction set; at the same time the C model of the architecture is implemented, in order to test the algorithms and the architecture with a simple methodology.
The second step is the VHDL description and simulation of the C model at the RTL level, followed by synthesis to the gate level. All of this was accomplished using the Synopsys Design Compiler and Design Analyzer.
After that, an optimization technique called hierarchical compiling is used: after setting the boundary constraints for the main blocks, the constraints for the inner blocks are derived hierarchically by the compiler, which permits relaxing the timing paths wherever tight constraints are not strictly necessary.
The preliminary place & route follows; then the parasitic parameters (R and C) of each wire are extracted, and the so-called back-annotation, or in-place compilation, is performed, in order to better adapt each load to the real netlist placement. The place & route was done with the Compass tool, the back-annotation again in Synopsys Design Compiler.
Finally, the last place & route is made, and extensive simulations are performed on every part of the chip, to verify the timing of every specific operation. The design is then ready for the foundry.
References
1. Nicholson, Blasco and Reddy, "The S2811 Signal Processing Peripheral," WESCOM Tech Papers, Vol. 22, 1978, pp. 1-12.
2. S2811 Signal Processing Peripheral, Advanced Product Description, AMI, May 1979.
3. Strauss, DSP Strategies 2000, Forward Concepts, Tempe, AZ, November 1996, p. 24.
4. Hoff and Townsend, "An analog input/output microprocessor for signal processing," ISSCC Digest of Tech. Papers, February 1979,
p. 220.
5. 2920 Analog Signal Processor Design Handbook, Intel, Santa Clara, CA, August 1980.
6. Strauss, DSP Strategies 2000, Forward Concepts, Tempe, AZ, November 1996, p.24.
7. Brodersen, "VLSI for Signal Processing," Trends and Perspectives in Signal Processing, Vol. 1, No. 1, January 1981, p. 7.
8. Stanzione et al, "Final Report Study Group on Digital Integrated Signal Processors," Bell Labs Internal Memorandum, October 1977.
9. Boddie, Daryanani, Eldumtan, Gadenz, Thompson, Walters, Pedersen, "A Digital Signal Processor for Telecommunications
Applications," ISSCC Digest of Technical Papers, February 1980, p.44.
10. Bell System Technical Journal, Vol. 60, No. 7, September 1981.
11. Lapsley, Bier and Shoham, "DSP Processor Fundamentals: Architectures and Features," IEEE Press.
12. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design," Jones and Bartlett, 1995.
13. Peleg, Wilkie and Weiser, "Intel MMX for Multimedia PCs," Communications of the ACM, Vol. 40, No. 1, January 1997.
ARCHITECTURES FOR
VIDEO PROCESSING
Integrated System Laboratory C3I
Swiss Federal Institute of Technology, EPFL
The first question we would like to answer is: what do we mean nowadays by video processing? In the past, more or less until the end of the '80s, there were two distinct worlds: an analog TV world and a digital computer world. All TV processing, from the camera to the receiver, was based on analog processing, analog modulation and analog recording. With the progress of digital technology, part of the analog processing could be implemented by digital circuits, with considerable advantages in terms of reproducibility of the circuits (leading to cost and stability advantages) and noise sensitivity (leading to quality advantages). At the end of the '80s, completely new video processing possibilities became feasible with digital circuits. Today, image compression and decompression is the dominant digital video processing in terms of importance and complexity in the whole TV chain.
In the near future, digital processing will be used to pass from standard-resolution TV to HDTV, for which compression and decompression is a must, considering the bandwidth that transmission would otherwise require. Other applications will be found at the level of the camera, to increase image quality by increasing the number of bits per pixel from 8 to 10 or 12, or by using appropriate processing to compensate for sensor limitations (image enhancement by non-linear filtering and processing). Digital processing will also enter the studio for digital recording, editing and 50/60 Hz standard conversions. Today, the high communication bandwidth that uncompressed digital video requires between studio devices for editing and recording operations limits the use of fully digital video and digital video processing at the studio level.
Video Compression
Why has video compression become the dominant video processing application of TV? An analog TV channel needs only a 5 MHz analog channel for transmission; conversely, for digital video with 8-bit A/D conversion and 720 pixels by 576 lines (54 MHz sampling rate), we need a transmission channel with a capacity of 168.8 Mbit/s! For digital HDTV with 10-bit A/D conversion and 1920 pixels by 1152 lines, the required capacity rises to 1.1 Gbit/s! No affordable applications, in terms of cost, are thus possible without video compression.
These requirements have also raised the need for worldwide video compression standards, so as to achieve interoperability and compatibility among devices and operators. H.261 is the name given to the first digital video compression standard, specifically designed for videoconference applications; MPEG-1 is the name of the one designed for CD storage applications (up to 1.5 Mbit/s); MPEG-2 targets digital TV and HDTV, from 4 up to 9 Mbit/s for TV and up to 20 Mbit/s for HDTV; H.263 targets videoconferencing at very low bit rates (16 - 128 kbit/s). All these standards are best considered as a family of standards sharing quite similar processing algorithms and features.
For TV and HDTV, while there are very few encoders, used by broadcast companies (in the limit, just one per channel), there must be a decoder in every TV set. A key principle of these standards is that only the bit-stream syntax and the decoding process are specified:
● Any compressed video bit-stream can be decoded without ambiguity, yielding the same video result.
● A decoder must be able to decode any video bit-stream that respects the decoding syntax.
● An encoder must encode video content into a conformant syntax.
● Encoding algorithms are a competitive issue: encoders can be optimized to achieve higher quality of the compressed video, or to simplify the encoding algorithm so as to obtain a simple encoder. It also means that in the future, as more processing power becomes available, more and more sophisticated and processing-demanding encoding algorithms can be used to find the best choices within the available encoding syntax.
These basic principles of the video compression standards clearly have strong consequences for the architectures implementing video compression. To understand the main processing and architectural issues in video compression, we briefly analyze in more detail the basic processing of the MPEG-2 standard.
In more detail (see Figures 4 and 5), spatial redundancy is reduced by applying an 8x1 DCT transform eight times horizontally and eight times vertically over each 8x8 block. The transform coefficients are then quantized, reducing small high-frequency coefficients to zero, scanned in zigzag order starting from the DC coefficient at the upper left corner of the block, and coded using Huffman tables, also referred to as Variable Length Coding (VLC).
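The quantize-and-scan step can be sketched in C for one 8x8 block; the table below is the standard zigzag scan order, while the uniform quantizer is a simplification (real MPEG-2 also applies a per-frequency weighting matrix):

    /* Scan position -> row-major index into the 8x8 block. */
    static const int zigzag[64] = {
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63 };

    /* Quantize the DCT coefficients (q > 0) and emit them in zigzag
     * order; small high-frequency coefficients become zero, producing
     * the long runs that the VLC stage codes efficiently. */
    void quantize_and_scan(const int dct[64], int q, int out[64]) {
        for (int i = 0; i < 64; i++)
            out[i] = dct[zigzag[i]] / q;
    }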
The reduction of temporal redundancy is the process that drastically reduces the bit rate and enables high compression rates. It is based on the principle of finding the current macro-block in already transmitted pictures, either at the same position in the image or displaced by a so-called "motion vector" (see figure 6). Since an exact copy of the macro-block is not guaranteed to be found, the macro-block that has the lowest average error is chosen as the reference macro-block. The "error macro-block" is then processed to reduce its spatial redundancy, if any, by means of the above-mentioned procedure, and transmitted together with the "motion vector", so that the desired macro-block can be reconstructed from the reference and the relative error.
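The error measure behind "lowest average error" is usually the sum of absolute differences (SAD). A brute-force, full-search block matcher is sketched below in C (the line width W and the border handling are simplified; the caller must keep the candidate window inside the reference frame):

    #include <stdlib.h>

    #define W 720   /* assumed luminance line width */

    /* SAD between the current 16x16 macro-block and one candidate. */
    static int sad16(const unsigned char *cur, const unsigned char *ref) {
        int s = 0;
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                s += abs(cur[y * W + x] - ref[y * W + x]);
        return s;
    }

    /* Try every displacement in [-range, +range]^2, keep the best one. */
    void full_search(const unsigned char *cur, const unsigned char *ref,
                     int range, int *best_dx, int *best_dy) {
        int best = 1 << 30;
        for (int dy = -range; dy <= range; dy++)
            for (int dx = -range; dx <= range; dx++) {
                int s = sad16(cur, ref + dy * W + dx);
                if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
            }
    }

The cost is striking: for a +/-16 search range, each macro-block requires 33 x 33 x 256 absolute differences, which is why dedicated motion estimation processors and smarter search strategies are needed in real encoders.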
Figure 7 shows the so-called MPEG-2 Group of Pictures (GOP) structure, in which images are classified as I (Intra), P (Predicted) and B (Bi-directionally interpolated). The standard specifies that Intra image macro-blocks can only be processed to reduce spatial redundancy; P image macro-blocks can also be processed to reduce temporal redundancy, referring only to past I or P frames; B image macro-blocks can in addition be processed using an interpolation of past and future reference macro-blocks. Obviously, a B macro-block can also be coded as Intra or Predicted if that is found convenient for compression. Note that since B pictures can use both past and future I or P frames as references, the MPEG-2 image transmission order is different from the display order: B pictures are transmitted in the compressed bit-stream after the related I and P pictures.
Figure 7. Structure of an MPEG-2 GOP, showing the reference pictures for motion-compensated prediction of P and B pictures.
Nowadays, digital technology has made much progress in terms of speed and processing performance, so DCT coding and decoding are no longer a critical issue. Figure 8 shows a schematic block diagram of an MPEG-2 decoder, which is very similar to those of the other compression standards. A buffer is needed to receive the compressed bits at a constant bit-rate, since during decoding they are not "consumed" at a constant rate. VLD is a relatively simple processing step that can be implemented by means of look-up tables or memories. Being bit-wise processing, it cannot be parallelized and is quite inefficient to implement on general-purpose processors. This is the reason why new multimedia processors such as the Philips TriMedia use specific VLC/VLD units for entropy coding. The most costly elements of the MPEG-2 decoder are the memories for the storage of past and future reference frames, and the handling of the data flow between the Motion Compensated Interpolator unit and the reference video memories.
For an MPEG-2 encoder (see Figure 9) the situation is very different. First of all, we can recognize a path that implements a complete MPEG-2 decoder, necessary to reconstruct the reference images as they are found at the decoder side. Then we have a motion estimation block (the bi-directional motion estimator), whose goal is to find the motion vectors, and a block that selects and controls the macro-block encoding modes. As discussed in the previous paragraphs, the way to find the best motion vectors, as well as the way to choose the right coding for each macro-block, is not specified by the standard. Therefore, very simple algorithms (with limited quality performance) or extremely complex ones (with high quality performance) can be implemented for these functions. Moreover, MPEG-2 allows the dynamic definition of the GOP structure, making many combinations of coding modes possible. In general, the two critical issues of an MPEG-2 encoder are the motion estimation processor, and the handling of the complex data flow, with the associated bandwidth problems, between the original and coded frame memories, the motion estimation processor and the coding control unit.
We also have to mention that the coding modes of MPEG-2 are much more complex than this brief description might suggest. In fact, existing TV is based on interlaced images, and all coding modes can be applied in distinct ways to "frame" blocks and macro-blocks or to "field" blocks and macro-blocks. The same applies to motion estimation, for which we can use both field-based and frame-based vectors. Moreover, all references for predictions can be made on true image pixels or on "virtual" image pixels obtained by bilinear interpolation, as shown in Figure 10.
Figure 10. MPEG-2 macro-block references can be made also on "virtual" pixels (in red) obtained by
bilinear interpolations, instead of image pixels from the original raster (gray).
In this case, motion vectors with half-pixel precision also need to be estimated. The possibility of using all these encoding modes largely increases the quality of the compressed video, but it can become extremely demanding in terms of processing complexity.
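A half-pel "virtual" pixel at a diagonal position is obtained by bilinear interpolation of its four integer-position neighbours, which in C is simply (a rounding sketch consistent with Figure 10):

    /* p points at the top-left neighbour; stride is the line width. */
    unsigned char half_pel(const unsigned char *p, int stride) {
        return (unsigned char)((p[0] + p[1] + p[stride] + p[stride + 1] + 2) / 4);
    }

Doubling the resolution of the search grid in both directions roughly quadruples the number of candidate positions to evaluate.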
The challenge for the MPEG-2 encoder designer is to find the best trade-off between the complexity of the implemented algorithms and the quality of the compressed video. Architectural and algorithmic issues are very closely related in MPEG-2 encoder architectures.
Figure 12. Processing requirements of 3-D graphic content in terms of pixels and polygons per second.
Computer graphics applications strongly rely on the performance of acceleration cards that are specialized
to execute these numerous but simple pixel operations in parallel, with deep pipelining. Figure 12 reports
a diagram of the processing requirements, in terms of polygons/s and pixels/s, of various graphic
contents.
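The key property that accelerators exploit is that each output pixel depends only on its own inputs. A hedged C sketch of one such simple operation, an alpha blend over a span of pixels (the span layout and parameter names are illustrative), shows iterations that are fully independent and can therefore be replicated and pipelined in hardware:

    #include <stdint.h>

    /* Alpha-blend one span: out = (a*src + (255-a)*dst) / 255, rounded.
       Every pixel is independent of its neighbours, so hardware can
       replicate this datapath and pipeline it deeply. */
    static void blend_span(uint8_t *dst, const uint8_t *src,
                           int n, uint8_t alpha)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (uint8_t)((alpha * src[i]
                               + (255 - alpha) * dst[i] + 127) / 255);
    }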
In MPEG-4, in fact, we can find both natural compressed video and 2-D and 3-D models. The standard is
based on the concept of elementary streams, each of which represents and carries the information of a
single "object" that can be of any type, "natural" or "synthetic", audio or video.
Figure 13 reports an example of what the content of an MPEG-4 scene can be. Natural and 2-D and 3-D
synthetic audio-visual objects are received and composed into a scene as seen by a hypothetical viewer.
Figure 14. Diagram of the MPEG-4 Systems layer and its interface with the network layer.
Two virtual levels are necessary to interface the "elementary stream" level with the network level. The
first multiplexes/demultiplexes each communication stream into packets; the second synchronizes each
packet and builds the "elementary streams" carrying the "object" information, as shown in Figure 14.
The processing related to the MPEG-4 Systems layer cannot be considered video processing; it is very
similar to the packet processing typical of network communications.
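The sketch below illustrates the principle of this two-level packet processing in C. It is a schematic model only: the packet fields, buffer sizes and function names are invented for illustration and do not follow the real MPEG-4 Sync Layer/FlexMux syntax.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical packet: which stream it belongs to, a timestamp used
       by the sync level, and a payload to append to that stream. */
    typedef struct {
        uint16_t stream_id;
        uint32_t timestamp;
        uint16_t length;
        const uint8_t *payload;
    } Packet;

    typedef struct {
        uint8_t buf[1 << 16];   /* reassembly buffer of one elem. stream */
        size_t  fill;
    } ElemStream;

    static void demux_packet(ElemStream *streams, size_t n_streams,
                             const Packet *pkt)
    {
        if (pkt->stream_id >= n_streams) return;        /* unknown: drop  */
        ElemStream *es = &streams[pkt->stream_id];
        if (es->fill + pkt->length > sizeof es->buf) return; /* overflow  */
        memcpy(es->buf + es->fill, pkt->payload, pkt->length);
        es->fill += pkt->length;
        /* A real terminal would hand complete access units, tagged with
           pkt->timestamp, to the corresponding object decoder here. */
    }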
An MPEG-4 terminal can be schematized as shown in Figure 15. The communication network provides
the stream, which is demultiplexed into a set of "elementary streams". Each "elementary stream" is
decoded into audio/video objects. Using the scene description transmitted with the elementary streams,
all objects are "composed" together in the video memory, according to their size, view angle and position
in space, and then "rendered" on the display. The terminal can be interactive, generating upstream data
from user interaction that is sent back to the MPEG-4 encoder.
MPEG-4 systems therefore implement not only the classical MPEG-2-like compression/decompression
processing and functionality, but also computer graphics processing such as "composition" and
"rendering". The main difference compared to the natural video of MPEG-1, MPEG-2 and H.263 is the
introduction of "shape coding", which enables the use of arbitrarily shaped video objects, as illustrated in
Figure 16. Shape coding information is based on macro-block data structures and arithmetic coding of the
contour information associated with each boundary block.
Figure 15. Illustration of the processing and functionality implemented in an MPEG-4 terminal.
Figure 16. Compressed shape information is necessary for arbitrarily shaped objects.
Figure 17. MPEG-4 encoder block diagram; shape information is coded in parallel to the DCT-based
texture coding. Shape coding can be of "Intra" type, or use motion compensation and prediction error,
like texture coding.
The block diagram of an MPEG-4 encoder is depicted in Figure 17. Its general architecture is very
similar to that of an MPEG-2 encoder. We can notice a new "shape coding" block in the motion
estimation loop that produces the shape coding information, transmitted in parallel to the classical texture
coding information.
● ASICs (Application-Specific Integrated Circuits). To this group belong all hardwired circuits
specifically designed for a single processing task. The level of programmability is very low, and the
circuits are usually clocked at the frequency, or at multiples, of the input/output data sampling rates.
● AS-DSPs (Application-Specific DSPs). These architectures are based on a DSP core plus special
functions (such as 1-D and 2-D filters, FFT, graphics accelerators, block matching engines) that are
specific to a selected application.
● Standard DSPs. These are the classical processor architectures, specialized and efficient for multiply-
accumulate operations on 16-, 24- or 32-bit data. The classical, well-known families are those of
Motorola and Texas Instruments. The level of programmability of these processors is very high. They are
also employed for real-time applications with constant input/output rates.
● GPPs (General-Purpose Processors). These are the classical PC processors (Intel, IBM PowerPC) and
workstation processors (Digital Alpha, Sun UltraSparc). Originally they were designed for general-
purpose software applications and, although very powerful, they are in general not really adapted for
video processing. Moreover, the operating systems employed are not real-time OSs. The design of real-
time video applications on these architectures is not as simple a task as it might appear.
Considering the video processing implementations of the last years, we can in general observe the trend
over time illustrated in Figure 18. If we consider different video processing algorithms (indicated as
Proc.1, Proc.2 etc., in order of increasing complexity), such as the DCT on an 8x8 block for instance, we
find that implementations based on ASIC architectures appear first. After some years, with the evolution
of IC technology, these functions can then be implemented in real time by AS-DSPs, then by standard
DSPs, and then by GPPs. This trend corresponds to the desire of transferring the complexity of the
processing from the hardware architecture to the software implementation. However, this trend does not
present only advantages, and it does not apply to all implementation cases. Figures 19, 22 and 23 report
an illustration of the advantages and disadvantages of each class of architectures, which should be
considered case by case. Let us analyze and discuss each feature in detail.
Figure 18. Trend of algorithm implementations over time on different architectures.
These considerations lead to clear advantages in terms of cost for ASICs when high volumes are required
(see Figure 23). Simpler circuits that require smaller silicon area are the right solution for set-top boxes
and other high-volume applications (MPEG-2 decoders for digital TV broadcasting, for instance). In
these cases the high development costs and the lack of debugging and software tools for simulation and
design do not constitute a serious drawback. Modifications of the algorithms and the introduction of new
versions are not possible, but they are not required by this kind of application. Conversely, for low-
volume applications, the use of programmable solutions immediately available on the market, well
supported by compilers, debuggers and simulation tools that can effectively reduce development time and
cost, might be the right solution. The much higher cost of the programmable processor can, in some
cases, become acceptable for relatively low device volumes.
Another conflicting trend between hardwired and programmable solutions arises from the need for low-
power solutions, driven by the increasing importance of portable device applications and by the need to
reduce the growing power dissipated by high-performance processors (see Figure 24). This trend
conflicts with the desire of transferring the increasing complexity of processing algorithms from the
architecture to the software, where it is much easier and faster to modify, correct and debug them. The
optimization of memory size and accesses, clock frequency, and the other architectural features that yield
low power consumption is only possible on ASIC architectures.
What is the range of power consumption reduction that can be reached by passing from a GPP to an
ASIC? It is difficult to answer this question with a single figure; it depends on the architecture and on the
processing task. For instance, Figure 24 reports the power dissipation of a 2-D convolution with a 3x3
filter kernel on a 256x256 image on three different architectures. The result is that an ARM RISC
implementation, besides being slower than the other alternatives (and therefore yielding an under-
estimated result), is about 3 times more power-demanding than an FPGA implementation and 18 times
more than an ASIC-based one. The example of the IMAGE motion estimation chip reported at the end of
this document shows that much higher reduction factors (even more than two orders of magnitude) can be
reached by low-power optimized ASIC architectures for specific processing tasks, when compared to
GPPs providing the same performance.
Figure 24. Power dissipation reduction for the same processing (2-D convolution 3x3) on three different
architectures.
A last general consideration about the efficiency of the various architectures for video processing regards
memory usage. Video processing applications, as we have seen in more detail for MPEG-2, require the
handling of very large amounts of data (pixels) that need to be processed and accessed several times in a
video encoder or decoder. Images are filtered, coded, decoded, and used as references for motion
compensation and motion estimation of different frames; in other words, they are accessed in order or
"randomly" several times in a compression/decompression stage. If we observe the speed of processors
and the speed of access to cache SRAM and synchronous DRAM over the last years, we observe two
distinct trends (see Figure 25). The speed of processors was similar to memory access speed in 1990, but
now it is more than double, and the trend is towards even higher speed ratios. This means that the
performance bottleneck of today's video processing architectures is the efficiency of the data flow. A
correct design of the software for GPPs and a careful evaluation of the achievable memory bandwidth of
the various data exchanges are necessary to avoid the risk that the largest fraction of time is spent by the
processing unit just waiting for the correct data to be processed. For graphics accelerators, data-flow
handling is the basic objective of the processing. Figure 26 reports the performance of some state-of-the-
art devices versus the graphic content.
Figure 25. Evolution of the processing speed of processors, SRAM and synchronous DRAM in recent
years. Memory access speed has become the performance bottleneck of data-intensive processing systems.
Figure 26. Performance and power dissipation of state-of-the-art graphic accelerators (AS-DSPs) versus
polygons/s and pixels/s.
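A back-of-the-envelope estimate shows why these data flows matter. The C snippet below counts the frame-memory traffic of a CCIR-601 MPEG-2 decoder under an assumed (illustrative, not measured) access pattern of one write and three reads per reconstructed sample:

    #include <stdio.h>

    /* Rough reference-memory traffic of a 720x576, 4:2:0, 25 fps decoder.
       The per-sample access count is an assumption: one write
       (reconstruction), roughly two reads (prediction, including the
       extra fetches of half-pel interpolation), one read (display). */
    int main(void)
    {
        double samples = 720.0 * 576.0 * 1.5;   /* 4:2:0 bytes per frame */
        double fps     = 25.0;
        double accesses_per_sample = 4.0;       /* assumed: 1 wr + 3 rd  */
        double bytes_per_s = samples * fps * accesses_per_sample;
        printf("~%.0f MB/s of frame-memory traffic\n", bytes_per_s / 1e6);
        return 0;   /* ~62 MB/s, before any burst or refresh overhead */
    }

Even this optimistic count lands in the tens of MB/s for a decoder alone; an encoder, with motion estimation reads added, multiplies it considerably.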
Motion estimation is indeed the most computationally demanding stage of video compression at the
encoder side. For normal-resolution TV we have to encode 1620 macro-blocks per frame, at 25 frames
per second. Roughly, to evaluate the matching error of one candidate motion vector we need to perform
about 510 arithmetic operations on data of 8 to 16 bits. The number of vector displacements to evaluate
depends on the search window size, which should be large to guarantee high-quality coding. For
instance, for sport sequences a size of about 100x100 is required. This leads to about 206 × 10^9
arithmetic operations per second on 8- to 16-bit data. Even if we are able to select an "intelligent" search
algorithm that reduces the number of search points by one to two orders of magnitude, the number of
operations remains extremely high and is not feasible for state-of-the-art GPPs. Moreover, 32- or 64-bit
arithmetic cores are wasted when only operations on 8 to 16 bits are necessary. Completely different
architectures, implementing a high level of parallelism at the bit level, are necessary.
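To make these figures concrete, here is a minimal full-search sketch in C. One 16x16 SAD costs 256 differences plus 255 accumulations, i.e. roughly the 510 operations quoted above; with a ±50-pixel window (about 100x100 candidates), 1620 macro-blocks per frame and 25 frames per second, the comment in the code reproduces the ~206 × 10^9 operations-per-second estimate. Function names are illustrative, and the caller is assumed to keep the search window inside the reference image.

    #include <stdint.h>
    #include <stdlib.h>
    #include <limits.h>

    /* Matching error of one 16x16 candidate block:
       256 differences + 255 accumulations = ~510 ops per candidate. */
    static unsigned sad16(const uint8_t *cur, const uint8_t *ref, int stride)
    {
        unsigned sad = 0;
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                sad += abs(cur[y * stride + x] - ref[y * stride + x]);
        return sad;
    }

    /* Exhaustive search over a (2R+1)x(2R+1) window around (0, 0).
       With R = 50 this is ~10^4 SADs per macro-block; at 1620 MBs/frame
       and 25 frames/s: 1620 * 25 * 10^4 * 510 ~= 206e9 ops/s. */
    static void full_search(const uint8_t *cur, const uint8_t *ref,
                            int stride, int R, int *best_mx, int *best_my)
    {
        unsigned best = UINT_MAX;
        for (int my = -R; my <= R; my++)
            for (int mx = -R; mx <= R; mx++) {
                unsigned s = sad16(cur, ref + my * stride + mx, stride);
                if (s < best) { best = s; *best_mx = mx; *best_my = my; }
            }
    }

The inner SAD loop is exactly the kind of narrow, regular arithmetic that a parallel pixel processor executes in a few cycles but that wastes most of a 32- or 64-bit GPP datapath.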
To be more accurate, we can notice that B pictures require both forward and backward motion
estimation and that, for TV applications for instance, each macro-block can use the best between frame-
based and field-based motion vectors, at full- or half-pixel resolution. Therefore, we realize that the real
processing needs can increase by more than a factor of 10 if all possible motion vectors are estimated.
Another reason why ASICs or AS-DSPs are an interesting choice today for motion estimation is the still
unsolved need for motion estimation in TV displays. Large TV displays require doubling the refresh rate
to avoid the annoying flickering phenomenon appearing on the side portions of large screens. A
conversion of interlaced content from 50 to 100 Hz by the simple doubling of each field provides
satisfactory results if there is no motion. In the case of moving objects, the image quality provided by
field doubling is low, and motion-compensated interpolation is necessary to reconstruct the movement
phase of the interpolated images. An efficient and low-cost motion estimation stage is necessary for high-
quality up-conversion on TV displays.
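As an illustrative sketch (not a production de-interlacer), the C function below builds one pixel of the inserted image at the temporal midpoint by averaging along the estimated motion trajectory; simple field doubling would copy the previous pixel instead, freezing the motion phase:

    #include <stdint.h>

    /* One pixel of an image interpolated halfway between 'prev' and
       'next', along the motion vector (vx, vy) estimated from prev to
       next. The moving object is at (x - v/2) in prev and (x + v/2) in
       next. Boundary and odd-vector handling are omitted for brevity. */
    static uint8_t mc_interp(const uint8_t *prev, const uint8_t *next,
                             int stride, int x, int y, int vx, int vy)
    {
        uint8_t a = prev[(y - vy / 2) * stride + (x - vx / 2)];
        uint8_t b = next[(y + vy / 2) * stride + (x + vx / 2)];
        return (uint8_t)((a + b + 1) >> 1);
    }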
The basic architectural idea has been to design a processing engine that is extremely efficient in
computing the mean absolute difference between macro-blocks (the matching error), with fast access to a
large image section (the search window). By "extremely efficient" we mean exploiting as much as
possible the parallelism intrinsic to pixel operations on 16x16 blocks of pixels, and being able to access
any position in the search window randomly, without useless waiting times (i.e. providing the engine
with sufficient memory bandwidth to fully exploit its processing power). Figure 29 reports the block
diagram of the "block-matching" engine. We can notice, in the center, the "pixel processor" for the
parallel execution of the macro-block difference, two cache memory banks for the storage of the current
macro-block and of the search window reference, and a RISC processor for the handling of the genetic
motion estimation algorithm and for the communications between processing units. The basic processing
unit of Figure 29 then appears in the general architecture of the chip reported in Figure 30. We can notice
two macro-block processing units in parallel, the various I/O modules for the communication with the
external frame memory, and the communication interfaces for cascading the chip for forward and
backward motion estimation and for larger search window sizes. As mentioned when discussing data-
intensive applications, one of the main difficulties of the chip design is the correct balancing of the
processing times of the various units and the optimization of the various communications between
modules. It is fundamental that the processing of all modules is scheduled so as to avoid wait times, and
that the communication busses have the necessary bandwidth.
Low-power optimizations are summarized in Figure 31. Deactivation of processing units, local gated
clocks and the implementation of a low-power internal SRAM as cache memory made it possible to keep
power dissipation below 1 W. Figure 32 reports the final layout of the chip with the main design parameters.
In conclusion, the IMAGE chip can be classified as an AS-DSP, given its high programmability; the
application-specific function for which special hardware is used is the calculation of macro-block
differences. Its motion estimation performance is much higher than that of any state-of-the-art GPP, and
is obtained with a relatively small chip dissipating less than 1 W while providing real-time motion
estimation for MPEG-2 video compression. More details about the IMAGE chip can be found in:
F. Mombers, M. Gumm et al., "IMAGE: a low-cost, low-power video processor for high quality motion
estimation in MPEG-2 encoding", IEEE Trans. on Consumer Electronics, Vol. 44, No. 3, August 1998,
pp. 774-783.
Figure 28. Requirements of a motion estimation/prediction selection chip for MPEG-2 encoding.
Figure 30. High-level architecture of the IMAGE chip, with an indication of the critical communication
paths.
This web-based advanced course on VLSI system design has evolved from the lecture notes and the additional material used by Prof.
Daniel Mlynek and Prof. Yusuf Leblebici in their senior-year course offerings at the Swiss Federal Institute of Technology - Lausanne,
over the past several years. The aim of the course series is to present a unified view of technological, architectural and design-related
aspects of VLSI systems, and to familiarize the audience with the state-of-the-art issues in VLSI system design.
The course series is primarily intended for senior-level undergraduate and/or graduate student audiences, as well as for practicing
engineers and designers in the microelectronics industry. The expected background includes basic knowledge of MOSFET device
operation, analysis and design of basic digital/logic circuits (such as elementary Boolean gates), and a sufficient knowledge of IC design
tools, all of which can be covered in introductory Semiconductor Device and Microelectronics courses at the undergraduate level.
Ideally, the course work is to be supplemented with laboratory exercises to reinforce the essential design issues and problems using an
industry-standard IC design environment.
Important Remarks
Some sections of this work are based on previously published material by both authors. In particular, Chapters 2, 4, 5 and 7 are largely
based on material presented in "CMOS Digital Integrated Circuits: Analysis and Design", Second Edition (by S.M. Kang and Y.
Leblebici, ISBN 0-07-292507-8).
The authors welcome all comments and suggestions from their web audience regarding the technical contents and the presentation of the
material.
Figure: Major steps of a typical VLSI design flow, with detailed partial flow diagrams for the Top-Down, Bottom-Up and
Fabrication/Test flows.
Daniel Mlynek
Dr. Daniel J. MLYNEK obtained his Ph.D. degree from the University of Strasbourg, France in 1972. He joined ITT Semiconductors in
1973 as a Design Engineer for MOS circuits in the Telecommunication field.
He was with ITT Semiconductors until 1989 and held several positions in R&D, including that of Technical Director in charge of
the IC developments and the associated technologies. His main design activities were in the area of digital TV systems, where ITT
is a world leader, and in some of the advanced HDTV concepts.
He has several patents on digital TV systems. Dr. Mlynek was awarded the Eduard Rhein Prize for his innovations in signal processing
principles that have been implemented in the digital TV system "Digit 2000".
In June 1989, Dr. Mlynek joined the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, where he is a Professor
responsible for VLSI integrated circuits.
● Email : [email protected]
● Homepage: http://c3iwww.epfl.ch/people/daniel/dan.html
● Tel: +41 021 693 4681
Yusuf Leblebici
Yusuf Leblebici (born 1962 in Istanbul) received the B.S. and M.S. degrees in electrical engineering from Istanbul Technical University
(ITU) in 1984 and 1986, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at
Urbana-Champaign in 1990. From 1991 to 1993 he worked as a Visiting Assistant Professor of Electrical and Computer Engineering,
and as a Visiting Research Assistant Professor at the Coordinated Science Laboratory, at the University of Illinois at Urbana-Champaign.
During this period, he was a member of the VLSI/OEIC Design Group at Beckman Institute. In 1993 he joined the Department of
Electrical and Electronics Engineering, Istanbul Technical University as an Associate Professor. He also worked as a senior designer
and project manager at ETA ASIC Design Center, Istanbul. From September 1996 to March 1998, he was an Invited Professor at the
Integrated Systems Center, Swiss Federal Institute of Technology in Lausanne, Switzerland.
Currently, Y. Leblebici is an Associate Professor in the Department of Electrical and Computer Engineering at Worcester Polytechnic
Institute. His research interests include design of high-speed CMOS digital integrated circuits, modeling and simulation of
semiconductor devices, computer-aided design of VLSI circuits, and VLSI reliability analysis.
Dr. Leblebici served on the organizing committee of the 1995 European Conference on Circuit Theory and Design. He received a
NATO Science Fellowship award in 1986, he has been an Honors Scholar of the Turkish Scientific and Technological Research Council
in 1987-1990, and he received the Junior Scientist Award of the same Council in 1995.
● Email : [email protected]
● Homepage: http://ece.wpi.edu/~leblebic/index.html
● Tel: +1 508 831 5494
Internet Links
This is an online Master's Thesis, which describes the Full Custom design of a DSP Macro block. It discusses various
issues of adder design and RAM design.
● Guide to Synthesis and Implementation Tools for VHDL Modeling and Design
This document focuses on the steps needed to actually map a VHDL design onto FPGA chips. It describes various
configuration files and scripts needed during this process.
Publications of WPI Microelectronics Group members in PDF, PS or MS-PowerPoint format are available. Some papers
have also been converted to HTML format. Topics include Analog/Digital Converters, Threshold Logic Circuits and High-
Speed Mixed Signal Circuits.
University of Idaho has a very nice collection of VLSI related links. The above link points to the list of VLSI related
courses.
An excellent site in Poland featuring complete colorful lecture slides of the VLSI Digital Circuits Course.
This is a complete book on Computer-Aided Design of Integrated Circuits, available online. The book was originally
published in 1987; this is a revised second edition.
A huge website designed to be a companion to the book "ASICs". According to the author, the material consists
of nearly 300 figures and 100 tables, as well as the complete text of the 1040-page book.
The Cadence design environment from Cadence Design Systems is an industrial-grade design environment for VLSI circuits.
These pages provide a very nice tutorial for this complex program.
This is the web page for "Digital Integrated Circuits", a popular textbook by Jan Rabaey. It includes, among other things,
transparency slides for all lectures.
● Microchips Presentation
A very good introductory-level presentation of microelectronic circuits. Excellent graphics and animations; a must-see.
The only drawbacks are the size of the images (patience or a high-speed link required) and the presentation language, which
is German.
Reto Zimmermann has complete course slides (Computer Arithmetic: Principles, Architectures, and VLSI Design) in
Postscript format on his homepage. The slides give an excellent overview of current computation architectures. Reto
Zimmermann also maintains the Emacs VHDL mode.
● Chip Directory
If you are looking for pinouts or general information about chips, this is definitely the place to take a look. The site
maintains a large collection of links.
This is a cute site. Acronyms for the masses: the glossary includes more than 15000 entries.
Everything about CPUs: history, news, photographs, performance charts and a comprehensive list of online CPU
documentation.
MAGIC is a popular layout design system for VLSI circuits. It is used extensively by educational institutions and
even in industry. The best part is that it runs on any ordinary PC.