
Module 5

DSP Processors Architecture: DSP architecture for signal processing - Harvard architecture, pipelining, hardware multiplier-accumulator.
Digital Signal Processors
● The programmable digital signal processors (PDSPs) are general purpose
microprocessors designed specifically for digital signal processing
applications.
● They have specialized architectures and instruction sets that execute computation-intensive DSP algorithms more efficiently.
● The programmable DSPs can be divided into two broad categories.
1. General purpose digital signal processors
2. Special purpose digital signal processors.
● General purpose digital signal processors: These are basically high-speed microprocessors with architectures and instruction sets optimized for DSP operations. They include fixed-point processors such as the Texas Instruments TMS320C5x and TMS320C54x and the Motorola DSP563x, and floating-point processors such as the Texas Instruments TMS320C4x and TMS320C67xx and the Analog Devices ADSP21xxx.
● Special purpose digital signal processors: These processors consist of (i) hardware designed for specific DSP algorithms such as the FFT, and (ii) hardware designed for specific applications such as PCM and filtering. Examples of special purpose DSPs are Mitel's multi-channel telephony voice echo canceller (MT93001), FFT processors (PDSP 1655A, TM-44, TM-66) and programmable FIR filters (UPDSP 16256, Model 3092).
● In 1979, Intel introduced the first digital signal processor (the Intel 2920), featuring an on-chip ADC and DAC.
● Texas Instruments introduced the TMS32010, the first-generation fixed-point DSP in the TMS320 family. Later it introduced the TMS32020.
Selecting Digital Signal Processors
The factors that influence the selection of a DSP processor for a given
application are architectural features, execution speed, type of arithmetic and
word length.
● Architectural features
● Execution Speed
● Type of arithmetic
● Word length
● Architectural features :
1. Key features include size of on-chip memory, special instructions and I/O
capability.
2. In applications that require large amounts of memory, on-chip memory is essential: it allows data to be accessed at high speed and the program to execute rapidly. For memory-hungry applications, the internal RAM should be large.
3. For applications that require fast and efficient data flow to and from the outside world, I/O features such as interfaces to ADCs and DACs, DMA capability and support for multiprocessing may be important.
4. Depending on the application, a rich set of special instructions to support DSP operations is important, e.g. zero-overhead looping capability, dedicated DSP instructions, and circular addressing (a C sketch of circular addressing follows this list).
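● As an illustration of what circular-addressing hardware provides for free, here is a minimal C sketch of the equivalent pointer wrap done in software; the buffer length and names are assumptions made for this example, not features of any particular processor.

#include <stdint.h>

/* Software equivalent of hardware circular addressing: after each access
   the index wraps around a fixed-length buffer. BUF_LEN and the names
   below are illustrative assumptions. */
#define BUF_LEN 64

static int16_t delay_line[BUF_LEN];

static unsigned put_sample(int16_t s, unsigned i)
{
    delay_line[i] = s;          /* store the newest sample at index i */
    return (i + 1) % BUF_LEN;   /* wrap: a DSP's address-generation unit
                                   performs this with zero instruction overhead */
}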
● Execution Speed
1. The execution speed is measured in terms of the clock speed of the processor, in MHz, and the number of instructions performed, in millions of instructions per second (MIPS) or, in the case of floating-point digital signal processors, in millions of floating-point operations per second (MFLOPS).
2. Comparing the execution speed of processors on such measures alone may not be meaningful.
3. An alternative measure is based on the execution speed of benchmark
algorithms such as FFT, FIR and IIR filters.
● Type of arithmetic
1. The two most common types of arithmetic used in modern digital signal processors are fixed-point and floating-point arithmetic.
2. Fixed point processors are favoured in low cost, high volume applications
(e.g. cellular phones and computer disk drives).
3. Floating point arithmetic is the natural choice for applications with wide and
variable dynamic range requirements.
4. Floating-point processors are more expensive than fixed-point processors.
● Word Length
1. The longer the data word, the lower the errors introduced by digital signal processing.
2. Fixed point digital signal processors aimed at telecommunications markets
tend to use a 16-bit word length, whereas those aimed at high quality audio
applications tend to use 24-bits.
3. In most floating-point DSP processors, a 32-bit data size (24-bit mantissa and 8-bit exponent) is used for single-precision arithmetic.
4. Most floating point DSP processors also have fixed point arithmetic capability,
and often support variable data size, fixed point arithmetic.
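● Fixed-point DSPs commonly use fractional formats such as Q15 for 16-bit words. The following is a minimal C sketch of Q15 quantization and multiplication with a 32-bit intermediate product; the function names and the choice of Q15 are assumptions for illustration, not a requirement of any particular processor.

#include <stdint.h>

/* Q15: a 16-bit word holds a fraction in [-1, 1) with 15 fractional bits. */
static int16_t to_q15(double x)
{
    /* valid for x in [-1, 1); quantization error is at most 2^-15 */
    return (int16_t)(x * 32768.0);
}

static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;   /* 16x16 -> 32-bit product (Q30) */
    return (int16_t)(p >> 15);    /* rescale to Q15; the discarded low bits are
                                     the error a longer word length would keep */
}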
Applications of PDSPs
Applications are divided into three categories :
● Communication Systems
● Multimedia
● Control/data acquisition
● Communication systems
1. Caller ID, cordless handsets, and many others.
2. In voice communication, an acoustic echo canceller for hands-free wireless systems has been developed using the TMS320Cx.
3. The TMS320C50 fixed-point processor can be used to implement a low bit-rate (1.4 kbps), real-time vocoder (voice coder).
4. A telephone voice dialer is implemented with a 16-bit fixed-point TMS320C5x
PDSP.
5. Modern PDSPs are also suitable for error correction in digital communication.
6. Digital baseband signal processing is another important application of PDSPs.
7. For defense applications, programmable radar signal processing systems have been developed with a linear array of TMS320C30 processors as the front end and a Transputer processor array as the back end.
● Audio Signal Processing
1. PDSP applications in audio signal processing can be classified into three categories according to the quality and audible range of the signal: professional audio products, consumer audio products, and computer audio/multimedia systems.
● Control and data acquisition
1. Several control applications have been implemented using Motorola DSP56000 PDSPs, which function both as powerful microcontrollers and as fast digital signal processors.
2. Design examples include a PID controller and an adaptive controller on the TMS320C40, a distributed-memory parallel PDSP.

● Biometric Information Processing
1. Handwritten signature verification, one of the biometric authentication techniques, is cheap, reliable and non-intrusive to the person being authenticated. This verification method can be part of a variety of entrance monitoring and security systems.
● Image/Video processing
1. Existing image and video compression standards such as JPEG and MPEG are based on the DCT (discrete cosine transform) algorithm.
2. The JPEG 2000 image coding standard is based on the discrete wavelet
transform (DWT). These standards are often implemented in modern digital
cameras and digital camcorders where PDSPs will play an important role.
3. Medical imaging has become another fast-growing application area for PDSPs. A PDSP can be used as an on-line data processor for MRI.
4. It can perform real-time dynamic imaging such as cardiac imaging, angiography, and abdominal imaging.
Von Neumann Architecture
● In 1946, John Von Neumann developed the first computer architecture that allowed the computer to be programmed by code residing in memory.
● In this architecture, program instructions were stored in read-only memory (ROM).
● The Von Neumann architecture is used in the majority of microprocessors.
● In a computer with Von Neumann architecture, the CPU can be either reading
an instruction or reading/writing data from/to memory. Both cannot occur at
the same time since the instruction and data use the same signal pathways
and memory.
● The Von Neumann architecture consists of three buses : the data bus, the
address bus and the control bus.
● The data bus: transports data between the CPU and its peripherals. It is bidirectional; the CPU can read data from or write data to the peripherals.
● The address bus: the CPU uses the address bus to indicate which peripheral it wants to access and, within each peripheral, which specific register. The address bus is unidirectional: the CPU always writes the address, which is read by the peripherals.
● The control bus: carries the signals used to manage and synchronize the exchanges between the CPU and its peripherals, and indicates whether the CPU wants to read from or write to the peripheral.
● The main characteristic of the Von Neumann architecture is that it possesses only one bus system. The same bus carries all the information exchanged between the CPU and the peripherals, including the instruction codes as well as the data processed by the CPU.
Harvard Architecture
● The Harvard architecture physically separates memories for their instructions
and data, requiring dedicated buses for each of them. Instructions and
operands can therefore be fetched simultaneously.
● Most DSP processors use a modified Harvard architecture with two or three
memory buses; allowing access to filter coefficients and input signals in the
same cycle.
● Since it possesses two independent bus systems, the Harvard architecture is capable of simultaneously reading an instruction code and reading or writing a memory location or peripheral as part of the execution of the previous instruction.
● Since it has two memories, it is not possible for the CPU to mistakenly write into the program memory and therefore corrupt the code while it is executing.
● It is less flexible: it needs two independent memory banks, and these two resources are not interchangeable.
● The modified Harvard architecture used in DSPs employs multiport memory that has separate bus systems for program memory, data memory and input/output peripherals.
● It may also have multiple bus systems for the program memory alone or for the data memory alone. These multiple bus systems increase the complexity of the CPU, but allow it to access several memory locations simultaneously, thereby increasing the data throughput between memory and CPU.
VLIW Architecture
● Very Long Instruction Word (VLIW) processing increases the number of instructions that are processed per cycle.
● A VLIW is essentially a concatenation of several short instructions and requires multiple execution units, running in parallel, to carry out the instructions in a single cycle.
● A VLIW architecture executes multiple instructions per cycle and uses simple, regular instruction sets.
● A VLIW processor reads a relatively large group of instructions and executes them at the same time.
● The VLIW processor combines many simple instructions into a single long
instruction word that uses different registers.
● A language compiler or preprocessor separates program instructions into
basic operations that are performed by the processor in parallel.
● These operations are placed into a “very long instruction word” that the
processor can disassemble, and then transfer each operation to an
appropriate execution unit.
● For example, a group might contain four instructions, and the compiler ensures that those four instructions are not dependent on each other so that they can be executed simultaneously. Otherwise, it places "no-ops" (blank instructions) in the group where necessary, as in the sketch below.
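● The following C sketch models a hypothetical four-slot VLIW instruction word; the slot layout, opcode names and register numbers are illustrative assumptions, not the format of any real VLIW DSP.

#include <stdio.h>

typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_STORE } Opcode;

typedef struct {
    Opcode op;              /* operation for this execution unit */
    int dst, src1, src2;    /* register operands */
} Slot;

typedef struct {
    Slot slot[4];           /* four independent operations issued together */
} VliwWord;

int main(void)
{
    /* The "compiler" packs two independent operations and pads the
       remaining slots with no-ops, as described in the text. */
    VliwWord w = {{
        { OP_LOAD, 1, 10, 0 },   /* unit 0: load r1 <- mem[r10]    */
        { OP_MUL,  2,  3, 4 },   /* unit 1: mul  r2 <- r3 * r4     */
        { OP_NOP,  0,  0, 0 },   /* unit 2: no-op (dependency pad) */
        { OP_NOP,  0,  0, 0 },   /* unit 3: no-op (dependency pad) */
    }};
    for (int i = 0; i < 4; i++)
        printf("slot %d: opcode %d\n", i, w.slot[i].op);
    return 0;
}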
Advantages of VLIW architecture
● Increased performance
● Better compiler targets
● Potentially easier to program
● Potentially scalable
● More execution units can be added, allowing more instructions to be packed into the VLIW instruction word.
Disadvantages of VLIW architecture
● New kind of programmer or compiler complexity
● Program must keep track of instruction scheduling
● Increased memory use
● High power consumption
● Misleading MIPS ratings
Multiply Accumulate Unit
● The Multiply-Accumulate (MAC) operation is the basis of many digital signal
processing algorithms, notably digital filtering.
● The term "digital filter" refers to an algorithm by which a digital signal, or sequence of numbers, is transformed into another sequence of numbers termed the output digital signal.
● Digital filters involve signals in the digital domain and are used extensively in
applications such as digital image processing, pattern recognition, and
spectral analysis.
● In general, FIR filters are preferred for lower-order solutions and, since they do not employ feedback, they exhibit a naturally bounded response. They are simpler to implement and require one RAM location and one coefficient for each filter tap.
● For FIR filters the output of the filter is given by:

y(n) = \sum_{k=0}^{N-1} h(k) x(n-k)

where x(n) is the input to the filter, h(n) is the impulse response of the filter and y(n) is the output of the filter.
● To perform filtering with the above equation, the minimum requirement is to quickly multiply two values and add the result to a running sum (a C sketch of this loop is given at the end of this section).
● To make this possible, a fast dedicated hardware MAC using either fixed-point or floating-point arithmetic is mandatory.
● Characteristics of a typical fixed-point MAC include:
1. 16 x 16 bit 2’s complement inputs
2. 16 x 16 multiplier with 32-bit product in 25 ns
3. 32/40 bit accumulator
● In the TMS320C50, for example, the FIR equation can be efficiently
implemented using the instruction pair :

RPT   NM1            ; load N-1 into the repeat counter
MACD  HNM1, XNM1     ; multiply-accumulate with data move, repeated N times
● The first instruction, RPT NM1, loads the value N-1 into the repeat instruction counter and causes the multiply-accumulate with data move (MACD) instruction following it to be repeated N times.
● The MACD instruction performs a number of operations in one cycle :
1. Multiplies the data sample, x(n-k), in the data memory by the coefficient, h(k),
in the program memory.
2. Adds the previous product to the accumulator.
3. Implements the unit delay, symbolized by z^-1, by shifting the data sample, x(n-k), up to update the tapped delay line.
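● The following is a minimal C sketch of the FIR multiply-accumulate loop that the RPT/MACD pair performs in one cycle per tap; the function name, Q15 scaling and in-place delay-line shift are assumptions made for illustration, not TI library code.

#include <stdint.h>

/* N-tap FIR filter: one multiply-accumulate per coefficient, mirroring
   y(n) = sum_{k=0}^{N-1} h(k) x(n-k). Assumes Q15 coefficients whose
   magnitudes sum to less than 1, so the Q15 result does not overflow. */
int16_t fir_filter(const int16_t h[], int16_t x_delay[], int n_taps,
                   int16_t x_new)
{
    int32_t acc = 0;                 /* 32-bit accumulator, as in a MAC unit */

    /* Shift the tapped delay line (the MACD data move does this in place). */
    for (int k = n_taps - 1; k > 0; k--)
        x_delay[k] = x_delay[k - 1];
    x_delay[0] = x_new;              /* x_delay[k] now holds x(n-k) */

    /* Multiply-accumulate: h(k) * x(n-k), accumulated in 32 bits. */
    for (int k = 0; k < n_taps; k++)
        acc += (int32_t)h[k] * x_delay[k];

    return (int16_t)(acc >> 15);     /* rescale the Q30 sum back to Q15 */
}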
The Multiply-Accumulate (MAC) Function
● The MAC speed applies both to FIR and IIR filters.
● The complexity of the filter response dictates the number of MAC operations required per sample period.
● A multiply-accumulate step performs the following (a C analogue is sketched at the end of this section):
1. Reads a 16-bit data sample (pointed to by a register)
2. Increments the sample data pointer by 2
3. Reads a 16-bit coefficient (pointed to by another register)
4. Increments the coefficient pointer by 2
5. Sign-multiplies the 16-bit data and coefficient to yield a 32-bit result
6. Adds the result to the contents of a 32-bit register pair to accumulate
● The TMS320C54x multiply-accumulate (MAC) unit performs a 16x16 → 32-bit
fractional multiply-accumulate operation in a single instruction cycle.
● The multiplier supports signed/signed multiplication, signed/unsigned
multiplication and unsigned/unsigned multiplication. These operations allow
efficient extended-precision arithmetic.
● Many instructions using the MAC unit can optionally specify automatic
round-to-nearest rounding.
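● As a rough C analogue of the step list above (the pointer arithmetic and names are assumptions made for illustration, not the C54x register set):

#include <stdint.h>

/* One multiply-accumulate step with explicit pointers: read a 16-bit sample
   and a 16-bit coefficient, advance both pointers by one element (2 bytes),
   and add the signed 32-bit product to the accumulator. */
static int32_t mac_step(const int16_t **sample_ptr,
                        const int16_t **coeff_ptr,
                        int32_t acc)
{
    int16_t sample = *(*sample_ptr)++;    /* read sample, advance pointer   */
    int16_t coeff  = *(*coeff_ptr)++;     /* read coefficient, advance ptr  */
    return acc + (int32_t)sample * coeff; /* 16x16 -> 32-bit, then accumulate */
}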
Pipelining
● Most of the early microprocessors executed instructions entirely sequentially: after the execution of the first instruction, the next one starts.
● The problem with this is that it is extremely inefficient, since the second instruction has to wait until all the steps of the first instruction are completed.
● To improve the efficiency, advanced microprocessors and digital signal
processors use an approach called pipelining in which different phases of
operation and execution of instructions are carried out in parallel.
● That is, in modern processors the first step of execution is performed on the first instruction, and then, when that instruction passes to the next step, a new instruction is started.
● The steps in the pipeline are often called stages.
● The basic action of any microprocessor can be broken down into a series of
four simple steps. They are:
1. The Fetch phase (F) in which the next instruction is fetched from the address
stored in the program counter.
2. The decode phase (D) in which the instruction in the instruction register is
decoded and the address in the program counter is incremented.
3. The memory read (R) phase reads data from the data buses and also writes data to the data buses.
4. The Execute phase (X) executes the instruction currently in the instruction
register and also completes the write process.
● In a modern processor, the above four steps get repeated over and over
again until the program is finished executing.
● These are the four stages in classic RISC pipeline.
● Each phase takes a fixed but not equal amount of time.
● Pipelining a processor means breaking down its instruction into a series of
discrete pipeline stages which can be completed in sequence by specialized
hardware.
● Because an instruction’s lifecycle consists of four fairly distinct phases, the
instruction execution process is divided into a sequence of four discrete
pipeline stages, where each pipeline stage corresponds to a phase in the
standard instruction lifecycle.
● The number of pipeline stages is referred to as the pipeline depth. So a
four-stage pipeline has a pipeline depth of four.
● To understand pipelining better, let us assume that the number of stages is four and the execution time of an instruction is four nanoseconds.
● If we assume the time taken for each stage in the instruction is equal, then the
time taken for each stage is one nanosecond.
● The original single-cycle processor’s four nanosecond execution process is
now broken down into four discrete, sequential pipeline stages of one
nanosecond each in length.
● At the beginning of the first nanosecond, the first instruction enters the fetch stage.
● After that nanosecond is complete, the second nanosecond begins and the
first instruction moves on to the decode stage while the second instruction
enters the fetch stage.
● At the start of the third nanosecond, the first instruction advances to the
memory read stage, the second instruction advances to the decode stage,
and the third instruction enters the fetch stage.
● At the start of the fourth nanosecond, the first instruction advances to the execute stage, the second to the memory read stage, the third to the decode stage, and the fourth to the fetch stage.
● After the fourth nanosecond has fully elapsed and the fifth nanosecond starts, the first instruction has passed out of the pipeline and is now finished executing.
● We can say that at the end of the four nanoseconds (= four clock cycles) the pipelined processor has completed one instruction.
● At the start of the fifth nanosecond, the pipeline is now full and the processor
can begin completing instructions at a rate of one instruction per nanosecond.
● This one instruction/ns completion rate is a four-fold improvement over the single-cycle processor's completion rate of 0.25 instructions/ns (or 4 instructions every 16 nanoseconds).
● The pipelining stages for different DSPs are shown in Table below.
● The TMS320C54x has two additional phases: the pre-fetch (PF) phase, which stores the address of the instruction to be fetched, and the access (A) phase, which reads the address of the operand and modifies the auxiliary registers and stack pointer if required.
● Pipelining leads to dramatic improvements in system performance.
● The more stages that we can break the pipeline into, the more theoretical
speed we can get from it.
● For example, suppose it takes 12 clock cycles to handle all the steps needed to process an instruction. In theory, if that work is divided across a 4-stage pipeline, each stage takes 3 cycles, so the maximum throughput is 1 instruction every 3 cycles; with a 6-stage pipeline, each stage takes 2 cycles and the maximum throughput is 1 instruction every 2 cycles.
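● To visualise the fill behaviour described above, here is a small C simulation that prints which instruction occupies each of the four stages (F, D, R, X) on every cycle; the table layout and the five-instruction program are assumptions made purely for illustration.

#include <stdio.h>

#define STAGES 4        /* F, D, R, X */
#define N_INSTR 5       /* instructions to push through the pipeline */

int main(void)
{
    const char *stage_name[STAGES] = { "F", "D", "R", "X" };

    /* On cycle c, instruction i occupies stage s when i = c - s is valid. */
    for (int c = 0; c < N_INSTR + STAGES - 1; c++) {
        printf("cycle %d:", c + 1);
        for (int s = 0; s < STAGES; s++) {
            int i = c - s;                    /* instruction index in stage s */
            if (i >= 0 && i < N_INSTR)
                printf("  %s=I%d", stage_name[s], i + 1);
            else
                printf("  %s=--", stage_name[s]);
        }
        printf("\n");
    }
    /* After the pipeline fills (cycle 4), one instruction completes per cycle. */
    return 0;
}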
