Implementation of Basic DSP algorithms
Introduction
In this unit, we deal with implementations of DSP algorithms and
write programs that implement only the core algorithms
However, these programs can be combined with input/output
routines to create applications that work with specific hardware
Q-notation
FIR filters
IIR filters
Interpolation filters
Decimation filters
The Q-notation
DSP algorithm implementations deal with signals and coefficients
To use a fixed-point DSP device efficiently, one must consider
representing filter coefficients and signal samples using fixed-point
2's complement representation
Ex: N = 16, Range: -2^(N-1) to +2^(N-1) - 1 (-32768 to 32767)
Typically, filter coefficients are fractional numbers
To represent such numbers, the Q-notation has been developed
The Q-notation specifies the number of fractional bits
Ex: Q7
A commonly used notation for DSP implementations is Q15
In the Q15 representation, the least significant 15 bits represent the
fractional part of a number
In a processor where 16 bits are used to represent numbers, the Q15
notation uses the MSB to represent the sign of the number and the
rest of the bits represent the value of the number
In general, the value of a 16-bit Q15 number N represented as
b15 b14 ... b1 b0 is
N = -b15 + b14 x 2^-1 + ... + b0 x 2^-15
Range: -1 to 1 - 2^-15
Example 1: What values are represented by the 16-bit fixed-point
number N = 4000h in Q15 and Q7 notations?
Solution: Q15 notation: 0.100 0000 0000 0000 (N=0.5)
Q7 notation: 0100 0000 0.000 0000 (N=+128)
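The conversion in Example 1 can be checked with a short sketch. The helper names q_to_float and float_to_q are illustrative assumptions, not part of any DSP tool chain; the code only models 16-bit two's complement words with a chosen number of fractional bits.

```python
def q_to_float(word, frac_bits, width=16):
    """Interpret a raw two's complement word as a Qn fixed-point value."""
    word &= (1 << width) - 1
    if word & (1 << (width - 1)):      # sign bit set, so the value is negative
        word -= 1 << width
    return word / (1 << frac_bits)


def float_to_q(value, frac_bits, width=16):
    """Quantize a real value to a Qn word, saturating at the representable range."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    raw = int(round(value * (1 << frac_bits)))
    raw = max(lo, min(hi, raw))        # clamp instead of wrapping around
    return raw & ((1 << width) - 1)


print(q_to_float(0x4000, 15))          # value of 4000h read as Q15
print(q_to_float(0x4000, 7))           # value of 4000h read as Q7
```

The same bit pattern gives 0.5 in Q15 and 128 in Q7, because only the assumed position of the binary point changes.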
Multiplication of numbers represented using the Q-notation is
important for DSP implementations
Figure below shows typical cases encountered in such
implementations
N1 (16 bit)    N2 (16 bit)    N3 = N1 x N2 (32-bit product)
Q0             Q0             Q0
Q0             Q15            Q15
Q15            Q15            Q30
Multiplication of numbers represented using Q-notation
Program to multiply two Q15 numbers, i.e. N1xN2 = N1 * N2
where N1 and N2 are 16-bit numbers in Q15 notation and N1xN2 is the
16-bit result in Q15 notation
.mmregs ; memory-mapped registers
.data ; sequential locations
N1: .word 4000h ; N1 = 0.5 (Q15 number)
N2: .word 2000h ; N2 = 0.25 (Q15 number)
N1xN2: .space 10h ; space for N1 * N2
.text
.ref _c_int00
.sect ".vectors"
RESET: b _c_int00 ; reset vector
nop
nop
_c_int00
STM #N1, AR2 ; AR2 points to N1
LD *AR2+, T ; T reg =N1
MPY *AR2+, A ; A= N1 *N2 in Q30 notation
ADD #1, 14, A ; round the result
STH A, 1, *AR2 ; save N1 *N2 as Q15 number
NOP
NOP
.end
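The arithmetic performed by the MPY, ADD and STH sequence above can be modelled as follows. This is a minimal sketch and not a simulation of the processor: q15_mul and to_signed16 are hypothetical names, and no saturation is modelled (so -1 x -1 wraps around).

```python
def to_signed16(word):
    """Reinterpret a 16-bit word as a signed integer."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def q15_mul(n1, n2):
    """Multiply two Q15 words and return the rounded Q15 product as a 16-bit word."""
    acc = to_signed16(n1) * to_signed16(n2)   # Q15 x Q15 gives a Q30 product
    acc += 1 << 14                            # round: add half of one Q15 LSB
    return (acc >> 15) & 0xFFFF               # keep bits 30..15, as STH A, 1 does


print(hex(q15_mul(0x4000, 0x2000)))           # 0.5 x 0.25
```

With N1 = 4000h and N2 = 2000h the result is 1000h, which is 0.125 in Q15.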
FIR Filters
A finite impulse response (FIR) filter of order N can be described by
the difference equation
y(n) = sum_{k=0}^{N-1} h(k) x(n-k)
The expanded form is
y(n)=h(N-1)x(n-(N-1))+h(N-2)x(n-(N-2))+ ...h(1)x(n-1)+h(0)x(n)
A FIR filter implementation block diagram
The implementation requires a signal delay for each sample
The next output, y(n+1), is given as
y(n+1)=h(N-1)x(n-(N-2))+h(N-2)x(n-(N-3))+ ...+h(1)x(n)+h(0)x(n+1)
Figure below shows the memory organization for the implementation
of the filter
The filter coefficients and the signal samples are stored in two
circular buffers, each of a size equal to the filter length N
AR2 is used to point to the samples and AR3 to the coefficients
In order to start with the last product, the pointer register AR2 must
be initialized to access the signal sample x(n-(N-1)), and the pointer
register AR3 to access the filter coefficient h(N-1)
As each product is computed and added to the previous result, the
pointers advance circularly
At the end of the computation, the signal sample pointer is at the
oldest sample, which is replaced with the newest sample to proceed
with the next output computation
Organization of signal samples and filter coefficients in circular
buffers for a FIR filter implementation
Program to implement an FIR filter
It implements the following equation;
y(n)=h(N-1)x(n-(N-1))+h(N-2)x(n-(N-2))+ ...h(1)x(n-1)+h(0)x(n)
Where N = Number of filter coefficients = 16
h(N-1), h(N-2), ..., h(0) are the filter coefficients (Q15 numbers)
The coefficients are available in the file coeff_fir.dat
x(n-(N-1)), x(n-(N-2)), ..., x(n) are signal samples (integers)
The input x(n) is received from the data file: data_in.dat
The computed output y(n) is placed in a data buffer
.mmregs
.def _c_int00
.sect "samples"
InSamples .include "data_in.dat" ; Allocate space for x(n)s
OutSamples .bss y, 200, 1 ; Allocate space for y(n)s
SampleCnt .set 200 ; Number of samples to filter
.bss CoefBuf, 16, 1 ; Memory for coeff circular buffer
.bss SampleBuf, 16, 1 ; Memory for sample circular buffer
.sect "FirCoeff" ; Filter coeff (seq locations)
FirCoeff .include "coeff_fir.dat"
Nm1 .set 15 ; N–1
.text
_c_int00:
STM #OutSamples, AR6 ; clear o/p sample buffer
RPT #SampleCnt-1 ; RPT #k repeats the next instruction k+1 times
ST #0, *AR6+
STM #InSamples, AR5 ; AR5 points to InSamples buffer
STM #OutSamples, AR6 ; AR6 points to OutSample buffer
STM #SampleCnt-1, AR4 ; AR4 = Number of samples to filter - 1
CALL fir_init ; Init for filter calculations
SSBX SXM ; Select sign extension mode
loop:
LD *AR5+, A ; A = next input sample (integer)
CALL fir_filter ; Call Filter Routine
STH A, 1, *AR6+ ; Store filtered sample (integer)
BANZ loop, *AR4- ; Repeat till all samples filtered
nop
nop
nop
FIR Filter Initialization Routine
This routine sets AR2 as the pointer for the sample circular buffer
AR3 as the pointer for coefficient circular buffer
BK = Number of filter taps (circular buffer size)
AR0 = 1 = circular buffer pointer increment
fir_init:
ST #CoefBuf, AR3 ; AR3 is the CB Coeff Pointer
ST #SampleBuf, AR2 ; AR2 is the CB sample pointer
STM #Nm1+1, BK ; BK = number of filter taps
RPT #Nm1
MVPD #FirCoeff, *AR3+% ; Place coeff in circular buffer
RPT #Nm1 - 1 ; Clear circular sample buffer
ST #0h,*AR2+%
STM #1, AR0 ; AR0 = 1 = CB pointer increment
RET
nop
nop
nop
FIR Filter Routine
Enter with A=the current sample x(n)-an integer, AR2 pointing to the
location for the current sample x(n), and AR3 pointing to the q15
coefficient h(N-1)
Exit with A = y(n) as q15 number
fir_filter:
STL A, *AR2+0% ; Place x(n)in the sample buffer
RPTZ A, #Nm1 ;A= 0
MAC *AR3+0%, *AR2+0%, A ; A = filtered sum (q15)
RET
nop
nop
nop
.end
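The circular-buffer bookkeeping used by the FIR routine can be sketched in a few lines. This is an illustrative model under stated assumptions: the class name CircularFir is hypothetical, a single index replaces the AR2/AR3 pointers, and plain Python numbers replace Q15 arithmetic.

```python
class CircularFir:
    """FIR filter that keeps the last N samples in a circular buffer."""

    def __init__(self, coeffs):
        self.h = list(coeffs)              # h(0) .. h(N-1)
        self.n = len(self.h)
        self.buf = [0] * self.n            # sample buffer, initially cleared
        self.pos = 0                       # index of the oldest sample

    def step(self, x):
        self.buf[self.pos] = x             # newest sample replaces the oldest
        self.pos = (self.pos + 1) % self.n # pos now marks x(n-(N-1))
        acc = 0
        idx = self.pos
        for k in range(self.n - 1, -1, -1):    # start with h(N-1) * x(n-(N-1))
            acc += self.h[k] * self.buf[idx]
            idx = (idx + 1) % self.n           # advance circularly
        return acc


fir = CircularFir([1, 2, 3])
print([fir.step(x) for x in [1, 0, 0, 2]])
```

After each output the index again marks the oldest sample, which is the location the next input overwrites, matching the description of the pointer positions above.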
IIR Filters
An infinite impulse response (IIR) filter is represented by a transfer
function, which is a ratio of two polynomials in z
To implement such a filter, the difference equation representing the
transfer function can be derived and implemented using multiply and
add operations
To show such an implementation, we consider a second order transfer
function given by
H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 - a1 z^-1 - a2 z^-2)
Block diagram of second order IIR filter
w(n)=x(n)+a1w(n-1)+a2w(n-2)
y(n)=b0w(n)+b1w(n-1)+b2w(n-2)
Program for IIR filter
The transfer function is
H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 - a1 z^-1 - a2 z^-2)
This is equivalent to the equations
w(n) = x(n) + a1.w(n-1) + a2.w(n-2)
y(n) = b0.w(n) + b1.w(n-1) + b2.w(n-2)
.mmregs
.def _c_int00
.sect "samples"
InSamples .include "data_in.dat" ; Allocate space for x(n)s
OutSamples .bss y, 200, 1 ; Allocate buffer for y(n)s
SampleCnt .set 200 ; Number of samples to filter
; Intermediate variables (sequential locations)
Wn .word 0 ; initial w(n) = 0
wnm1 .word 0 ; initial w(n-1) = 0
wnm2 .word 0 ; initial w(n-2) = 0
.sect "coeff"
; Filter coefficients (sequential locations)
b0 .word 3431 ; b0 = 0.104
b1 .word -3356 ; b1 = -0.102
b2 .word 3431 ; b2 = 0.104
a1 .word -32767 ; a1 = -1
a2 .word 20072 ; a2 = 0.612
.text
_c_int00:
STM #OutSamples, AR6 ; Clear output sample buffer
RPT #SampleCnt-1 ; RPT #k repeats the next instruction k+1 times
ST #0, *AR6+
STM #InSamples, AR5 ; AR5 points to InSamples buffer
STM #OutSamples, AR6 ; AR6 points to OutSample buffer
STM #SampleCnt-1, AR4 ; AR4 = Number of samples to filter - 1
loop:
LD *AR5+, 15, A ; A = next input sample (q15)
CALL iir_filter ; Call Filter Routine
STH A, 1, *AR6+ ; Store filtered sample (integer)
BANZ loop,*AR4- ; Repeat till all samples filtered
nop
nop
nop
IIR Filter Subroutine
Enter with A = x(n) as q15 number
Exit with A = y(n) as q15 number
Uses AR2 and AR3
iir_filter
SSBX SXM ; Select sign extension mode
;w(n)=x(n)+ a1.w(n-1)+ a2.w(n-2)
STM #a2, AR2 ; AR2 points to a2
STM #wnm2, AR3 ; AR3 points to w(n-2)
MAC *AR2-,*AR3-, A ; A = x(n)+ a2.w(n-2)
;AR2 points to a1 & AR3 to w(n- 1)
MAC *AR2-, *AR3-, A ; A = x(n)+ a1.w(n-1)+ a2.w(n-2)
;AR2 points to b2 & AR3 to w(n)
STH A, 1, *AR3 ; Save w(n)
;y(n)=b0.w(n)+ b1.w(n-1)+ b2.w(n-2)
LD #0,A ;A=0
STM #wnm2, AR3 ; AR3 points to w(n-2)
MAC *AR2-,*AR3-, A ; A = b2.w(n-2)
;AR2 points to b1 & AR3 to w(n-1)
DELAY *AR3 ; w(n-1) -> w(n-2)
MAC *AR2-,*AR3-, A ; A = b1.w(n-1)+ b2.w(n-2)
;AR2 points to b0 & AR3 to w(n)
DELAY *AR3 ; w(n) -> w(n-1)
MAC *AR2,*AR3,A ; A = b0.w(n)+ b1.w(n-1)+ b2.w(n-2)
RET ; Return
Nop
Nop
Nop
.end
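The two difference equations implemented by the subroutine can be sketched as below. This is a floating-point model, not the Q15 program: iir_df2 is a hypothetical name, and the sign convention is the one used in the equations above, w(n) = x(n) + a1.w(n-1) + a2.w(n-2).

```python
def iir_df2(x, b, a):
    """Second-order IIR filter with b = (b0, b1, b2) and a = (a1, a2)."""
    b0, b1, b2 = b
    a1, a2 = a
    w1 = w2 = 0.0                          # w(n-1) and w(n-2), initially zero
    y = []
    for xn in x:
        w0 = xn + a1 * w1 + a2 * w2        # w(n)
        y.append(b0 * w0 + b1 * w1 + b2 * w2)
        w2, w1 = w1, w0                    # delay line: the two DELAY steps
    return y


print(iir_df2([1, 0, 0, 0], b=(1, 0, 0), a=(0.5, 0)))
```

Only two state variables are needed because the same delayed values w(n-1) and w(n-2) feed both the recursive and the non-recursive parts.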
Interpolation Filters
An interpolation filter is used to increase the sampling rate
The interpolation process involves inserting samples between the
incoming samples to create additional samples to increase the
sampling rate for the output
One way to implement an interpolation filter is to first insert zeros
between samples of the original sample sequence
The zero-inserted sequence is then passed through an appropriate
lowpass digital FIR filter to generate the interpolated sequence
The interpolation process is depicted in Figure
The interpolation process
Example
X(n) = [0 2 4 6 8 10] ;input sequence
Xz(n) = [0 0 2 0 4 0 6 0 8 0 10 0] ;zero inserted sequence
h(n) = [0.5 1 0.5] ;impulse sequence
Y(n) = [0 0 1 2 3 4 5 6 7 8 9 10 5 0] ;interpolated sequence y(n)
The kind of interpolation carried out in the examples is called linear
interpolation because the convolving sequence h(n) is derived based
on linear interpolation of samples
Further, in this case, the h(n) selected is just a second-order filter and
therefore uses just two adjacent samples to interpolate a sample
A higher-order filter can be used to base interpolation on more input
samples
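The linear interpolation example above can be reproduced with a short sketch. The function names zero_insert and convolve are illustrative assumptions; the code simply inserts zeros and convolves with h(n).

```python
def zero_insert(x, factor):
    """Follow every input sample with factor - 1 zeros."""
    out = []
    for s in x:
        out.append(s)
        out.extend([0] * (factor - 1))
    return out


def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


xz = zero_insert([0, 2, 4, 6, 8, 10], 2)
print(xz)
print(convolve(xz, [0.5, 1, 0.5]))
```

Each inserted zero is replaced by the average of its two neighbours, which is why this choice of h(n) gives linear interpolation.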
To implement an ideal interpolation, Figure below shows how an
interpolating filter using a 15-tap FIR filter and an interpolation
factor of 5 can be implemented
In this example, each incoming sample is followed by four zeros to
increase the number of samples by a factor of 5
The interpolated samples are computed using a program similar to
the one used for a FIR filter implementation
Interpolating filter using a 15-tap FIR filter and an interpolation factor
of 5
One drawback of using the implementation strategy depicted in
Figure above is that there are many multiplies in which one of the
multiplying elements is zero
Such multiplies need not be included in computation if the
computation is rearranged to take advantage of this fact
One such scheme, based on generating what are called polyphase
subfilters, is available for reducing the computation
For a case where the number of filter coefficients N is a multiple of
the interpolating factor L, the scheme implements the interpolation
filter using the equation
y(m+i) = sum_{k=0}^{N/L-1} h(kL+i) x(n-k), for i = 0, 1, ..., L-1 and m = nL
Figure below shows a scheme that uses polyphase subfilters to
implement the interpolating filter using the 15-tap FIR filter and an
interpolation factor of 5
In this implementation, the 15 filter taps are arranged as shown and
divided into five 3-tap subfilters
The input samples x(n), x(n-1) and x(n-2) are used five times to
generate the five output samples
This implementation requires 15 multiplies for every five output
samples, as opposed to 75 in the direct implementation shown earlier
A scheme that uses polyphase subfilters to implement the
interpolating filter using the 15-tap FIR filter and an interpolation
factor of 5
Implementation of interpolating FIR filter
Program to implement an interpolating FIR filter
The filter length is 15 and the interpolating factor is 5
It implements the equations;
y(m) = h(10)x(n-2) + h(5)x(n-1) + h(0)x(n)
y(m+1) = h(11)x(n-2) + h(6)x(n-1) + h(1)x(n)
y(m+2) = h(12)x(n-2) + h(7)x(n-1) + h(2)x(n)
y(m+3) = h(13)x(n-2) + h(8)x(n-1) + h(3)x(n)
y(m+4) = h(14)x(n-2) + h(9)x(n-1) + h(4)x(n)
where m = 5n and h(0), h(1), ... are the filter coefficients
(Q15 numbers) stored in data memory in the order: h(4), h(9), h(14),
h(3), h(8), h(13), h(2), h(7), h(12), h(1), h(6), h(11), h(0), h(5), h(10)
x(n), x(n-1), and x(n-2) are signal samples (integers) used in
computing the next five output samples
The input samples are obtained from a file and placed in memory
starting at address InSamples
The computed output samples are placed starting at data memory
location OutSamples
.mmregs
.def _c_int00
.sect "samples"
InSamples .include "data_in.dat" ; Incoming data (from a file)
InSampCnt .set 50 ; Input sample count
.bss sample, 3, 1 ; Input samples: x(n), x(n-1), x(n-2)
OutSamples .bss y, 250, 1 ; Allocate space for y(n)s
SampleCnt .set 250 ; Number of samples
Coeff .sect "Coeff"
.word 2560, 3072, 512 ; Filter coeffs h(4), h(9), h(14)
.word 2048, 3584, 1024 ; Filter coeffs h(3), h(8), h(13)
.word 1536, 4096, 1536 ; Filter coeffs h(2), h(7), h(12)
.word 1024, 3584, 2048 ; Filter coeffs h(1), h(6), h(11)
.word 512, 3072, 2560 ; Filter coeffs h(0), h(5), h(10)
CoeffEnd
Nm1 .set 2 ; (number of coeffs / interpolation factor) - 1
IFm1 .set 4 ; Interpolation factor - 1
.text
_c_int00:
SSBX SXM ; Select sign extension mode
RSBX FRCT ; Fractional mode off
stm #InSamples, ar6 ; ar6 points to the input samples
stm #InSampCnt-1, ar7 ; ar7 = input sample count - 1
stm #OutSamples, ar5 ; ar5 points to the output samples
rpt #SampleCnt-1 ; Reset output samples memory
st #0, *ar5+
stm #OutSamples, ar5 ; ar5 points to the output samples
stm #sample, ar3 ; ar3 points to current input samples
rpt #Nm1 ; Reset the input samples
st #0, *ar3+
INTloop1:
stm #CoeffEnd-1, ar2 ; ar2 points to the last coeff
stm #IFm1, ar4 ; ar4 = Interpolation factor - 1
INTloop2:
stm #sample+Nm1, ar3 ; ar3 points to last sample in use
stm #Nm1, ar1 ; ar1 = samples for use - 1
ld #0, A ; A = 0
NXTcoeff:
mac *ar2-, *ar3-, A ; Compute interpolated sample
banz NXTcoeff, *ar1-
sth A, 1, *ar5+ ; Store each interpolated sample
banz INTloop2, *ar4- ; Repeat for all five subfilters
stm #sample+Nm1-1, ar3 ; Delay the sample array
rpt #Nm1-1
delay *ar3-
ld *ar6+, A ; Get the next sample
stm #sample, ar2
stl A, *ar2 ; Place it in the sample buffer
banz INTloop1, *ar7- ; Repeat for all input samples
nop
nop
nop
.end
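The saving offered by the polyphase arrangement can be checked against the zero-insertion approach with a small model. This is a sketch under stated assumptions: the function names are hypothetical, plain integers are used instead of Q15 scaling, and the coefficient list is the one from the listing above rearranged as h(0) to h(14).

```python
def interp_direct(x, h, L):
    """Insert L - 1 zeros after each sample, then apply the full FIR filter."""
    xz = []
    for s in x:
        xz.append(s)
        xz.extend([0] * (L - 1))
    return [sum(h[k] * xz[m - k] for k in range(len(h)) if m - k >= 0)
            for m in range(len(xz))]


def interp_polyphase(x, h, L):
    """Compute the same outputs using L subfilters of len(h) // L taps each."""
    taps = len(h) // L                     # assumes len(h) is a multiple of L
    y = []
    for n in range(len(x)):
        for i in range(L):                 # subfilter i gives y(nL + i)
            acc = 0
            for k in range(taps):
                if n - k >= 0:
                    acc += h[k * L + i] * x[n - k]
            y.append(acc)
    return y


h = [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096,
     3584, 3072, 2560, 2048, 1536, 1024, 512]
x = [3, -1, 4, 1, -5, 9]
print(interp_polyphase(x, h, 5) == interp_direct(x, h, 5))
```

Both functions return the same sequence, but the polyphase version never multiplies by an inserted zero: it uses 3 multiplies per output instead of 15.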
Decimation Filters
A decimation filter is used to decrease the sampling rate
The decrease in sampling rate can be achieved by simply dropping
samples
For instance, if every other sample of a sampled sequence is
dropped, the sampling rate of the resulting sequence will be half
that of the original sequence
The problem with dropping samples is that the new sequence may
violate the sampling theorem, which requires that the sampling
frequency be greater than two times the highest frequency
content of the signal
To circumvent the problem of violating the sampling theorem, the
signal to be decimated is first filtered using a lowpass filter
The cutoff frequency of the filter is chosen so that it is less than half
the final sampling frequency
The filtered signal can be decimated by dropping samples
In fact, the samples that are to be dropped need not be computed at
all
Thus, the implementation of a decimator is just a FIR filter
implementation in which some of the outputs are not calculated
Figure below shows a block diagram of a decimation filter
The decimation process
Digital decimation can be implemented as depicted in Figure below
for an example of a decimation filter with decimation factor of 3
It uses a lowpass FIR filter with 5 taps
The computation is similar to that of a FIR filter
However, after computing each output sample, the signal array is
delayed by three sample intervals by bringing the next three samples
into the circular buffer to replace the three oldest samples
Implementation of decimation filter
Implementation of decimation filter
It implements the following equation:
y(m) = h(4)x(3n-4) + h(3)x(3n-3) + h(2)x(3n-2) + h(1)x(3n-1) + h(0)x(3n)
followed by the equation
y(m+1) = h(4)x(3n-1) + h(3)x(3n) + h(2)x(3n+1) + h(1)x(3n+2) + h(0)x(3n+3)
and so on, for a decimation factor of 3 and a filter length of 5
.mmregs
.def _c_int00
.sect "samples"
InSamples .include "data_in.dat" ; Allocate space for x(n)s
OutSamples .bss y,80,1 ; Allocate space for y(n)s
SampleCnt .set 240 ; Number of samples to decimate
.sect "FirCoeff" ; Filter coeff (sequential)
FirCoeff .include "coeff_dec.dat"
Nm1 .set 4 ; Number of filter taps - 1
.bss CoefBuf, 5, 1 ; Memory for coeff circular buffer
.bss SampleBuf, 5, 1 ; Memory for sample circular buffer
.text
_c_int00:
STM #OutSamples, AR6 ; Clear output sample buffer
RPT #SampleCnt/3-1 ; 80 output locations (240 inputs / 3)
ST #0, *AR6+
STM #InSamples, AR5 ; AR5 points to InSamples buffer
STM #OutSamples, AR6 ; AR6 points to OutSample buffer
STM #SampleCnt/3-1, AR4 ; AR4 = Number of output samples - 1
CALL dec_init ; Init for filter calculations
loop:
CALL dec_filter ; Call Filter Routine
STH A, 1, *AR6+ ; Store filtered sample (integer)
BANZ loop,*AR4- ; Repeat till all samples filtered
nop
nop
nop
Decimation Filter Initialization Routine
This routine sets AR2 as the pointer for the sample circular buffer,
and AR3 as the pointer for coefficient circular buffer
BK = Number of filter taps
AR0 = 1 = circular buffer pointer increment
dec_init:
ST #CoefBuf, AR3 ; AR3 is the CB Coeff Pointer
ST #SampleBuf, AR2 ; AR2 is the CB sample pointer
STM #Nm1+1, BK ; BK = number of filter taps
RPT #Nm1
MVPD #FirCoeff, *AR3+% ; Place coeff in circular buffer
RPT #Nm1 ; Clear circular sample buffer
ST #0h,*AR2+%
STM #1, AR0 ; AR0 = 1 = CB pointer increment
RET ; Return
nop
nop
nop
FIR Filter Routine
Enter with AR5 pointing to the next input sample, AR2 pointing to the
circular sample buffer, AR3 to the circular coeff buffer, and AR0 = 1
Exit with A = y(n) as q15 number
dec_filter :
LD *AR5+,A ; Place next 3 input samples
STL A, *AR2+0% ; into the signal buffer
LD *AR5+,A
STL A, *AR2+0%
LD *AR5+,A
STL A, *AR2+0%
RPTZ A, #Nm1 ; A = 0
MAC *AR3+0%, *AR2+0%, A ; A = filtered signal
RET ; Return
nop
nop
nop
.end
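The idea that a decimator is an FIR filter in which only every M-th output is computed can be sketched as follows. The function names are hypothetical, and the sketch assumes the outputs are taken at input indices 0, M, 2M, ..., as in the equations above; the assembly program loads three samples before its first output, so its output phase may differ.

```python
def fir_full(x, h):
    """Ordinary FIR filter: one output per input sample."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate(x, h, M):
    """Lowpass filter and keep every M-th output, skipping the others entirely."""
    y = []
    for n in range(0, len(x), M):          # only the retained outputs are computed
        y.append(sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0))
    return y


h = [1, 2, 3, 2, 1]                        # illustrative 5-tap coefficients
x = list(range(12))
print(decimate(x, h, 3))
print(fir_full(x, h)[::3])                 # same values, three times the work
```

The second line filters every sample and then discards two out of three; the first obtains the same result with one third of the multiply-accumulate operations.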
Implementation of FFT Algorithms
Introduction
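The N-point Discrete Fourier Transform (DFT) of x(n), a discrete
signal of length N, is given by
X(k) = sum_{n=0}^{N-1} x(n) e^(-j2πnk/N), k = 0, 1, ..., N-1
The Inverse DFT (IDFT) is given by
x(n) = (1/N) sum_{k=0}^{N-1} X(k) e^(j2πnk/N), n = 0, 1, ..., N-1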
Comparing the two equations, the differences between the DFT and the
IDFT are seen to be the sign of the argument of the exponent and the
multiplication factor 1/N
The computational complexity of computing the DFT and the IDFT is thus
the same (except for the additional multiplication factor in the IDFT)
The computational complexity in computing each X(k) and all the
X(k) is shown in the table below
In a typical signal processing system, shown in the figure, the signal
is processed by the DSP in the DFT domain
After processing, the IDFT is taken to return the signal to its original
domain
Although the forward and inverse transforms take a certain amount of
time, signal processing is carried out in the DFT domain because of the
advantages of transformed-domain manipulation
Transformed-domain manipulations are sometimes simpler
They are also more useful and powerful than time-domain
manipulation
For example, convolution in time domain requires one of the signals
to be folded, shifted and multiplied by another signal, cumulatively
Instead, when the signals to be convolved are transformed to DFT
domain, the two DFT are multiplied and inverse transform is taken
Thus, it simplifies the process of convolution
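The convolution property can be illustrated with a direct (slow) DFT. This is a minimal sketch, assuming both sequences are zero-padded to the length of the linear convolution; the function names are hypothetical and no FFT is used.

```python
import cmath


def dft(x, inverse=False):
    """Direct N-point DFT or IDFT, evaluated straight from the definition."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
               for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def convolve_via_dft(x1, x2):
    """Linear convolution computed by multiplying DFTs and inverting."""
    n = len(x1) + len(x2) - 1              # length of the convolved sequence
    a = dft(list(x1) + [0] * (n - len(x1)))
    b = dft(list(x2) + [0] * (n - len(x2)))
    y = dft([p * q for p, q in zip(a, b)], inverse=True)
    return [v.real for v in y]             # imaginary parts are rounding noise


print([round(v, 6) for v in convolve_via_dft([1, 2, 3], [1, 1])])
```

The folding, shifting and accumulating of time-domain convolution is replaced by one pointwise multiplication between the two transforms.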
DSP System
An FFT Algorithm for DFT Computation: As DFT / IDFT is part
of signal processing system, there is a need for fast computation of
DFT / IDFT
There are algorithms available for fast computation of DFT/IDFT
These are referred to as Fast Fourier Transform (FFT) algorithms
There are two FFT algorithms: Decimation-In-Time FFT
(DITFFT) and Decimation-In-Frequency FFT (DIFFFT)
The computational complexity of both algorithms is of the order
of N log2(N)
From the hardware /software implementation viewpoint the
algorithms have similar structure throughout the computation
In-place computation is possible reducing the requirement of large
memory locations
The features of FFT are tabulated in the table below
Consider an example of computation of 2 point DFT
The signal flow graph of 2 point DITFFT Computation is shown in
figure
The input / output relations are as in eq (6.3) which are arrived at
from eq (6.1)
Signal Flow graph for N=2
Similarly, the Butterfly structure in general for DITFFT algorithm is
shown in fig. 6.3
The signal flow graph for N=8 point DITFFT is shown in fig. 4
The relation between input and output of any Butterfly structure is
shown in eq (6.4) and eq (6.5)
Separating the real and imaginary parts, the four equations to be
realized in implementation of DITFFT Butterfly structure are as in
eq(6.6)
Observe that with N=2^M, the number of stages in the signal flow
graph = M, the number of complex multiplications = (N/2)log2(N) and the
number of complex additions = N log2(N)
Number of Butterfly Structures per stage = N/2
They are identical and hence in-place computation is possible
Also reusability of hardware designed for implementing Butterfly
structure is possible
However in case FFT is to be computed for a input sequence of
length other than 2^M the sequence is extended to N=2^M by
appending additional zeros
The process will not alter the information content of the signal
It improves frequency resolution
To make the point clear, consider a sequence whose spectrum is
shown in fig 6.5
The spectrum is sampled to get DFT with only N=10
The same is shown in fig 6
The variations in the spectrum are not traced or caught by the DFT
with N=10
For example, dip in the spectrum near sample no. 2, between sample
no.7 & 8 are not represented in DFT
With N increased to 16, the DFT plot is as shown in fig. 6.7
As depicted in fig 6.7, the approximation to the spectrum with N=16
is better than with N=10
Thus, increasing N to a suitable value as required by an algorithm
improves frequency resolution
Example 1: What minimum size FFT must be used to compute a
DFT of 40 points? What must be done to samples before the
chosen FFT is applied? What is the frequency resolution
achieved?
Solution: The minimum size FFT for a 40-point sequence is a 64-point FFT
The sequence is extended to 64 points by appending 24 zeros
The process improves frequency resolution from 2π/40 to 2π/64
Problem : Derive equations to implement a Butterfly encountered in a
DIFFFT implementation
Solution: Butterfly structure for DIFFFT:
The input / output relations are
Separating the real and imaginary parts,
Example 2: How many add/subtract and multiply operations are
needed to implement a general butterfly of DITFFT?
Solution: Referring to the 4 equations required to implement the DITFFT
butterfly structure, 6 add/subtract operations and 4 multiply
operations are needed
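The operation count can be seen by writing the butterfly with real arithmetic. The four equations are not reproduced in these notes, so this sketch assumes the standard radix-2 DIT butterfly A' = A + W.B and B' = A - W.B; the function name dit_butterfly is hypothetical.

```python
def dit_butterfly(ar, ai, br, bi, wr, wi):
    """Radix-2 DIT butterfly on real and imaginary parts.

    Returns (A'r, A'i, B'r, B'i) for A' = A + W*B and B' = A - W*B.
    """
    tr = br * wr - bi * wi     # real part of W*B: 2 multiplies, 1 subtract
    ti = br * wi + bi * wr     # imaginary part of W*B: 2 multiplies, 1 add
    return (ar + tr,           # 4 further add/subtract operations
            ai + ti,
            ar - tr,
            ai - ti)


# A = 1 + 2j, B = 3 - 1j, W = -1j
print(dit_butterfly(1, 2, 3, -1, 0, -1))
```

Counting the operations in the function body gives 4 multiplies and 6 add/subtract operations, in agreement with the solution above.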
Overflow and Scaling
In any processing system, the number of bits per data word is
fixed and is limited by the DSP processor used
A limited number of bits leads to overflow, which results in
erroneous answers
In Q15 notation, the range of numbers that can be represented is
-1 to 1 - 2^-15
If the value of a number exceeds these limits, there will be
underflow / overflow
Data is scaled down to avoid overflow
However, it is an additional multiplication operation
Scaling operation is simplified by selecting scaling factor of 2^-n
And scaling can be achieved by right shifting data by n bits
Scaling factor is defined as the reciprocal of maximum possible
number in the operation
Multiply all the numbers at the beginning of the operation by scaling
factor so that the maximum number to be processed is not more than
1
In the case of DITFFT computation, consider for example,
To find the maximum possible value for LHS term, Differentiate and
equate to zero
Thus scaling factor is 1/2.414=0.414
A scaling factor of 0.25, the nearest power of two below 0.414, is taken
so that it can be implemented by shifting the data by 2 positions to the
right
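The figure of 2.414 can be checked numerically. The equation being maximized is not reproduced in these notes, so this sketch assumes it is the real-part butterfly output A'r = Ar + Br.cos(t) + Bi.sin(t) with every input at its maximum magnitude of 1; the function names are hypothetical.

```python
import math


def max_butterfly_growth(steps=3600):
    """Largest value of 1 + |cos t| + |sin t| over one full turn of t."""
    return max(1 + abs(math.cos(t)) + abs(math.sin(t))
               for t in (2 * math.pi * i / steps for i in range(steps)))


def scale_by_shift(sample, n):
    """Scale an integer sample by 2**-n using an arithmetic right shift."""
    return sample >> n


growth = max_butterfly_growth()
print(round(growth, 3), round(1 / growth, 3))
print(scale_by_shift(16384, 2))            # 0.5 in Q15 scaled by 0.25
```

A right shift by two positions multiplies by 0.25, which is below the 0.414 bound, so the scaled butterfly output stays within the Q15 range under this assumption.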
The symbolic representation of Butterfly Structure is shown in fig
6.8
The complete signal flow graph with scaling factor is shown in fig
6.9
Thus scaling factor is 0.707
To achieve multiplication by right shift, it is chosen as 0.5
Example 3: A time-domain sequence of 73 elements is to be convolved
with another time domain sequence of 50 elements using DFT to
transform the two sequences, multiplying them, and then doing the
IDFT to obtain the resulting time-domain sequence. To implement
DFT or IDFT, the DITFFT algorithm is to be used. Determine the
total number of complex multiplications needed to implement the
convolution. Assume that each butterfly computation requires one
complex multiplication
Solution: Let x1(n) be of length 73 and x2(n) be of length 50
Length of convolved sequence = 73 + 50 - 1 = 122
Length of DFT or IDFT = nearest 2^n not less than 122 = 128
Two DFT and one IDFT each of length 128 are to be determined
Number of Butterfly Structures per stage =N/2=64
Number of stages = log2(N) =7
Total number of complex multiplications= 64x7x3=1344
Example 4: The computation in Example 3 is to be implemented on a
fixed point signal processor that takes 10 ns to do a real integer
multiplication. Determine the convolution computation time. If the
computation is to be implemented for a real time signal and each
time a new sample is received the transform is to be calculated.
Determine the highest frequency signal that can be handled by the
signal processor
Solution: The time for one real integer multiplication =10 ns
One complex multiplication = 4 real multiplications
The convolution computation time =1344x4x10ns=53760ns
The highest frequency signal that can be handled by the signal
processor is thus 1/53760 ns = 18.601 kHz
Alternatively, since (a+jb)(c+jd) = ac-bd + j(bc+ad)
= a(c+d) - d(a+b) + j(b(c-d) + d(a+b)),
the number of real multiplications per complex multiplication can be 3
The convolution computation time is then 1344x3x10 ns = 40320 ns
The highest frequency signal that can be handled by the signal
processor is then 1/40320 ns = 24.8 kHz
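The three-multiplication form and the timing figures of Examples 3 and 4 can be verified with a short sketch. The names cmul3 and convolution_time_ns are hypothetical; the butterfly count 64 x 7 x 3 and the 10 ns multiply time are taken from the examples.

```python
def cmul3(a, b, c, d):
    """(a + jb)(c + jd) using three real multiplications."""
    k = d * (a + b)                # shared product
    real = a * (c + d) - k         # equals ac - bd
    imag = b * (c - d) + k         # equals bc + ad
    return real, imag


BUTTERFLIES = 64 * 7 * 3           # per stage x stages x (two DFTs + one IDFT)


def convolution_time_ns(real_mults_per_complex, t_mult_ns=10):
    """Total multiplication time, one complex multiply per butterfly."""
    return BUTTERFLIES * real_mults_per_complex * t_mult_ns


print(cmul3(1, 2, 3, 4))           # compare with (1 + 2j) * (3 + 4j)
print(convolution_time_ns(4), convolution_time_ns(3))
```

The saving of one multiplication costs three extra additions, which is worthwhile only when multiplication is the slower operation.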
Bit-Reversed Index Generation
As noted in table 6.2, DITFFT algorithm requires input in bit
reversed order
The input sequence can be arranged in bit reverse order by reverse
carry add operation
Add half of DFT size (=N/2) to the present bit reversed index to get
next bit reverse index
And employ reverse carry propagation while adding bits from left to
right
The original index and bit reverse index for N=8 is listed in table 6.3
Consider an example of computing bit reverse index
Let the present bit-reversed index be 110
The next bit-reversed index is
  1 1 0
+ 1 0 0   (N/2 = 4)
-------
  0 0 1   (carry propagated from left to right)
There are addressing modes in DSP supporting bit reverse indexing,
which do the computation of reverse index
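The reverse-carry addition can be sketched in software as follows. This is an illustrative model of the technique, not of the processor's addressing hardware; the function names are hypothetical.

```python
def reverse_carry_add(index, increment, bits):
    """Add two numbers with the carry propagating from the MSB toward the LSB."""
    result = 0
    carry = 0
    for pos in range(bits - 1, -1, -1):        # leftmost bit first
        a = (index >> pos) & 1
        b = (increment >> pos) & 1
        total = a + b + carry
        result |= (total & 1) << pos
        carry = total >> 1                     # carry moves one place to the right
    return result


def bit_reversed_indices(n):
    """Generate all n bit-reversed indices by repeatedly adding n/2."""
    bits = n.bit_length() - 1                  # n is assumed to be a power of two
    idx, out = 0, []
    for _ in range(n):
        out.append(idx)
        idx = reverse_carry_add(idx, n // 2, bits)
    return out


print(format(reverse_carry_add(0b110, 0b100, 3), "03b"))
print(bit_reversed_indices(8))
```

Starting from 0 and adding N/2 = 4 each time yields the sequence 0, 4, 2, 6, 1, 5, 3, 7, the bit-reversed order for N = 8.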
Implementation of FFT on TMS320C54xx
The main program flow for the implementation of DITFFT is shown
in fig. 6.10
The subroutines used are _clear to clear all the memory locations
reserved for the results
_bitrev stores the data sequence x(n) in bit reverse order
_butterfly computes the four equations of computing real and
imaginary parts of butterfly structure
_spectrum computes the spectrum of x(n)
The Butterfly subroutine is invoked 12 times and the other
subroutines are invoked only once
The program is as follows
.mmregs
.def _c_int00
.data
; Reserve 8 locations for x(n)
;x(n) Q15 notation decimal value
xn0 .word 0 ; 0h 0.0
xn1 .word 16384 ; 4000h 0.5
xn2 .word 23170 ; 5A82h 0.707
xn3 .word -8192 ; E000h -0.25
xn4 .word 12345 ; 3039h 0.3767
xn5 .word 30000 ; 7530h 0.9155
xn6 .word 10940 ; 2ABCh 0.334
xn7 .word 12345 ; 3039h 0.3767
; Reserve 16 locations for X(k)
X0R .word 0 ;real part of X(0) =0
X0Im .word 0 ;imaginary part of X(0) =0
X1R .word 0
X1Im .word 0
X2R .word 0
X2Im .word 0
X3R .word 0
X3Im .word 0
X4R .word 0
X4Im .word 0
X5R .word 0
X5Im .word 0
X6R .word 0
X6Im .word 0
X7R .word 0
X7Im .word 0
; 8 locations for W08 to W38, twiddle factors
W08R .word 32767 ; cos(0)=1
W08Im .word 0 ; -sin(0)=0
W18R .word 23170 ; cos(pi/4)= 0.707
W18Im .word -23170 ; -sin(pi/4)= -0.707
W28R .word 0 ; cos(pi/2)= 0
W28Im .word -32767 ; -sin(pi/2)= -1
W38R .word -23170 ; cos(3pi/4)= -0.707
W38Im .word -23170 ; -sin(3pi/4)= -0.707
;temporary locations
TEMP1 .word 0
TEMP2 .word 0
;MAIN PROGRAM
.text
_c_int00:
SSBX SXM ; set sign extension mode bit of ST1
CALL _clear
CALL _bitrev
Clear subroutine is shown in fig. 6.11
Sixteen locations meant for final results are cleared
AR2 is used as pointer to the locations
Bit reverse subroutine is shown in fig. 6.12
Here, AR1 is used as pointer to x(n)
AR2 is used as pointer to X(k) locations
AR0 is loaded with 8 and used in bit reverse addressing
Instead of N/2 =4, it is loaded with N=8 because each X(k) requires
two locations, one for real part and the other for imaginary part
Thus, x(n) is stored in alternate locations, which are meant for real
part of X(k)
AR3 is used to keep track of number of transfers
Butterfly subroutine is invoked 12 times
Part of the subroutine is shown in fig. 6.13
The real and imaginary parts of the A and B input data of the butterfly
structure are divided by 4, which is the scaling factor
The real part of the A data, which is divided by 2, is stored in a temp location
It is used further in the computation of eq (3) and eq (4) of the butterfly
Division is carried out by shifting the data to the right by two places
AR5 points to real part of A input data, AR2 points to real part of B
input data and AR3 points to real part of twiddle factor while invoking
the butterfly subroutine
After all the four equations are computed, the pointers are in the same
position as they were when the subroutine is invoked
Thus, the results are stored such that in-place computation is achieved
Fig. 6.14 through 6.17 show the butterfly subroutine for the
computation of 4 equations
Figure 6.18 depicts the part of the main program that invokes
butterfly subroutine by supplying appropriate inputs, A and B to the
subroutine
The associated butterfly structure is also shown for quick reference
Figures 6.19 and 6.20 depict the main program for the computation
of 2nd and 3rd stage of butterfly