Transposed 3 Tap FIR Filter Design Using Consolidation of Pipelining and Parallel Processing Technique
Nehru Kandasamy, Research Fellow, Department of Electrical and Computer Engineering, National University of Singapore, Singapore, [email protected]
Nagarjuna Telagam, Assistant Professor, Department of Electrical, Electronics and Communication Engineering, GITAM University, Bangalore, India, 0000-0002-6184-6283
Anughna N, Assistant Professor, Department of Electrical, Electronics and Communication Engineering, GITAM University, Bangalore, India, [email protected]
Tirumalasetty Lohith Sai, Student, Department of Electrical, Electronics and Communication Engineering, GITAM University, Bangalore, India
Mooli Rushikeshava Reddy, Student, Department of Electrical, Electronics and Communication Engineering, GITAM University, Bangalore, India
Akula Rupanjali, Student, Department of Electrical, Electronics and Communication Engineering, GITAM University, Bangalore, India
[email protected] [email protected] [email protected]
Abstract— The speed of a DSP processor depends mainly on the execution time of its data path and memory elements. The main objective of this paper is to reduce the critical path of the system and thereby increase its speed. A combined pipelining and parallel processing technique is applied to a 3-tap FIR filter. The proposed 3-tap transposed FIR filter with combined pipelining and parallel processing gives a 31.33% improvement in power delay product compared to conventional designs. The proposed technique is operated at different supply voltages, and the simulation results are verified with Cadence software.
Keywords—Speed, FIR, critical path, parallel processing, Cadence tools
I. INTRODUCTION
With the help of the pipelining technique, improvements in speed, area, and power are achieved in FIR and IIR filters. Pipelined and non-pipelined FPGA implementations are compared, and the results show that the implemented filters provide better output quality than the existing system [1]. A new method for generating high-speed FIR filters is implemented on FPGA. The authors divide the design into two parts: the first part is an adder stage designed using the MCM algorithm, and the second part is a novel FPGA-based pipelining scheme in which scheduling is performed to gain the maximum speed. The synthesis results show that the proposed system gives a 54.1% reduction on average, and 88.2% speed is achieved [2]. The best area and power results are obtained using rounded truncated multipliers. The authors optimize the bit width and hardware resources without affecting the output signal precision or the frequency response [3]. A novel algorithm is designed to avoid the problem of pipelined multiple constant multiplications. RPAG outperforms the existing method, which depends on pipelining the solutions of the conventional MCM algorithm. A flexible cost evaluation is used, which enables optimization for FPGA [4]. A 1-GHz Razor FIR accelerator is designed in a 65 nm CMOS process. Timing errors are detected by using Razor latches on the critical paths. Error correction can be realized using two different mechanisms. An energy-efficiency improvement of 37% was achieved using the algorithm [5]. Several methods have been implemented to reduce the power consumption of the FIR filter, including serial adders/multipliers, Booth multipliers, and shift multipliers. The proposed FIR filters were analyzed using Xilinx software [6]. A parallel cyclic redundancy check is implemented based on pipelining, retiming and unfolding. The authors compare the parallel and serial implementations of CRC-9. The parallel implementation uses fewer clock cycles than the serial implementation, which increases the speed. The design is analyzed using Cadence tools [7]. To save hardware cost when the FIR filter is long, a new algorithm called iterated short convolution is designed. The authors consider a 576-tap filter, for which 16.7% to 42.1% of the multiplications are saved, improving efficiency compared to the existing system [8]. An FIR filter is designed using registered adders and hardwired shifts that work at high speed. The authors use a subexpression elimination algorithm that reduces the number of adders. The algorithm provides a 50% reduction in the number of slices and a 75% reduction in the number of LUTs. A 50% reduction in dynamic power consumption is observed [9]. Path delay is the parameter
978-1-7281-6828-9/20/$31.00 ©2020 IEEE
Authorized licensed use limited to: Rockwell Collins. Downloaded on November 01,2020 at 19:27:38 UTC from IEEE Xplore. Restrictions apply.
that affects the performance of any system. The proposed architecture for distributed arithmetic using parallel processing of ROM results in reduced delay and quick prototyping [10]. The register complexity of the direct form FIR filter is analyzed, and the structure is found to involve significantly fewer registers. This architecture has major advantages in area delay and energy efficiency for larger block sizes of the FIR filter. For an FIR filter of length 64, it provides 1.8 times less area delay. The delay calculations are taken from [11], and a low power delay product design using Cadence is explained in [12]. The basic arithmetic operations and multiplications in the FIR filter are explained in Cadence 180 nm technology [13], [14]. The major changes in the proposed system are the radix-4 multiplier used for multiplication and a change in the basic architecture of the FIR filter. In this method, all input tap values having similar coefficient values are combined and then multiplied by the respective coefficient [15]. A systolic array based digital filter used in the signal processing of electrocardiogram analysis is presented with datapath architectural innovations from a low power consumption perspective [16]. All other arithmetic operations can be performed successively using adder circuits; a Shannon logic based QCA efficient full adder circuit for arithmetic operations is presented in [17]. The efficient design and FPGA realization of a CIC based decimation filter structure for WiMAX applications is the focus of [18].
II. RELATED WORK
Efficient and optimized implementation of finite impulse response (FIR) filters is essential from a hardware perspective, as these filters are fundamental building blocks in many digital signal processing (DSP) systems. It is desirable to achieve high-speed, low-power, low-delay, small-area and low-cost designs, or at least to design these systems so that there is a reasonable trade-off among these factors. Usually, in real-time implementations of FIR filters, the transposed direct form (TDF) is employed, largely because of its inherent properties such as pipelining and a reduced critical path. The difference can be seen in Fig. 1, Fig. 2, Fig. 3 and Fig. 4.
Fig. 1. Direct form 3-tap FIR.
𝑞[𝑛] = 𝑘𝑝[𝑛] + 𝑙𝑝[𝑛 − 1] + 𝑚𝑝[𝑛 − 2] (1)
Here 𝑝[𝑛] is the input, 𝑞[𝑛] is the output, and 𝑘, 𝑙 and 𝑚 are the tap coefficients.
Fig. 2. Critical path of a direct form 3-tap FIR.
𝑞[𝑛] = 𝑘𝑝[𝑛] + 𝑙𝑝[𝑛 − 1] + 𝑚𝑝[𝑛 − 2] (2)
Fig. 3. Transposed direct form (TDF) of 3-tap FIR.
𝑞[𝑛] = 𝑘𝑝[𝑛] + 𝑙𝑝[𝑛 − 1] + 𝑚𝑝[𝑛 − 2]
Fig. 4. Critical path of the transposed direct form (TDF) of 3-tap FIR.
𝑞[𝑛] = 𝑘𝑝[𝑛] + 𝑙𝑝[𝑛 − 1] + 𝑚𝑝[𝑛 − 2] (3)
In the direct form filter, the critical path is one multiplier time (nm) plus two adder times (na), i.e.
Critical path = nm + 2na. (4)
The critical path of the TDF filter is
Critical path = nm + na, (5)
where nm = multiplier time and na = adder time.
It can be seen that using the TDF filter cuts down the critical path by one adder time.
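The equivalence of the two structures and the critical-path expressions (4) and (5) can be checked with a short Python sketch. It is only an illustration: the function names, the coefficient values, the input samples and the unit delays t_mult and t_add are assumptions, not values from this work.

```python
def fir_direct(p, k, l, m):
    """Direct form: the delay line holds the past inputs p[n-1] and p[n-2]."""
    d1 = d2 = 0
    q = []
    for sample in p:
        q.append(k * sample + l * d1 + m * d2)
        d1, d2 = sample, d1  # shift the input delay line
    return q


def fir_transposed(p, k, l, m):
    """Transposed direct form: the registers hold partial sums between adders."""
    r1 = r2 = 0  # r1 = l*p[n-1] + m*p[n-2], r2 = m*p[n-1]
    q = []
    for sample in p:
        q.append(k * sample + r1)
        r1, r2 = l * sample + r2, m * sample  # both registers update together
    return q


def critical_path_direct(t_mult, t_add):
    """Equation (4): one multiplier time plus two adder times."""
    return t_mult + 2 * t_add


def critical_path_transposed(t_mult, t_add):
    """Equation (5): one multiplier time plus one adder time."""
    return t_mult + t_add


# Illustrative values (assumed, not taken from the paper)
samples = [1, 2, 3, 4, 5]
k, l, m = 2, 3, 4
print(fir_direct(samples, k, l, m))
print(fir_transposed(samples, k, l, m))
print(critical_path_direct(t_mult=4, t_add=1),
      critical_path_transposed(t_mult=4, t_add=1))
```

Both functions realize equation (1); only the position of the registers differs, which is what shortens the path between two registers in the transposed form.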
Parallel processed, pipelined, retimed direct form FIR
In the implementation of this FIR filter, 3-stage parallel processing is applied together with pipelining and retiming to a direct form FIR. The sampling rate of this FIR increases by three times. The structure has 3 retiming registers.
Fig. 5. Untransposed, parallel processed, 3-stage pipelined, retimed FIR.
𝑞(3𝑥 + 2) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (6)
𝑞(3𝑥 + 1) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (7)
𝑞(3𝑥) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (8)
The critical path of this structure includes one multiplier time and one adder time, as shown in Fig. 6.
Fig. 6. The critical path of the untransposed, parallel processed, 3-stage pipelined, retimed FIR.
Critical path = nm + na. (9)
𝑞(3𝑥 + 2) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (10)
𝑞(3𝑥 + 1) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (11)
𝑞(3𝑥) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (12)
Fig. 7 shows the implementation of the 3-tap filter using the direct form with combined pipelining and parallel processing.
Fig. 7. Cadence implementation of the untransposed, parallel processed, 3-stage pipelined, retimed FIR.
III. PROPOSED METHOD
The TDF FIR filter, which is parallel processed, pipelined and retimed, is shown in Fig. 8.
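The idea of 3-parallel processing can be sketched in Python under the standard block-processing assumption: each clock cycle consumes three input samples and produces three outputs, and each output is computed from its own input sample and the two preceding ones. In this sketch b is the block index and k, l, m are the coefficients of equation (1); this notation, the function names and the test values are assumptions made for illustration and are not taken from this work.

```python
def fir_serial(p, k, l, m):
    """Reference filter: q[n] = k*p[n] + l*p[n-1] + m*p[n-2], one sample per cycle."""
    padded = [0, 0] + list(p)  # zero initial conditions
    return [k * padded[n + 2] + l * padded[n + 1] + m * padded[n]
            for n in range(len(p))]


def fir_3_parallel(p, k, l, m):
    """Block filter: samples 3b, 3b+1 and 3b+2 are processed in the same cycle."""
    assert len(p) % 3 == 0, "sketch assumes a whole number of 3-sample blocks"
    q = []
    prev1 = prev2 = 0  # p(3b-1) and p(3b-2), carried over from the previous block
    for b in range(len(p) // 3):
        p0, p1, p2 = p[3 * b: 3 * b + 3]
        q0 = k * p0 + l * prev1 + m * prev2  # q(3b)
        q1 = k * p1 + l * p0 + m * prev1     # q(3b+1)
        q2 = k * p2 + l * p1 + m * p0        # q(3b+2)
        q.extend([q0, q1, q2])
        prev1, prev2 = p2, p1
    return q


# Illustrative values (assumed, not taken from the paper)
samples = [1, 2, 3, 4, 5, 6]
print(fir_serial(samples, 2, 3, 4))
print(fir_3_parallel(samples, 2, 3, 4))  # same outputs, produced in 2 cycles instead of 6
```

Because three outputs are produced per clock cycle, the same clock period gives three times the sample rate, which corresponds to the threefold increase in sampling rate described for the parallel structure.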
Fig. 8. Parallel processed, pipelined, retimed TDF FIR.
𝑞(3𝑥 + 2) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (13)
𝑞(3𝑥 + 1) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (14)
𝑞(3𝑥) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (15)
The critical path of it is:
Fig. 9. The critical path of the parallel processed, pipelined, retimed TDF FIR.
Critical path = nm + na. (16)
𝑞(3𝑥 + 2) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (17)
𝑞(3𝑥 + 1) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (18)
𝑞(3𝑥) = 𝑥𝑝(3𝑥) + 𝑦𝑝(3𝑥) + 𝑧𝑝(3𝑥) (19)
Fig. 8 shows the parallel processed, pipelined and retimed TDF filter, and expressions (13), (14) and (15) describe the operation of the circuit. Similarly, Fig. 9 shows the critical path of the circuit in Fig. 8; the critical path is given in expression (16), and the operation of the circuit is explained by expressions (17), (18) and (19). The design uses 180 nm technology in Cadence software. The implementation of the proposed system is shown in Fig. 10.
Fig. 10. Cadence implementation of the parallel processed, pipelined, retimed TDF FIR.
IV. RESULT ANALYSIS
Tables I, II and III show the results of the implementation of the 3-tap FIR filter. Parameters such as delay, power, and power delay product are calculated using Cadence software in 180 nm technology. The delay is measured in nanoseconds and the power in watts; their product, the power delay product, is expressed in nanosecond-watts. In Tables I, II and III, three different structures, namely the direct form FIR, the transposed FIR and the parallel processed, pipelined FIR filter, are designed using Cadence software.
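As a worked illustration of the power delay product defined above, the following Python sketch multiplies delay by power and compares two designs. The sample numbers are the Vdc = 1.8 V delay and power entries of Table III. The function names are hypothetical, and the relative-reduction formula is a common definition assumed here, not one stated in this work.

```python
def power_delay_product(delay_ns, power_w):
    """Power delay product in ns-W: delay (ns) multiplied by power (W)."""
    return delay_ns * power_w


def absolute_reduction(reference, proposed):
    """Difference between a reference value and a proposed value, in the same unit."""
    return reference - proposed


def relative_reduction_percent(reference, proposed):
    """Reduction expressed as a percentage of the reference value (assumed definition)."""
    return 100.0 * (reference - proposed) / reference


# Vdc = 1.8 V entries of Table III: (delay in ns, power in W)
table_1v8 = {
    "direct": (0.915, 0.548),
    "transposed": (0.555, 1.347),
    "parallel_pipelined": (0.437, 0.615),
}

pdp = {name: power_delay_product(d, w) for name, (d, w) in table_1v8.items()}
for name, value in pdp.items():
    print(f"{name}: {value:.3f} ns-W")

print(absolute_reduction(pdp["transposed"], pdp["parallel_pipelined"]))
print(relative_reduction_percent(pdp["transposed"], pdp["parallel_pipelined"]))
```

With the product values printed in Table III, the absolute difference between the transposed FIR and the parallel processed, pipelined FIR at 1.8 V is 0.747 - 0.269, about 0.478 ns-W, and the difference from the direct form FIR is 0.502 - 0.269, about 0.233 ns-W.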
Pipelining reduces the critical path; it can be used either to increase the sampling speed or to reduce the power consumption at the same speed in a digital signal processing system.
In parallel processing, multiple outputs are computed in parallel. It is used to reduce power consumption and also to increase the speed.
TABLE I. Comparison of power delay product for voltages of 1.3 V and 1.4 V
                                 Vdc = 1.3 V                        Vdc = 1.4 V
Methods                          Delay   Power    Speed-Power       Delay   Power    Speed-Power
                                 (nsec)  (Watts)  Product (ns-W)    (nsec)  (Watts)  Product (ns-W)
Direct form FIR                  1.44    0.311    0.449             1.274   0.335    0.427
Transposed FIR                   0.837   0.99     0.831             0.680   1.221    0.831
Parallel process, Pipelined FIR  0.671   0.316    0.212             0.597   0.339    0.202
Table I shows the simulation results. The improvement in power delay product of the parallel processed, pipelined FIR filter is 22.5% compared to the direct form FIR filter at 1.4 V. The delay of the parallel processed, pipelined 3-tap filter is improved by 16.6% compared to the transposed FIR.
TABLE II. Comparison of power delay product for voltages of 1.5 V and 1.6 V
                                 Vdc = 1.5 V                        Vdc = 1.6 V
Methods                          Delay   Power    Speed-Power       Delay   Power    Speed-Power
                                 (nsec)  (Watts)  Product (ns-W)    (nsec)  (Watts)  Product (ns-W)
Direct form FIR                  1.149   0.358    0.411             1.053   0.382    0.402
Transposed FIR                   0.687   1.145    0.787             0.680   1.221    0.831
Parallel process, Pipelined FIR  0.541   0.362    0.196             0.498   0.386    0.192
Table II shows the simulation results. The improvement in power delay product of the parallel processed, pipelined FIR filter is 12.5% compared to the direct form FIR filter at 1.5 V. The delay of the parallel processed, pipelined 3-tap filter is improved by 26.6% compared to the transposed FIR at 1.6 V.
TABLE III. Comparison of power delay product for voltages of 1.7 V and 1.8 V
                                 Vdc = 1.7 V                        Vdc = 1.8 V
Methods                          Delay   Power    Speed-Power       Delay   Power    Speed-Power
                                 (nsec)  (Watts)  Product (ns-W)    (nsec)  (Watts)  Product (ns-W)
Direct form FIR                  0.976   0.518    0.506             0.915   0.548    0.502
Transposed FIR                   0.464   0.409    0.189             0.555   1.347    0.747
Parallel process, Pipelined FIR  0.464   0.409    0.189             0.437   0.615    0.269
The delay is calculated for different values of voltage. The improvement in power delay product, which is calculated as the product of delay and power, is observed at a voltage of 1.8 V. The improvement is 47.8% for the parallel processed, pipelined FIR filter compared to the transposed FIR filter. Similarly, the improvement is 23.3% for the parallel processed, pipelined FIR filter compared to the direct form FIR filter. At a voltage of 1.7 V, the improvement in power delay product of the parallel processed, pipelined FIR is 31.7% compared to the direct form 3-tap filter.
The transposed structure has 20% less delay compared to the direct form filter. The combination of parallel processing and pipelining in a transposed structure gives better results than conventional FIR filters.
V. CONCLUSION
The proposed 3-tap transposed FIR filter with combined pipelining and parallel processing gives a 31.33% improvement in power delay product compared to conventional designs. The improvement in power delay product, which is calculated as the product of delay and power, is observed at a voltage of 1.8 V. The improvement is 47.8% for the parallel processed, pipelined FIR filter compared to the transposed FIR filter. Similarly, the improvement is 23.3% for the parallel processed, pipelined FIR filter compared to the direct form FIR filter. The proposed FIR filter structure has the least number of adders and multipliers in the critical path, which increases the speed of the DSP processor, and it is validated in 90 nm and 180 nm technology. The transposed FIR filter is more suitable for low power MAC design and imaging devices.
REFERENCES
[1] Kaur, R., Raman, A., Singh, H., & Malhotra, J. (2011). Design and Implementation of High Speed IIR and FIR Filter using Pipelining. International Journal of Computer Theory and Engineering, 3(2), 292.
[2] Kumm, M., & Zipf, P. (2011, December). High speed low complexity FPGA-based FIR filters using pipelined adder graphs. In Field-Programmable Technology (FPT), 2011 International Conference on (pp. 1-4). IEEE.
[3] Hsiao, S. F., Jian, J. H. Z., & Chen, M. C. (2013). Low-cost FIR filter designs based on faithfully rounded truncated multiple constant multiplication/accumulation. IEEE Transactions on Circuits and Systems II: Express Briefs, 60(5), 287-291.
[4] Kumm, M., Zipf, P., Faust, M., & Chang, C. H. (2012, May). Pipelined adder graph optimization for high speed multiple constant multiplication. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on (pp. 49-52). IEEE.
[5] Whatmough, P. N., Das, S., & Bull, D. M. (2014). A low-power 1-GHz Razor FIR accelerator with time-borrow tracking pipeline and approximate error correction in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 49(1), 84-94.
[6] Rashidi, B., Rashidi, B., & Pourormazd, M. (2011, April). Design and Implementation of Low Power Digital FIR Filter based on low power multipliers and adders on Xilinx FPGA. In Electronics Computer Technology (ICECT), 2011 3rd International Conference on (Vol. 2, pp. 18-22). IEEE.
[7] Singh, S., Sujana, S., Babu, I., & Latha, K. (2013). VLSI Implementation of Parallel CRC Using Pipelining, Unfolding and Retiming. IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), 2(5).
[8] Cheng, C., & Parhi, K. K. (2004). Hardware efficient fast parallel FIR filter structures based on iterated short convolution. IEEE Transactions on Circuits and Systems I: Regular Papers, 51(8), 1492-1500.
[9] Mirzaei, S., Hosangadi, A., & Kastner, R. (2007, October). FPGA implementation of high speed FIR filters using add and shift method. In Computer Design, 2006. ICCD 2006. International Conference on (pp. 308-313). IEEE.
[10] Subathradevi, S., & Vennila, C. (2015). Modified Architecture for Distributed Arithmetic with Optimized Delay using Parallel Processing. Indian Journal of Science and Technology, 8(24).
[11] Arun, V., K. D. Naresh, T. Ravinder, R. Karthik, K. R. Sankit, and T. Nagarjuna. "Implementation of High-Speed Digital Reconfigurable FIR filter using Low Power Carry Look Ahead adder–Review." Journal of Advanced Research in Dynamical and Control Systems 3 (2018): 1217-1221.
[12] Telagam, Nagarjuna, and Nehru Kandasamy. "Low Power Delay Product 8-bit ALU design using decoder and data selector." Majlesi Journal of Electrical Engineering 12, no. 1 (2018): 103-108.
[13] Kandasamy, N., Ahmad, F., Reddy, S., Telagam, N., & Utlapalli, S. (2018). Performance evolution of 4-b bit MAC unit using hybrid GDI and transmission gate based adder and multiplier circuits in 180 and 90 nm technology. Microprocessors and Microsystems, 59, 15-28. https://doi.org/10.1016/j.micpro.2018.03.003
[14] Nehru, K., & Linju, T. T. (2017). Design of 16 Bit Vedic Multiplier Using Semi-Custom and Full Custom Approach. Journal of Engineering Science & Technology Review, 10(2).
[15] Karthick, Sa, Sa Valarmathy, and E. Prabhu. "Reconfigurable FIR Filter With Radix-4 Array Multiplier." Journal of Theoretical & Applied Information Technology 57, no. 3, pp 326-336, (2013).
[16] Karthick, S., S. Valarmathy, and E. Prabhu. "Low power systolic array based digital filter for DSP applications." The Scientific World Journal 2015 (2015). pp-1-6. https://doi.org/10.1155/2015/592537
[17] Kandasamy, Nehru, Firdous Ahmad, and Nagarjuna Telagam. "Shannon logic based novel qca full adder design with energy dissipation analysis." International Journal of Theoretical Physics, 57, no. 12 (2018): 3702-3715. https://doi.org/10.1007/s10773-018-3883-3
[18] Jayaprakasan, V., S. Vijayakumar, and Pandya Vyomal Naishadhkumar. "Design of CIC based decimation filter structure using FPGA for WiMAX applications." IEICE Electronics Express 16, no. 7 (2019): 20190074-20190074.