This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSII.2015.2435522, IEEE Transactions on Circuits and Systems II: Express Briefs

A Novel Memory-Based FFT Architecture for Real-Valued Signals
Based on Radix-2 Decimation-In-Frequency Algorithm
Zhen-guo Ma, Xiao-bo Yin, and Feng Yu, Member, IEEE

Abstract: This brief presents a novel memory-based architecture for computing the fast Fourier transform (FFT) of real-valued signals, based on the radix-2 decimation-in-frequency (DIF) algorithm. A new stage-partition strategy for the real fast Fourier transform (RFFT) is proposed to minimize the number of computation clock cycles and maximize the utilization of the processing element (PE). The PE employed in our RFFT architecture can process four inputs in parallel by using two radix-2 butterflies and only two multiplexers. The proposed memory-addressing scheme and the control of the multiplexers can be expressed in terms of a counter according to the RFFT computation stage. Furthermore, the proposed RFFT architecture can be extended to more PEs in two dimensions. Compared with prior works, the proposed RFFT processors require fewer computation cycles and less hardware. The experiment shows that the proposed processor reduces the computation cycles by 17.5% for a 32-point RFFT computation compared with a recently presented work, while maintaining lower hardware usage and lower complexity in the PE design.

Index Terms: memory-based, fast Fourier transform (FFT), real-valued signals, real fast Fourier transform (RFFT), memory-addressing scheme.

I. INTRODUCTION

The fast Fourier transform (FFT) is widely used in digital signal processing, for example in speech, audio, image, radar, and biomedical signal processing [1]. The FFT generally operates on complex numbers, and much research has been done on efficient architectures for the complex fast Fourier transform (CFFT) [2]. Interest in computing the FFT of real-valued signals is increasing, since most physical signals are real [3]-[5]. Until recently, however, little research had been done on optimizing architectures for the FFT of real-valued signals.

The architectures presented for the CFFT cannot achieve the same efficiency for the RFFT, because when the input samples are real-valued, the spectrum is symmetric and approximately half of the operations are redundant [6], [7]. Most RFFT architectures can be divided into two categories: pipelined [8] and memory-based [9]. Taking advantage of structural regularity in VLSI implementation, some pipelined RFFT designs achieve shorter computing time by employing many processing elements (PEs). Memory-based RFFT architectures employ one or several PEs to trade off hardware cost against speed in low- and moderate-speed applications, and are suited to large-size RFFT computation. These architectures are adopted in many applications, such as optical coherence tomography (OCT) in image processing [1], orthogonal frequency-division multiplexing (OFDM) and discrete multi-tone (DMT) in communication [9], and wireless sensor networks [10]. The focus of this brief is on memory-based architectures for RFFT computation, which are also referred to as continuous-flow or in-place architectures. Some memory-based RFFT architectures have been proposed to achieve low hardware usage and high hardware resource utilization [10]-[12].

In this brief, we propose a novel memory-based architecture that computes the RFFT based on the modified radix-2 decimation-in-frequency (DIF) algorithm in [13]. The modified algorithm requires the lowest number of operations for radix-2 by computing only half of the output samples. It separates the data into real and imaginary components, with real data paths in a regular flow graph. As a result, the word length of the required memory can be W instead of 2W, where W is the word length chosen to represent either the real or the imaginary component. An in-place FFT architecture based on the modified radix-2 algorithm [12] was recently proposed to achieve a lower area-time (AT) product for the RFFT, where area corresponds to the data-path area and time corresponds to the number of cycles. However, the RFFT architecture in [12] does not achieve the best hardware utilization. The key contribution of this brief is the design of a memory-based architecture based on a new strategy of RFFT stage partition that achieves a lower AT product than prior works.

The organization of this brief is as follows. Section II first describes the radix-2 RFFT algorithm and then presents a new strategy of RFFT stage partition that achieves fewer computation cycles. The RFFT architecture and the exploitation of parallelism in two dimensions are detailed in Section III. Finally, the proposed architecture is compared with prior designs and the implementation results are shown in Section IV.

II. NEW STRATEGY OF RFFT STAGE PARTITION

The N-point discrete Fourier transform (DFT) of a sequence x(n) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad 0 \le k \le N-1, \tag{1}$$

where W_N = e^{-j(2π/N)} is often referred to as the twiddle factor. When the samples are real, we have X(k) = X*(N - k). Therefore, it is not necessary to compute all of the FFT coefficients. A modified radix-2 flow graph [13] that contains only real data paths arranged in a regular way is shown in Fig. 1. The N-point RFFT computation can be divided into n stages, where n = log2 N. As the entire data path is real, the word length of the required memory can be reduced to half of that needed when the samples are complex. Furthermore, all the butterflies in the flow graph process only real samples, so they require two real adders instead of two complex adders. The proposed architecture takes advantage of the reduced operation count of the real fast Fourier transform relative to the complex fast Fourier transform.
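The symmetry X(k) = X*(N - k) that makes half of the outputs redundant can be checked numerically. The following is a minimal Python sketch that evaluates the DFT definition directly; it is not the radix-2 flow graph of the brief, and the function names and test data are illustrative assumptions.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def max_symmetry_error(spectrum):
    """Largest |X(k) - conj(X(N-k))| over k = 1..N-1."""
    n_points = len(spectrum)
    return max(
        abs(spectrum[k] - spectrum[n_points - k].conjugate())
        for k in range(1, n_points)
    )

# Arbitrary real-valued test data (hypothetical, chosen only for illustration).
samples = [float((3 * n * n + 1) % 7) for n in range(32)]
spectrum = dft(samples)
symmetry_error = max_symmetry_error(spectrum)
```

For real input, `symmetry_error` stays at the level of floating-point rounding, and X(0) and X(N/2) are purely real, which is why only about half of the coefficients need to be computed and stored.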

1549-7747 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
[Fig. 1 flow graph: five stages (1st to 5th), inputs x(0) to x(31), outputs X(0), X(16) and the Xr(k), Xi(k) terms, with twiddle-factor labels on the internal branches.]
Fig. 1. Traditional strategy of stage partition for 32-point RFFT.

The traditional strategy of stage partition for the 32-point RFFT computation is shown in Fig. 1 [8], [13]. The computations of the first stage and the last stage contain no multiplication. There are N/2 complex additions in the first stage, while only N/4 complex additions are computed in the second stage. Based on the traditional strategy of stage partition, an in-place RFFT architecture [12] was recently proposed to achieve a lower AT product. For the processor with one PE, the number of computation cycles required for an N-point RFFT is N/4*log2N. The in-place RFFT architecture does not fully utilize the PE, because the complex multiplier in that PE is idle during the first and the last stage, and one complex adder is bypassed during the middle stages, especially the second stage.

To further reduce the RFFT computation clock cycles, a new strategy of stage partition is proposed, as shown in Fig. 2. Multiplication and addition both appear in all stages of the RFFT except the last two. The multiplication can be ignored when the twiddle factor is W0, as W0 is the real value one. The numbers of complex multiplications and complex additions required in each stage except the last two can then be taken as N/4 and N/2, respectively. Thus, the PE employed in the proposed architecture contains one complex multiplier and two complex adders and can process four inputs in parallel. Compared with the traditional strategy, there is only one real addition in the last stage, and more multiplications and additions are processed in the first two stages. The number of computation cycles required for an N-point RFFT is only N/4*(log2N-1)+1 in the proposed processor with one PE. The proposed RFFT architecture based on the new strategy of stage partition therefore needs fewer computation cycles than the traditional one. The reduction in computation cycles for an N-point RFFT can be calculated using equation (2):

$$\left(1 - \frac{\frac{N}{4}(\log_2 N - 1) + 1}{\frac{N}{4}\log_2 N}\right) \times 100\% = \frac{1 - 4/N}{\log_2 N} \times 100\% \tag{2}$$

When N is 32, the reduction reaches 17.5%.

Fig. 2. Presented strategy of stage partition for 32-point RFFT.

III. PROPOSED ARCHITECTURE

Fig. 3 shows the high-level architecture of the proposed RFFT processor with one PE. Four memory banks are used to store the samples and intermediate data in the processor, and each memory bank can store N/4 words of length W. Thus, the total capacity of the required memory is N*W bits. The memories support dual-port access, so data can be read and written in the same clock cycle. When the input samples are written to the memory banks, the four multiplexers named "m0" select the input data path. After all the samples are stored, the four multiplexers named "m0" select the other input data path, which carries the intermediate results. The four multiplexers named "m11, m12, m13, m14" decide the input sequence of the memory banks, and the four multiplexers named "m21, m22, m23, m24" choose which memory banks the intermediate results are written to. The numbers at the multiplexer inputs correspond to the memory bank numbers. To achieve lower hardware usage and a simple logic design, the result output can share the read data path. While the result data are being read out, the reordering can be done at the same time, and the next frame of N-point samples can be input simultaneously.

[Fig. 3 block diagram: input data path, multiplexers m0, m11 to m14 and m21 to m24, memory banks MEM0 to MEM3, one PE with ports in0 to in3 and out0 to out3, twiddle ROM, and output result path.]
Fig. 3. Proposed high-level RFFT architecture with one PE.

In the proposed RFFT architecture, a new PE that can process four samples in parallel is built from two butterflies and two multiplexers, as shown in Fig. 4 (b). This choice of PE leads to lower complexity and high throughput. The multiplexers named "m31, m32" are used together to decide whether the multiplier is bypassed. There are four input ports named "in0, in1, in2, in3" and four output ports named "out0,
out1, out2, out3" in the PE. Two real-valued data from two of the memory banks and the twiddle factor from the ROM are multiplied in the complex multiplier. The traditional PE, which contains two butterflies and six multiplexers, is shown in Fig. 4 (a) [12], where the switch can be considered as two multiplexers. Compared with the traditional PE, the proposed PE achieves lower hardware usage and complexity.

Fig. 4. Comparison of the traditional PE (a) and the proposed PE (b) for RFFT.

A. Memory-Addressing Scheme

In this section, a new conflict-free memory-addressing scheme is detailed for the proposed RFFT processor with one PE. In the pipelined architecture [8], the intermediate data between the stages are stored in registers that act as first-in first-out (FIFO) buffers. The number of required registers in the pipelined architecture decreases from the first stage to the last stage. In the proposed RFFT architecture, the intermediate data are written and read in groups. The size of the group decreases from the first stage to the last stage, which is similar to the behavior of the pipelined RFFT architecture.

Fig. 5. Addressing scheme for 32-point RFFT with one PE.

An example of the addressing scheme for a 32-point RFFT computation is shown in Fig. 5, which shows how the data are stored in the memory banks at every stage. The first column corresponds to memory bank0, the second column corresponds to memory bank1, and so on. The samples are stored in natural order before the computation. After one computation of the PE, the four output data are written to the locations of the four input data. In this way, memory bank conflicts are avoided during the computation in the next stage, and our architecture achieves the lowest memory usage. The arrows show how the data are read from and written into the memory banks during the stages. For example, the data with indices (0, 8, 16, 24) are read out to the PE in the first stage and the four results are written to the same locations. When the data with indices (4, 12, 20, 28) are read out, the four results are written to the interleaved locations with indices (20, 28, 4, 12), according to the data arrangement in the second stage.

As the PE can process four samples in parallel, each stage takes N/4 clock cycles for an N-point RFFT computation. Thus, an (n-2)-bit counter (Wn-3, Wn-4, ..., W0) can be used to count the clock cycles at each stage, where n is the number of stages. The address of the memory banks can be expressed in terms of the counter at every stage, as shown in Table I. The read and write address patterns of memory bank0 and bank1 are the same for all stages and can be represented as Wn-3, Wn-4, ..., W0. The data are read and written in serial order from address location 0 to (N/4-1) at every stage for memory bank0 and bank1. The data in memory bank2 and bank3 are read and written in groups, which differs from bank0 and bank1. The size of the group goes down by a factor of 2 with every subsequent stage, as shown in Fig. 5. The addressing patterns for the second and third stages are formed from the same counter bits Wn-3, Wn-4, ..., W0 with selected bits inverted. Therefore, the read and write addresses of all the memory banks can be generated from a single counter by simply selecting the value or the inverse value of each bit.

TABLE I
ADDRESS PATTERNS FOR N-POINT RFFT COMPUTATION (n = log2N)

Counter: Wn-3, Wn-4, ..., W0

| Memory bank | Stage | Write/Read address |
|---|---|---|
| 0, 1 | All stages | Wn-3, Wn-4, ..., W0 |
| 2, 3 | Stage 1 / Stage 2 | Wn-3, Wn-4, ..., W0 / Wn-3, Wn-4, ..., W0 |
| 2, 3 | Stage 3 / …… | Wn-3, Wn-4, ..., W0 / …… |
| 2, 3 | Stage n-1 / Stage n | Wn-3, Wn-4, ..., W0 / 0 (only one cycle) |

B. Control of Multiplexers for Memory Bank Assignment

There are 2 multiplexers in the PE and 14 multiplexers in total in the proposed RFFT processor with one PE. The four multiplexers named "m0" select the input data path only while the input samples are being written to the memory banks. The four multiplexers named "m11, m12, m13, m14" decide the input sequence of the memory banks, from which the data are read out to the four input ports of the PE. For example, the multiplexer named "m11" can select the input data from either memory bank0 or bank2, and the multiplexer "m13" can select the input data from any of banks 1, 2, 3, as shown in Fig. 3.

The control of the four multiplexers "m11, m12, m13, m14" can be expressed with a counter at every stage, as shown in Table II. In the first stage, the four multiplexers route the data in memory banks 0, 2, 1, 3 to the "in0, in1, in2, in3" input ports of the PE, respectively. When the value of Wn-3 is zero in the second stage, the data in memory banks 0, 1, 2, 3 are read out to the "in0, in1, in2, in3" input ports of the PE, respectively. From the third to the (n-1)-th stage, there are three bank-select patterns according to the value of the counter in the different stages. The bit width of the corresponding counter field increases as the stage number increases, as shown in Table II. Since only one real addition is needed in the last stage, the result of the bottom complex adder is discarded. In the last stage, the data of memory banks 0, 1, 2, 3 are read out to the "in0, in1, in2, in3" input ports of the PE, respectively, for only one clock cycle. After the one addition in the last stage, the result can be unloaded to the output data path, and the next frame of data can be loaded for the N-point RFFT computation simultaneously.
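The first-stage read pattern described in this section can be illustrated with a short Python sketch. It assumes that "natural order" means bank b holds sample index b*N/4 + a at address a, which is consistent with indices (0, 8, 16, 24) sharing one address; the function and constant names are hypothetical, and only the first stage is modelled.

```python
def natural_order_banks(n_points):
    """Split sample indices 0..N-1 across four banks of N/4 words each.

    Assumption: bank b holds index b*N/4 + addr at address addr.
    """
    depth = n_points // 4
    return [[bank * depth + addr for addr in range(depth)] for bank in range(4)]

# Banks routed to in0, in1, in2, in3 in the first stage (from the text).
STAGE1_READ_SEQ = (0, 2, 1, 3)

def stage1_inputs(banks, addr):
    """Sample indices presented to in0..in3 when every bank is read at addr."""
    return [banks[bank][addr] for bank in STAGE1_READ_SEQ]

banks = natural_order_banks(32)
first_cycle = stage1_inputs(banks, 0)   # indices read at address 0
fifth_cycle = stage1_inputs(banks, 4)   # indices read at address 4
```

Each cycle touches every bank exactly once, so no two PE inputs compete for the same memory; the later stages additionally need the inverted counter bits and the bank sequences of Table II, which this sketch does not attempt to reproduce.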

TABLE II
CONTROL OF THE MULTIPLEXERS DEPENDING ON THE VALUE OF THE COUNTER IN DIFFERENT STAGES

| Stage | Part of the counter | m21, m22, m23, m24: value type | Write-bank seq. | m11, m12, m13, m14: value type | Read-bank seq. | m31, m32: multiplier is bypassed? |
|---|---|---|---|---|---|---|
| 1 | | | | any | 0,2,1,3 | No |
| 2 | Wn-3 | odd | 2,3,0,1 | zero | 0,1,2,3 | No |
| | | else | 0,1,2,3 | else | 2,0,3,1 | |
| 3 | Wn-3, Wn-4 | odd | 2,3,0,1 | zero | 0,1,2,3 | No |
| | | else | 0,1,2,3 | odd | 2,0,3,1 | |
| | | | | else | 0,2,1,3 | |
| …… | | odd | 2,3,0,1 | zero | 0,1,2,3 | No |
| | | else | 0,1,2,3 | odd | 2,0,3,1 | |
| | | | | else | 0,2,1,3 | |
| n-1 | Wn-3, Wn-4, ..., W0 | odd | 2,3,0,1 | zero | 0,1,2,3 | Yes |
| | | else | 0,1,2,3 | odd | 2,0,3,1 | |
| | | | | else | 0,2,1,3 | |
| n | | any | 0,1,2,3 | zero | 0,1,2,3 | Yes |

The four multiplexers named "m21, m22, m23, m24" select which memory banks the intermediate results from the output ports are written to. When the value of Wn-3 is odd, the intermediate results of the first stage from the "out0, out1, out2, out3" ports are written to memory banks 2, 3, 0, 1, respectively. When the value of Wn-3 is not odd, the results of the first stage are written to memory banks 0, 1, 2, 3, respectively. The two multiplexers "m31, m32" in the PE decide whether the complex multiplier is bypassed. From Fig. 3, we can see that the computations of the last two stages do not contain any multiplication, so the two multiplexers "m31, m32" bypass the multiplier.

C. Exploiting Parallelism with Multiple PEs

The parallelism of the proposed RFFT architecture can be exploited in two dimensions: pipelined architecture (parallelism across multiple stages) [14] and parallel architecture (parallelism within a stage) [13]. Fig. 6 shows the high-level RFFT architecture with m PEs, which requires 4m memory banks.

Fig. 6. Proposed high-level RFFT architecture with m PEs.

In the pipelined RFFT architectures, the computation of each stage is implemented by one PE and a group of four memory banks. The number of multiplexers increases proportionately with the number of PEs. If m PEs are employed, m memory groups of four banks are required, and each bank must store at least N/4 words. An example of a pipelined memory-based FFT architecture with four PEs, presented in [14], achieves low memory usage and avoids memory bank conflicts; it can serve as a reference. In these pipelined architectures, the number of PEs can be any suitable value between one and the number of stages.

The parallelism of our proposed RFFT architecture can also be exploited within a stage. In the parallel RFFT architectures with m PEs, each memory bank needs to store at least N/(4m) words of length W, and the total required memory can remain N*W bits. The number of required multiplexers increases proportionately, and the design of the control unit becomes more complex as the number of PEs increases. To achieve the lowest memory usage, memory bank conflicts must also be avoided during the computation. Compared with the pipelined architecture, the parallel architecture has the advantage of lower memory usage. The limitation is that the number of PEs (m) in the parallel architecture should be a power of 2. Otherwise, more memory would be used and the design of the control unit would become much more complex.

A new addressing scheme for 32-point RFFT computation on the processor with two PEs is proposed to achieve the lowest memory usage. At every stage of the RFFT computation, each memory bank in the proposed architecture with one PE, shown in Fig. 5, can be regarded as the combination of two memory banks in the architecture with two PEs, shown in Fig. 7. More generally, each memory bank in the proposed architecture with one PE can be regarded as the combination of m memory banks in the parallel RFFT architecture with m PEs, for all the stages. In this way, the proposed addressing scheme can be extended from the architecture with one PE to the parallel architectures with more PEs.

Fig. 7. Addressing scheme for 32-point RFFT processor with two PEs.

IV. COMPARISON AND EXPERIMENTAL RESULTS

In this section, we compare the hardware complexity and computation cycles of the proposed design with several previously proposed memory-based RFFT processors and a pipelined RFFT processor, as shown in Table III. A cost-effective memory-based RFFT processor based on the radix-4 algorithm is proposed in [11]; it requires a complex radix-4 butterfly consisting of 12 complex adders (CA) and 3 complex multipliers (CM). To further reduce the computing hardware, an in-place RFFT processor was recently proposed in [12], achieving the same number of computation cycles as that in [11]. The PE in [12] requires one complex multiplier, two complex adders and six multiplexers. Compared with [12], our proposed architecture achieves lower hardware usage and hardware complexity by using only two multiplexers in the PE. The proposed architecture is based on the new strategy of RFFT stage partition, which further reduces the computation clock cycles. For an N-point RFFT computation, the design in [12] with one PE requires N/4*log2N computation cycles, while our RFFT processor with one PE requires only N/4*(log2N-1)+1.

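TABLE III
COMPARISON OF THE RFFT PROCESSORS

| Design | Radix | CA | CM | Memory per bank | Memory banks |
|---|---|---|---|---|---|
| Proposed (1 PE) | Radix-2 | 2 | 1 | N/4 * W | 4 |
| Proposed (2 PE) | Radix-2 | 4 | 2 | N/8 * W | 8 |
| [12] | Radix-2 | 2 | 1 | N/4 * W | 4 |
| [11] | Radix-4 | 12 | 3 | N/8 * 2W | 4 |
| [9] | Radix-2 | 2 | 1 | N/4 * 2W | 4 |
| [9] | Radix-4 | 8 | 3 | N/4 * 2W | 4 |
| [15] | Radix-2/4 | 8 | 3 | N/4 * 2W | 4 |
| [13] | Radix-2 | 4log4N | 2(log4N-1) | N/4 * W | 4 |

Each cell below lists #Cycle / AT.

| N | Proposed (1 PE) | Proposed (2 PE) | [12] | [11] | [9] Radix-2 | [9] Radix-4 | [15] | [13] |
|---|---|---|---|---|---|---|---|---|
| 256 | 449 / 5388 | 225 / 5400 | 512 / 6144 | 512 / 21504 | 1024 / 12288 | 256 / 9728 | 256 / 9728 | 64 / 4864 |
| 512 | 1025 / 12300 | 513 / 12312 | 1152 / 13824 | 768 / 32256 | 2304 / 27648 | 576 / 21888 | 640 / 24320 | 128 / 11264 |
| 1024 | 2305 / 27660 | 1153 / 27672 | 2560 / 30720 | 2560 / 107520 | 5120 / 61440 | 1280 / 48640 | 1280 / 48640 | 256 / 25600 |
| 2048 | 5121 / 61452 | 2561 / 61464 | 5632 / 67584 | 3854 / 161868 | 11264 / 135168 | 2816 / 107008 | 3072 / 116736 | 512 / 57344 |
| 4096 | 11265 / 135180 | 5633 / 135192 | 12288 / 147456 | 12288 / 516096 | 24576 / 294912 | 6144 / 233472 | 6144 / 233472 | 1024 / 126976 |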
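The one-PE cycle counts, AT products, and gap ratios quoted in this section can be cross-checked with a few lines of Python. This is only a sketch: the function names are illustrative and not from the paper, the cycle formulas are the ones stated for the one-PE designs, the AT weighting follows the text's assumption that one complex multiplier costs ten complex adders, and the pipelined AT values are simply copied from the [13] column of Table III.

```python
def log2_int(n_points):
    """Exact log2 for a power-of-two transform size."""
    return n_points.bit_length() - 1

def cycles_in_place(n_points):
    """Cycles stated for the one-PE in-place design [12]: N/4 * log2(N)."""
    return n_points // 4 * log2_int(n_points)

def cycles_proposed(n_points):
    """Cycles stated for the proposed one-PE design: N/4 * (log2(N) - 1) + 1."""
    return n_points // 4 * (log2_int(n_points) - 1) + 1

def at_product(cycles, complex_adders, complex_multipliers, multiplier_weight=10):
    """AT product in units of one complex-adder area."""
    return cycles * (complex_adders + multiplier_weight * complex_multipliers)

def reduction_percent(n_points):
    """Cycle reduction of the proposed design relative to [12], in percent."""
    return (1 - cycles_proposed(n_points) / cycles_in_place(n_points)) * 100

def gap_ratio_percent(at_proposed, at_pipelined):
    """Gap ratio (AT_proposed - AT_pipelined) / AT_pipelined, in percent."""
    return (at_proposed - at_pipelined) / at_pipelined * 100

# AT values listed for the pipelined design [13] in Table III.
PIPELINED_AT = {256: 4864, 4096: 126976}

at_256 = at_product(cycles_proposed(256), complex_adders=2, complex_multipliers=1)
at_4096 = at_product(cycles_proposed(4096), complex_adders=2, complex_multipliers=1)
gap_256 = gap_ratio_percent(at_256, PIPELINED_AT[256])
gap_4096 = gap_ratio_percent(at_4096, PIPELINED_AT[4096])
```

With these definitions, the N = 256 and N = 4096 rows of the "Proposed (1 PE)" and "[12]" columns are reproduced exactly, and the gap ratios round to the 10.8% and 6.5% figures reported in the text.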

When N is 32, the proposed RFFT architecture reduces the computation cycles by 17.5% compared with that in [12]. We calculate the AT product by multiplying the computation cycles by the data-path area, with the complex adder area as the unit (it is fair to assume that the area of a complex multiplier is ten times the area of a complex adder). Compared with the other memory-based architectures, the novel RFFT architecture achieves fewer computation cycles and better utilization of the PEs. From Table III, it can be concluded that the proposed design achieves the best AT product among the memory-based architectures. Furthermore, the parallelism of the proposed RFFT architecture can be exploited across multiple stages and within a stage. The total number of computation clock cycles decreases approximately linearly as the number of PEs increases.

The number of PEs in the pipelined RFFT processor [13], which can process four samples in parallel, increases with the size of the RFFT. Since the design of the PEs, multiplexers and control units in the pipelined processor can vary from stage to stage, its hardware can be well optimized. Thus, the pipelined processor achieves high hardware utilization in terms of the AT product. The gap ratio (GR) of the AT product between the proposed architecture and the pipelined architecture can be calculated with equation (3):

$$GR = \frac{AT_{proposed} - AT_{pipelined}}{AT_{pipelined}} \times 100\% \tag{3}$$

The AT product of the proposed processor with one PE approaches that of the pipelined processor: the gap ratio decreases from 10.8% to 6.5% as the FFT size N increases from 256 to 4096. Compared with the pipelined architecture [13], the proposed architectures achieve lower hardware cost with one or several PEs, which makes them more suitable for large-point RFFT computation in low- and moderate-speed applications.

Moreover, one clear advantage of the proposed RFFT architecture is that the capacity of the required memory is reduced by a factor of 2 compared with the traditional memory-based complex FFT processors in [9], [15].

Table IV presents the synthesis results obtained for the implementation of the proposed processors with the ISE 14.5 tool on a Xilinx Virtex-7 field-programmable gate array, the XC7VX485T. The floating-point samples and twiddle factors are stored in the Block RAMs, each of which can store at most 1K words. The DSP48E blocks are used to implement the floating-point multipliers and adders. As expected, the hardware cost of the processor with one PE is about half of that of the processor with two PEs.

TABLE IV
EXPERIMENTAL RESULT FOR THE PROPOSED N-POINT PROCESSORS

| N | PE | Slice LUT | Slice Register | DSP48E | RAM* | Freq. (MHz) |
|---|---|---|---|---|---|---|
| 1024 | 1 | 2840 | 2981 | 24 | 5 | 423.1 |
| 4096 | 1 | 2863 | 2992 | 24 | 8 | 410.2 |
| 1024 | 2 | 5545 | 5907 | 48 | 10 | 414.5 |
| 4096 | 2 | 5580 | 5937 | 48 | 16 | 409.6 |

*The Block RAM count includes both the data and the twiddle-factor storage.

ACKNOWLEDGMENT

The authors would like to thank the associate editor and all the anonymous reviewers for their numerous constructive comments.

REFERENCES

[1] Song-Nien, T., J. Fu-Chiang, et al., "Multimode Memory-Based FFT Processor for Wireless Display FD-OCT Medical Systems," Circuits and Systems I: Regular Papers, IEEE Transactions on 61(12): 3394-3406. 2014.
[2] Luo, H. F., Y. J. Liu, et al., "Efficient Memory-Addressing Algorithms for FFT Processor Design," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on. pp(99): 1-1. 2014.
[3] Sekhar, B. R. and K. M. M. Prabhu, "Radix-2 decimation-in-frequency algorithm for the computation of the real-valued FFT," Signal Processing, IEEE Transactions on 47(4): 1181-1184. 1999.
[4] Ayinala, M. and K. K. Parhi, "Parallel-pipelined radix-2^2 FFT architecture for real-valued signals," Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on. 2010.
[5] Ayinala, M. and K. K. Parhi, "FFT Architectures for Real-Valued Signals Based on Radix-2^3 and Radix-2^4 Algorithms," Circuits and Systems I: Regular Papers, IEEE Transactions on 60(9): 2422-2430. 2013.
[6] Sorensen, H. V., D. L. Jones, et al., "Real-valued fast Fourier transform algorithms," Acoustics, Speech and Signal Processing, IEEE Transactions on 35(6): 849-863. 1987.
[7] Murakami, H., "Real-valued decimation-in-time and decimation-in-frequency algorithms," Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on 41(12): 808-816. 1994.
[8] Salehi, S. A., R. Amirfattahi, et al., "Pipelined Architectures for Real-Valued FFT and Hermitian-Symmetric IFFT With Real Datapaths," Circuits and Systems II: Express Briefs, IEEE Transactions on 60(8): 507-511. 2013.
[9] Pei-Yun Tsai, Chung-Yi Lin, "A Generalized Conflict-Free Memory Addressing Scheme for Continuous-Flow Parallel-Processing FFT Processors With Rescheduling," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 19(12): 2290-2302. 2011.
[10] Wang, A. and A. P. Chandrakasan, "Energy-aware architectures for a Real-Valued FFT implementation," Low Power Electronics and Design, 2003. ISLPED '03. Proceedings of the 2003 International Symposium on. 2003.
[11] Hsiang-Feng, C. and L. Zhao-Hong, "A cost-effective memory-based real-valued FFT and Hermitian symmetric IFFT processor for DMT-based wire-line transmission systems," Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on. 2005.
[12] Ayinala, M., L. Yingjie, et al., "An In-Place FFT Architecture for Real-Valued Signals," Circuits and Systems II: Express Briefs, IEEE Transactions on 60(10): 652-656. 2013.
[13] M. Garrido, K. K. Parhi, and J. Grajal, "A pipelined FFT architecture for real-valued signals," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 12, pp. 2634-2643, Dec. 2009.
[14] ZhenGuo, Ma et al., "An efficient radix-2 fast Fourier transform processor with ganged butterfly engines on field programmable gate arrays," Journal of ZheJiang University-science C-Computers & electronics 12(4): 323. 2011.
[15] Jo, B. G. and M. H. Sunwoo, "New continuous-flow mixed-radix (CFMR) FFT Processor using novel in-place strategy," Circuits and Systems I: Regular Papers, IEEE Transactions on 52(5): 911-919. 2005.
