
J Sign Process Syst (2011) 64:1–16

DOI 10.1007/s11265-010-0477-6

A Flexible LDPC/Turbo Decoder Architecture


Yang Sun · Joseph R. Cavallaro

Received: 21 November 2009 / Revised: 11 March 2010 / Accepted: 12 March 2010 / Published online: 9 April 2010
© Springer Science+Business Media, LLC 2010

Abstract  Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo code decoding. We view the LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo code decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm2. The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Running at a 500 MHz clock frequency, the decoder can sustain up to 600 Mbps LDPC decoding or 450 Mbps Turbo decoding.

Keywords  SISO decoder · LDPC decoder · Turbo decoder · Error correcting codes · MAP algorithm · Reconfigurable architecture

Y. Sun (B) · J. R. Cavallaro
Department of Electrical and Computer Engineering, Rice University, 6100 Main Street, Houston, TX 77005, USA
e-mail: [email protected]

J. R. Cavallaro
e-mail: [email protected]

1 Introduction

Practical wireless communication channels are inherently "noisy" due to the impairments caused by channel distortions and multipath effects. Error correcting codes are widely used to increase the bandwidth and energy efficiency of wireless communication systems. As a core technology in wireless communications, forward error correction (FEC) coding has migrated from basic convolutional/block codes to more powerful Turbo codes and LDPC codes. Turbo codes, introduced by Berrou et al. in 1993 [4], have been employed in 3G and beyond-3G wireless systems, such as UMTS/WCDMA and 3GPP Long-Term Evolution (LTE) systems. As a candidate 4G coding scheme, LDPC codes, which were introduced by Gallager in 1963 [13], have recently received significant attention in coding theory and have been adopted by some advanced wireless systems such as the IEEE 802.16e WiMAX system and the IEEE 802.11n WLAN system. In future 4G networks, inter-networking and roaming between different networks would require a multi-standard FEC decoder. Since Turbo codes and LDPC codes are widely used in many different 3G/4G systems, it is important to design a configurable decoder to support multiple FEC coding schemes.

In the literature, many efficient LDPC decoder VLSI architectures have been studied [6, 9, 12, 14, 18, 24, 27, 29, 35, 37, 39, 45, 47]. Turbo decoder VLSI architectures have also been extensively investigated by many researchers [5, 8, 20, 21, 25, 30, 33, 41, 44]. However, designing a flexible decoder to support both LDPC and Turbo codes still remains very challenging. In this paper, we aim to provide an alternative to dedicated silicon that reduces the cost of supporting both LDPC and Turbo codes with a small additional overhead. We propose a flexible decoder architecture to meet the needs of a multi-standard FEC decoder.

From a theoretical point of view, there are some similarities between LDPC and Turbo codes. They can both be represented as codes on graphs which define the constraints satisfied by codewords. Both families of codes are decoded in an iterative manner by employing the sum-product or belief propagation algorithm. For example, MacKay has related these two codes by treating a Turbo code as a low-density parity-check code [23]. On the other hand, a few other researchers have tried to treat an LDPC code as a Turbo code and apply a turbo-like message passing algorithm to LDPC codes. For example, Mansour and Shanbhag [24] introduce an efficient turbo message passing algorithm for architecture-aware LDPC codes. Hocevar [18] proposes a layered decoding algorithm which treats the parity check matrix as horizontal layers and passes the soft information between layers to improve the performance. Zhu and Chakrabarti [50] looked at super-code based LDPC construction and decoding. Zhang and Fossorier [46] suggest a shuffled belief propagation algorithm to achieve a faster decoding speed. Lu and Moura [22] propose to partition the Tanner graph into several trees and apply a turbo-like decoding algorithm in each tree for a faster convergence rate. Dai et al. [12] introduce a turbo-sum-product hybrid decoding algorithm for quasi-cyclic (QC) LDPC codes by splitting the parity check matrix into two sub-matrices between which the information is exchanged.

In our early work [38], we proposed a super-code based decoding algorithm for LDPC codes. In this paper, we extend this algorithm and present a more generic message passing algorithm for LDPC and Turbo decoding, and then exploit the architectural commonalities between LDPC and Turbo decoders. We create a connection between LDPC and Turbo codes by applying a super-code based decoding algorithm, where a code is divided into multiple super-codes and the decoding operation is performed by iteratively exchanging soft information between the super-codes. In LDPC decoding, we treat an LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the maximum a posteriori (MAP) algorithm can be efficiently performed. In Turbo decoding, we modify the traditional message passing flow so that the proposed super-code based decoding scheme works for Turbo codes as well.

The contributions of this paper are as follows. First, we introduce a flexible soft-input soft-output (Flex-SISO) module for LDPC and Turbo code decoding. Second, we introduce an area-efficient flexible functional unit (FFU) for implementing the MAP algorithm in hardware. Third, we propose a flexible SISO decoder hardware architecture based on the FFU. Finally, we show how to enable parallel decoding by using multiple such Flex-SISO decoders.

The remainder of the paper is organized as follows. Section 2 reviews the super-code based decoding algorithm for LDPC codes. Section 3 presents a Flex-SISO module for LDPC/Turbo decoding. Section 4 introduces a flexible functional unit (FFU) for LDPC and Turbo decoding. Based on the FFU, Section 5 describes a dual-mode Flex-SISO decoder architecture. Section 6 presents a parallel decoder architecture using multiple Flex-SISO cores. Section 7 compares our flexible decoder with existing decoders in the literature. Finally, Section 8 concludes the paper.

2 Review of Super-code Based Decoding Algorithm for LDPC Codes

By definition, a Turbo code is a parallel concatenation of two super-codes, where each super-code is a constituent convolutional code. Naturally, the Turbo decoding procedure can be partitioned into two phases, where each phase corresponds to the processing of one super-code. Similarly, LDPC codes can also be partitioned into super-codes for efficient processing, as previously mentioned in Section 1. Before proceeding with a discussion of the proposed flexible decoder architecture, it is desirable to review the super-code based LDPC decoding scheme in this section.

2.1 Trellis Structure for LDPC Codes

A binary LDPC code is a linear block code specified by a very sparse binary M × N parity check matrix H such that

H · x^T = 0,   (1)

where x is a codeword (x ∈ C). H can be viewed as a bipartite graph where each column and each row of H represent a variable node and a check node, respectively.
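Equation (1) is also what a decoder evaluates to test whether a tentative hard decision is already a valid codeword (for example, as an early-termination check). A minimal sketch in C, assuming a sparse row-index representation of H; the array names are illustrative, not from the paper:

```c
#include <stdint.h>

/* Checks H * x^T = 0 over GF(2). H is stored row by row as lists of the
 * column indices of its nonzero entries, the natural representation for a
 * sparse parity check matrix. Returns 1 if x satisfies every check. */
int syndrome_is_zero(const int *row_idx, const int *row_deg, int M,
                     const uint8_t *x)
{
    int offset = 0;
    for (int m = 0; m < M; m++) {              /* one parity check per row  */
        uint8_t parity = 0;
        for (int j = 0; j < row_deg[m]; j++)
            parity ^= x[row_idx[offset + j]];  /* XOR = addition in GF(2)   */
        if (parity)
            return 0;                          /* an unsatisfied check      */
        offset += row_deg[m];
    }
    return 1;
}
```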

Figure 1 Trellis representation for LDPC codes, where a two-state trellis diagram (x1 + x2 + ... + xj = 0) is associated with each check node.

Figure 3 A block-structured parity check matrix, where each block row (or layer) defines a super-code. Each sub-matrix of the parity check matrix is either a zero matrix or a z × z cyclically shifted identity matrix.

Each element of the parity check matrix is either a zero or a one, and the nonzero elements are typically placed at random positions to achieve good performance. The number of nonzero elements in each row or each column of the parity check matrix is called the check node degree or the variable node degree, respectively. A regular LDPC code has constant check node and variable node degrees, whereas an irregular LDPC code has varying check node and variable node degrees.

The full trellis structure of an LDPC code is enormously large, and it is impractical to apply the MAP algorithm on the full trellis. Alternatively, however, an (N, N−M) LDPC code can be viewed as M parallel concatenated single parity check codes. Although the performance of a single parity check code is poor, when many of them are sparsely connected they become a very strong code. Figure 1 shows a trellis representation for LDPC codes where a single parity check code is considered as a low-weight two-state trellis, starting at state 0 and ending at state 0.

2.2 Layered Message Passing Algorithm for LDPC Codes

The main idea behind layered LDPC decoding is essentially the Turbo message passing algorithm [24]. It has been shown that the layered message passing algorithm can achieve a faster convergence rate than the standard two-phase message-passing algorithm for structured LDPC codes [18, 24]. To be more general, we can divide the factor graph of an LDPC code into several sub-graphs [38], as illustrated in Fig. 2. Each sub-graph corresponds to a super-code. If we restrict each sub-graph to be loop-free, then each super-code has a simpler trellis structure so that the MAP algorithm can be efficiently performed.

Figure 2 Dividing a factor graph into sub-graphs.

As a special example, the block-structured quasi-cyclic (QC) LDPC codes used in many practical communication systems such as 802.16e and 802.11n can be easily decomposed into several super-codes. As shown in Fig. 3, a block-structured parity check matrix can be viewed as a 2-D array of square sub-matrices. Each sub-matrix is either a zero matrix or a z-by-z cyclically shifted identity matrix Iz(x) with a random shift value x. The parity check matrix can thus be viewed as a concatenation of n super-codes, where each block row, or layer, defines a super-code. In the layered message passing algorithm, soft information generated by one super-code can be used immediately by the following super-codes, which leads to a faster convergence rate [24].
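For the QC-LDPC codes of Fig. 3, the super-codes are simply the block rows, and each check row inside a layer connects to one variable node per nonzero sub-matrix. The sketch below makes this concrete; it assumes the base matrix is stored as shift values with −1 marking all-zero blocks and that row r of Iz(x) has its single 1 in column (r + x) mod z — common conventions, not details stated in the paper:

```c
/* Lists, for check row r of one layer, the variable-node indices taking
 * part in that single parity check. base_row[b] is the cyclic shift of the
 * z x z sub-matrix in block column b of this layer, or -1 for a zero block.
 * Returns the check node degree. */
int layer_check_neighbors(const int *base_row, int num_block_cols, int z,
                          int r, int *var_idx)
{
    int deg = 0;
    for (int b = 0; b < num_block_cols; b++) {
        int shift = base_row[b];
        if (shift < 0)
            continue;                          /* zero sub-matrix: no edge  */
        /* row r of I_z(shift) has its 1 in column (r + shift) mod z        */
        var_idx[deg++] = b * z + (r + shift) % z;
    }
    return deg;
}
```

The z check rows of a layer share no variable nodes with each other, which is why they can be processed simultaneously, as exploited later in Section 3.2.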

3 Flexible SISO Module

In this section, we propose a flexible soft-input soft-output (SISO) module, named the Flex-SISO module, to decode LDPC and Turbo codes. The SISO module is based on the MAP algorithm [3]. To reduce complexity, the MAP algorithm is usually calculated in the log domain [31]. In this paper, we assume the MAP algorithm is always calculated in the log domain.

The decoding algorithm underlying the Flex-SISO module works for codes which have trellis representations. For LDPC codes, a Flex-SISO module is used to decode a super-code. For Turbo codes, a Flex-SISO module is used to decode a component convolutional code. One pass of the Flex-SISO module is called a sub-iteration, and thus one full iteration contains n sub-iterations.

3.1 Flex-SISO Module

Figure 4 depicts the proposed Flex-SISO module. The output of the Flex-SISO module is the a posteriori probability (APP) log-likelihood ratio (LLR) values, denoted as λo(u), for the information bits. It should be noted that the Flex-SISO module exchanges the soft values λo(u) instead of the extrinsic values in the iterative decoding process. The extrinsic values, denoted as λe(u), are stored in a local memory of the Flex-SISO module. To distinguish the extrinsic values generated at different sub-iterations, we use λe(u; old) and λe(u; new) to represent the extrinsic values generated in the previous sub-iteration and the current sub-iteration, respectively. The soft input values λi(u) are the outputs from the previous Flex-SISO module, or from other preceding modules if necessary. Another input to the Flex-SISO module is the channel values for the parity bits, denoted as λc(p), if available. For LDPC codes, we do not distinguish information and parity bits, and all the codeword bits are treated as information bits. However, in the case of Turbo codes, we treat information and parity bits separately. Thus the input port λc(p) is not used when decoding LDPC codes. At each sub-iteration, the old extrinsic values, denoted as λe(u; old), are retrieved from the local memory and must be subtracted from the soft input values λi(u) to avoid positive feedback.

Figure 4 Flex-SISO module (inputs: soft values λi(u) for information bits, channel values λc(p) for parity bits, old extrinsic values λe(u; old) from memory; outputs: APP values λo(u) and new extrinsic values λe(u; new)).

A generic description of the message passing algorithm is as follows. Multiple Flex-SISO modules are connected in series to form an iterative decoder. First, the Flex-SISO module receives the soft values λi(u) from upstream Flex-SISO modules and the channel values (for parity bits) λc(p) if available. The λi(u) can be thought of as the sum of the channel value λc(u) (for the information bit) and all the extrinsic values λe(u) previously generated by all the super-codes:

λi(u) = λc(u) + λe(u).   (2)

Note that prior to the iterative decoding, λi(u) should be initialized with λc(u). Next, the old extrinsic value λe(u; old) generated by this Flex-SISO module in the previous iteration is subtracted from λi(u) as follows:

λt(u) = λi(u) − λe(u; old).   (3)

Then, the new extrinsic value λe(u; new) can be computed using the MAP algorithm based on λt(u), and λc(p) if available. Finally, the APP value is updated as

λo(u) = λi(u) − λe(u; old) + λe(u; new).   (4)

This updated APP value is then passed to the downstream Flex-SISO modules. This computation repeats in each sub-iteration.

3.2 Flex-SISO Module to Decode LDPC Codes

In this section, we show how to use the Flex-SISO module to decode LDPC codes. Because QC-LDPC codes are widely used in many practical systems, we will primarily focus on QC-LDPC codes. First, we decompose a QC-LDPC code into multiple super-codes, where each layer of the parity check matrix defines a super-code. After the layered decomposition, each super-code comprises z independent two-state single parity check codes. Figure 5 shows the super-code based, or layered, LDPC decoder architecture using the Flex-SISO modules. The decoder parallelism at each Flex-SISO module is at the level of the sub-matrix size z, because these z single parity check codes have no data dependency and can thus be processed simultaneously. This architecture differs from the regular two-phase LDPC decoder in that a code is partitioned into multiple sections, and each section is processed by the same processor. The convergence rate can be twice as fast as that of a regular decoder [18].

Figure 5 LDPC decoding using Flex-SISO modules, where an LDPC code is decomposed into n super-codes and n Flex-SISO modules are connected in series to decode them.
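A compact behavioral view of one sub-iteration (Eqs. 2–4), with the MAP processing abstracted behind a callback; the function and buffer names are illustrative only, not the paper's implementation:

```c
#define MAX_BITS 8192               /* illustrative upper bound            */

/* One Flex-SISO sub-iteration. app[] holds the soft values lambda_i /
 * lambda_o exchanged between modules, ext[] holds this module's stored
 * extrinsic values; map_extrinsic() stands for the MAP processing of the
 * super-code handled by this module. */
void flex_siso_subiteration(float *app, float *ext, int n_bits,
                            void (*map_extrinsic)(const float *lt,
                                                  float *le_new, int n))
{
    static float lt[MAX_BITS], le_new[MAX_BITS];

    for (int k = 0; k < n_bits; k++)
        lt[k] = app[k] - ext[k];               /* Eq. 3: remove old extrinsic */

    map_extrinsic(lt, le_new, n_bits);         /* new extrinsic of this code  */

    for (int k = 0; k < n_bits; k++) {
        app[k] = lt[k] + le_new[k];            /* Eq. 4: updated APP value    */
        ext[k] = le_new[k];                    /* keep for the next iteration */
    }
}
```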

Figure 6 LDPC decoder architecture based on the Flex-SISO module (APP memory, LDPC MAP processor, and extrinsic memory; λc(p) = 0).

Since the data flow is the same between different sub-iterations, one physical Flex-SISO module is instantiated and re-used at each sub-iteration, which leads to a partial-parallel decoder architecture. Figure 6 shows an iterative LDPC decoder hardware architecture based on the Flex-SISO module. The structure comprises an APP memory to store the soft APP values, an extrinsic memory to store the extrinsic values, and a MAP processor to implement the MAP algorithm for z single parity check codes. Prior to the iterative decoding process, the APP memory is initialized with the channel values λc(u), and the extrinsic memory is initialized with 0.

The decoding flow is summarized as follows. It should be noted that the parity bits are treated as information bits for the decoding of LDPC codes. We use the symbol uk to represent the k-th data bit in the codeword. For check node m, we use the symbol um,k to denote the k-th codeword bit (or variable node) that is connected to this check node m. To remove correlations between iterations, the old extrinsic message is subtracted from the soft input message to create a temporary message λt as follows:

λt(um,k) = λi(uk) − λe(um,k; old),   (5)

where λi(uk) is the soft input log-likelihood ratio (LLR) and λe(um,k; old) is the old extrinsic value generated by this MAP processor in the previous iteration. Then the new extrinsic value can be computed as:

λe(um,k; new) = ⊞_{j: j ≠ k} λt(um,j),   (6)

where the ⊞ operation is associative and commutative, and is defined as [15]

λ(u1) ⊞ λ(u2) = log[(1 + e^{λ(u1)} e^{λ(u2)}) / (e^{λ(u1)} + e^{λ(u2)})].   (7)

Finally, the new APP value is updated as:

λo(uk) = λt(um,k) + λe(um,k; new).   (8)

For each sub-iteration l, Eqs. (5)–(8) can be executed in parallel for check nodes m = lz to lz + z − 1 because there is no data dependency between them.

3.3 Flex-SISO Module to Decode Turbo Codes

In this section, we show how to use the Flex-SISO module to decode Turbo codes. A Turbo code can be naturally partitioned into two super-codes, or constituent codes. In a traditional Turbo decoder, where the extrinsic messages are exchanged between the two super-codes, the Flex-SISO module cannot be directly applied, because the Flex-SISO module requires the APP values, rather than the extrinsic values, to be exchanged between super-codes. In this section, we make a small modification to the traditional Turbo decoding flow so that the APP values are exchanged in the decoding procedure.

3.3.1 Review of the Traditional Turbo Decoder Structure

The traditional Turbo decoding procedure with two SISO decoders is shown in Fig. 7. The definitions of the symbols in the figure are as follows. The information bit and the parity bits at time k are denoted as uk and (p_k^(1), p_k^(2), ..., p_k^(n)), respectively, with uk, p_k^(i) ∈ {0, 1}. The channel LLR values for uk and p_k^(i) are denoted as λc(uk) and λc(p_k^(i)), respectively. The a priori LLR, the extrinsic LLR, and the APP LLR for uk are denoted as λa(uk), λe(uk), and λo(uk), respectively.

Figure 7 Traditional Turbo decoding procedure using two SISO decoders, where the extrinsic LLR values are exchanged between the two SISO decoders through the interleaver Π and de-interleaver Π^{-1}.

In the decoding process, the SISO decoder computes the extrinsic LLR value at time k as follows:

λe(uk) = max*_{u: uk = 1} {α_{k−1}(s_{k−1}) + γ^e_k(s_{k−1}, s_k) + β_k(s_k)} − max*_{u: uk = 0} {α_{k−1}(s_{k−1}) + γ^e_k(s_{k−1}, s_k) + β_k(s_k)}.   (9)

The α and β metrics are computed based on the forward and backward recursions:

α_k(s_k) = max*_{s_{k−1}} {α_{k−1}(s_{k−1}) + γ_k(s_{k−1}, s_k)}   (10)

β_k(s_k) = max*_{s_{k+1}} {β_{k+1}(s_{k+1}) + γ_k(s_k, s_{k+1})},   (11)

where the branch metric γ_k is computed as:

γ_k = uk · (λc(uk) + λa(uk)) + Σ_{i=1}^{n} p_k^(i) · λc(p_k^(i)).   (12)

The extrinsic branch metric γ^e_k in Eq. 9 is computed as:

γ^e_k = Σ_{i=1}^{n} p_k^(i) · λc(p_k^(i)).   (13)

The max*(·) function in Eqs. 9–11 is defined as:

max*(a, b) = max(a, b) + log(1 + e^{−|a−b|}).   (14)

The soft APP value for uk is generated as:

λo(uk) = λe(uk) + λa(uk) + λc(uk).   (15)

In the first half iteration, SISO decoder 1 computes the extrinsic value λ^1_e(uk) and passes it to SISO decoder 2. Thus, the extrinsic value computed by SISO decoder 1 becomes the a priori value λ^2_a(uk) for SISO decoder 2 in the second half iteration. The computation is repeated in each iteration. The iterative process is usually terminated after a certain number of iterations, when the soft APP value λo(uk) converges.

3.3.2 Modified Turbo Decoder Structure Using Flex-SISO Modules

In order to use the proposed Flex-SISO module for Turbo decoding, we modify the traditional Turbo decoder structure. Figure 8 shows the modified Turbo decoder structure based on the Flex-SISO modules. It should be noted that the modified Turbo decoding flow is mathematically equivalent to the original Turbo decoding flow, but uses a different message passing method. The modified data flow is as follows. In the first half iteration, Flex-SISO decoder 1 receives the soft LLR value λ^1_i(uk) from Flex-SISO decoder 2 through de-interleaving (λ^1_i(uk) is initialized to the channel value λc(uk) prior to decoding). It then removes the old extrinsic value λ^1_e(uk; old) from the soft input LLR λ^1_i(uk) to form a temporary message λ^1_t(uk) as follows (for brevity, we drop the superscript "1" in the following equations):

λt(uk) = λi(uk) − λe(uk; old).   (16)

To relate this to the traditional Turbo decoder structure, this temporary message is mathematically equal to the sum of the channel value λc(uk) and the a priori value λa(uk) in Fig. 7:

λt(uk) = λc(uk) + λa(uk).   (17)

Thus, the branch metric calculation in Eq. 12 can be re-written as:

γ_k = uk · λt(uk) + Σ_{i=1}^{n} p_k^(i) · λc(p_k^(i)).   (18)

The extrinsic branch metric (γ^e_k) calculation and the extrinsic LLR (λe(uk)) calculation, however, remain the same as Eqs. 13 and 9–11, respectively. Finally, the soft APP LLR output is computed as:

λo(uk) = λt(uk) + λe(uk; new).   (19)
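The max*(·) operation of Eq. 14 is the arithmetic core of the recursions in Eqs. 10 and 11, and of the ACSA unit discussed later in Section 4.1. A floating-point sketch for reference (the hardware replaces the log term with a small look-up table, see Section 4.2):

```c
#include <math.h>

/* max*(a, b) of Eq. 14: exact max plus a bounded correction term. */
static double max_star(double a, double b)
{
    return fmax(a, b) + log1p(exp(-fabs(a - b)));
}

/* One forward update of Eq. 10 for a state with two incoming branches:
 * previous state metrics a0, a1 and branch metrics g0, g1. The backward
 * update of Eq. 11 has exactly the same structure. */
static double acsa(double a0, double g0, double a1, double g1)
{
    return max_star(a0 + g0, a1 + g1);
}
```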

Figure 8 Modified Turbo decoding procedure using two Flex-SISO modules. The soft LLR values are exchanged between the two SISO modules.

In the Flex-SISO based iterative decoding procedure, the soft outputs λ^1_o(u) computed by Flex-SISO decoder 1 are passed to Flex-SISO decoder 2 so that they become the soft inputs λ^2_i(u) for Flex-SISO decoder 2 in the second half iteration. The computation is repeated in each half iteration until the iterations converge. Since the operations are identical between the two sub-iterations, only one physical Flex-SISO module is instantiated, and it is re-used for the two sub-iterations.

Figure 9 shows an iterative Turbo decoder architecture based on the Flex-SISO module. The architecture is very similar to the LDPC decoder architecture shown in Fig. 6. The main differences are: 1) the Turbo decoder has separate parity channel LLR inputs, whereas the LDPC decoder treats parity bits as information bits; 2) the Turbo decoder employs the MAP algorithm on an N-state trellis, whereas the LDPC decoder applies the MAP algorithm on z independent two-state trellises; and 3) the interleaver/permuter structures are different (not shown in the figures). Despite these differences, there are certain important commonalities. The message passing flows are the same. The memory organizations are similar, but with a variety of sizes depending on the codeword length. The MAP processors, which will be described in the next section, have similar functional unit resources that can be configured using multiplexors for each algorithm. Thus, it is natural to design a unified SISO decoder with configurable MAP processors to support both LDPC and Turbo codes.

Figure 9 Turbo decoder architecture based on the Flex-SISO module.

4 Design of a Flexible Functional Unit

The MAP processor is the main processing unit in both LDPC and Turbo decoders, as depicted in Fig. 6 and Fig. 9. In this section, we introduce a flexible functional unit to decode LDPC and Turbo codes with a small additional overhead.

4.1 MAP Functional Unit for Turbo Codes

In a Turbo MAP processor, the critical path lies in the state metric calculation unit, which is often referred to as the add-compare-select-add (ACSA) unit. As depicted in Fig. 10, for each state m of the trellis, the decoder needs to perform an ACSA operation as follows:

α'_0 = max*(α_0 + γ_0, α_1 + γ_1),   (20)

where α_0 and α_1 are the previous state metrics, and γ_0 and γ_1 are the branch metrics. Figure 10b shows a circuit implementation for the ACSA unit, where a signed-input look-up table "LUT-S" is used to implement the non-linear function log(1 + e^{−|x|}). This circuit can be used to recursively compute the forward and backward state metrics based on Eqs. 10 and 11.

Figure 10 Turbo ACSA structure. (a) Flow of the state metric calculation. (b) Circuit diagram for the Turbo ACSA unit.

4.2 MAP Functional Unit for LDPC Codes

In the layered QC-LDPC decoding algorithm, each super-code comprises z independent single parity check codes. Each single parity check code can be viewed as a terminated two-state convolutional code. Figure 11 shows an example of the trellis structure for a single parity check node.

Figure 11 Trellis structure for a single parity check code (u0 + u1 + u2 + u3 = 0 over GF(2)).

An efficient MAP decoding algorithm for the single parity check code was given in [16]: for independent random variables u0, u1, ..., ul, the extrinsic LLR value for bit uk is computed as:

λ(uk) = ⊞_{∼{uk}} λi(ui),   (21)

where the compact notation ∼{uk} represents the set of all the variables with uk excluded. For brevity, we define a function f(a, b) to represent the operation λi(u1) ⊞ λi(u2) as follows:

f(a, b) = log[(1 + e^a e^b) / (e^a + e^b)],   (22)

where a ≜ λi(u1) and b ≜ λi(u2). Figure 12 shows a forward–backward decoding flow to implement Eq. 21. The forward (α) and backward (β) recursions are defined as:

α_{k+1} = f(α_k, γ_k)   (23)

β_k = f(β_{k+1}, γ_{k+1}),   (24)

where γ_k = λi(uk) and is referred to as the branch metric, in analogy to a Turbo decoder. The α and β metrics are initialized to +∞ at the beginning. Based on the α and β metrics, the extrinsic LLR for uk is computed as:

λ(uk) = f(α_k, β_k).   (25)

Figure 12 A forward–backward decoding flow to compute the extrinsic LLRs for a single parity check code (forward recursion α_{k+1} = f(α_k, γ_k) with α_0 = +∞; backward recursion β_k = f(β_{k+1}, γ_{k+1}) with the final β = +∞; output λ_k = f(α_k, β_k)).

Figure 13 shows a MAP processor structure to decode the single parity check code. Three identical f(a, b) units are used to compute the α, β, and λ values. To relate this to the top-level LDPC decoder architecture shown in Fig. 6, the inputs to this MAP processor are the temporary metrics λt(um,k), and the outputs from this MAP processor are the extrinsic metrics λe(um,k; new).

Figure 13 MAP processor structure for a single parity check code.

To compute Eq. 22 in hardware, we separate the operation into sign and magnitude calculations:

sign(f(a, b)) = sign(a) sign(b),
|f(a, b)| = min(|a|, |b|) + log(1 + e^{−(|a|+|b|)}) − log(1 + e^{−||a|−|b||}).   (26)

Compared to the classical "tanh" function used in LDPC decoding, Φ(x) = −log(tanh(|x/2|)), the f(·) function is numerically more robust and less sensitive to quantization noise. Due to its wide dynamic range (up to +∞), the Φ(x) function has a high complexity and is prone to quantization noise. Although many approximations have been proposed to improve the numerical accuracy of Φ(x) [26, 29, 48], it is still expensive to implement the Φ(x) function in hardware. However, the non-linear term in the f(·) function has a very small dynamic range:

0 < g(x) ≜ log(1 + e^{−|x|}) < 0.7,

thus the f(·) function is easier to implement in hardware using a low-complexity look-up table (LUT). To implement g(x) in hardware, we propose to use the four-value LUT approximation shown in Table 1. For fixed-point implementation, we propose to use a Q.2 quantization scheme (Q total bits with 2 fractional bits). Table 2 shows the proposed LUT implementation for Q.2 quantization. It should be noted that g(x) is the same as the non-linear term in the Turbo max*(·) function (c.f. Eq. 14). Thus, the same look-up table configuration can be applied to the Turbo ACSA unit. In Section 4.4, we will show the decoding performance obtained with this look-up table.

Table 1 LUT approximation for g(x) = log(1 + e^{−|x|}).
|x|:    |x| = 0   0 < |x| ≤ 0.75   0.75 < |x| ≤ 2   |x| > 2
g(x):   0.75      0.5              0.25             0

Table 2 LUT implementation for Q.2 quantization.
|x|:    0  1  2  3  4  5  6  7  8  >8
g(x):   3  2  2  2  1  1  1  1  1  0
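A fixed-point sketch of the magnitude part of Eq. 26 using the Q.2 look-up table of Table 2 (entries in units of 0.25, i.e. two fractional bits). This is a bit-level reference model for illustration, not the synthesized circuit of Fig. 14:

```c
#include <stdlib.h>

/* g(x) = log(1 + e^{-|x|}) approximated by the Q.2 LUT of Table 2.
 * x_abs is |x| in Q.2 units (1 unit = 0.25). */
static int g_q2(int x_abs)
{
    static const int g_lut[9] = {3, 2, 2, 2, 1, 1, 1, 1, 1};
    return (x_abs > 8) ? 0 : g_lut[x_abs];
}

/* |f(a, b)| = min(|a|,|b|) + g(|a|+|b|) - g(||a|-|b||), all in Q.2.
 * The sign of f(a, b) is sign(a)*sign(b) and is handled separately. */
static int f_mag_q2(int a, int b)
{
    int aa = abs(a), bb = abs(b);
    int mn = (aa < bb) ? aa : bb;
    return mn + g_q2(aa + bb) - g_q2(abs(aa - bb));
}
```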

Figure 14 depicts a circuit implementation for the LDPC |f(a, b)| functional unit using two look-up tables, "LUT-S" and "LUT-U", where LUT-S and LUT-U implement log(1 + e^{−||a|−|b||}) and log(1 + e^{−(|a|+|b|)}), respectively. The difference between LUT-S and LUT-U is that LUT-S is a signed-input look-up table that takes both positive and negative data inputs, whereas LUT-U is an unsigned-input look-up table (half the size of LUT-S) that only takes positive data inputs.

Figure 14 Circuit diagram for the LDPC |f(a, b)| functional unit.

4.3 Proposed Flexible Functional Unit (FFU)

If we compare the LDPC |f(a, b)| functional unit (c.f. Fig. 14) with the Turbo ACSA functional unit (c.f. Fig. 10), we can see that they have many commonalities except for the position of the look-up tables and the multiplexor. To support both LDPC and Turbo codes with minimum hardware overhead, we propose a flexible functional unit (FFU), which is depicted in Fig. 15. We modify the look-up table structure so that each look-up table can be bypassed when its bypass control signal is high. A select signal is used to switch between the LDPC mode and the Turbo mode. The functionality of the proposed FFU architecture is summarized in Table 3.

Table 3 Functional description of the FFU.
Signals   LDPC mode    Turbo mode
select    1            0
bypass1   0            1
bypass2   1            0
X         |a|          α0
Y         |b|          γ0
V         |a|          α1
W         −|b|         γ1
Z         |f(a, b)|    max*(α0 + γ0, α1 + γ1)

Figure 15 Circuit diagram for the flexible functional unit (FFU) for LDPC/Turbo decoding.

The word lengths for X, Y, V, and W are all 9 bits. To evaluate the area efficiency of the proposed FFU, we have described the LDPC f(a, b) unit, the Turbo ACSA unit, and the proposed FFU in Verilog HDL and synthesized them on a TSMC 90 nm CMOS technology. The maximum achievable frequency (assuming no clock skew) and the synthesized area at two frequencies (400 and 800 MHz) are summarized in Table 4. As can be seen, the proposed flexible functional unit (FFU) has only about 15% area and timing overhead compared to the dedicated functional units. The area efficiency is achieved because many logic gates can be shared between the LDPC and Turbo modes.

Table 4 Synthesis results for different functional units.
Functional unit   |f(a, b)|    ACSA         FFU
Max frequency     920 MHz      885 MHz      815 MHz
Area (400 MHz)    1,192 µm2    1,263 µm2    1,419 µm2
Area (800 MHz)    1,882 µm2    2,086 µm2    2,423 µm2
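The mode multiplexing of Table 3 can also be modeled behaviorally as one shared datapath that produces either |f(a, b)| or max*(α0 + γ0, α1 + γ1). The sketch below is a functional model under that interpretation, not the Fig. 15 netlist; operand widths, saturation, and pipeline registers are omitted, and all values are in Q.2 units:

```c
#include <stdlib.h>

/* Q.2 LUT of Table 2 (values in units of 0.25). */
static const int G_LUT[9] = {3, 2, 2, 2, 1, 1, 1, 1, 1};
static int g_q2(int x_abs) { return (x_abs > 8) ? 0 : G_LUT[x_abs]; }

/* Functional model of the FFU (Table 3). In LDPC mode the operands are
 * X = |a|, Y = |b|, V = |a|, W = -|b|; in Turbo mode X = a0, Y = g0,
 * V = a1, W = g1. The same adders and g() tables serve both modes. */
static int ffu(int ldpc_mode, int X, int Y, int V, int W)
{
    if (ldpc_mode) {
        int sum  = X + Y;                     /* |a| + |b|                  */
        int diff = V + W;                     /* |a| - |b| (signed)         */
        int mn   = (diff < 0) ? V : -W;       /* min(|a|, |b|)              */
        return mn + g_q2(sum) - g_q2(abs(diff));   /* |f(a,b)|, Eq. 26      */
    } else {
        int p0 = X + Y, p1 = V + W;           /* a0 + g0, a1 + g1           */
        int mx = (p0 > p1) ? p0 : p1;
        return mx + g_q2(abs(p0 - p1));       /* max* of Eq. 20             */
    }
}
```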

4.4 Fixed Point Decoding Performance

To evaluate the fixed-point decoding performance using the look-up table based FFU, we perform floating-point and bit-accurate fixed-point simulations for LDPC and Turbo codes using BPSK modulation over an AWGN channel. As a good trade-off between complexity and performance, we use a 6.2 quantization scheme for the channel LLR inputs of the fixed-point LDPC and Turbo decoders.

Figure 16 shows the bit error rate (BER) simulation result for a WiMAX LDPC code with code rate 1/2 and code length 2,304. The maximum number of iterations is 15. As can be seen from Fig. 16, the fixed-point FFU solution has a very small performance degradation (< 0.05 dB) at a BER level of 10^{−6} compared to the floating-point solution. We also plot a BER curve for the scaled min-sum solution [11], which is a sub-optimal approximation algorithm that does not use the look-up tables. As can be seen from the figure, the look-up table based FFU solution delivers better decoding performance than the scaled min-sum solution. The complexity of adding the look-up tables is relatively small because the word length of the data in the look-up table is only 2 bits. Figure 17 compares the convergence speed of the layered decoding algorithm with that of the standard two-phase decoding algorithm.

Figure 16 Simulation results for a rate-1/2, length-2,304 WiMAX LDPC code (BER vs. Eb/N0; floating point, fixed-point FFU, and fixed-point scaled min-sum).

Figure 17 Comparison of the convergence speed (average number of iterations vs. Eb/N0 for the standard and layered algorithms).

Figure 18 shows the BER simulation results for 3GPP-LTE Turbo codes with block sizes of 6,144, 1,024, 240, and 40. The maximum number of Turbo iterations is 6 (12 half iterations). The sliding window length is 32. As can be seen from the figure, the FFU based fixed-point decoder has almost no performance loss compared to the floating-point case. The proposed FFU solution also delivers better decoding performance than the sub-optimal max-logMAP solution.

Figure 18 Simulation results for 3GPP-LTE Turbo codes with a variety of block sizes (floating point vs. fixed point; N = 6144, 1024, 240, 40).

From these simulation results, we conclude that the proposed look-up table based FFU is a good solution for supporting high-performance LDPC and Turbo decoding requirements.

5 Design of a Flexible SISO Decoder

Built on top of the FFU arithmetic unit, we introduce a flexible SISO decoder architecture to handle LDPC and Turbo codes. Figure 19 illustrates the proposed dual-mode SISO decoder architecture. The decoder comprises four major functional units: the alpha unit (α), the beta unit (β), the extrinsic-1 unit, and the extrinsic-2 unit. The decoder can be reconfigured to process: i) an eight-state convolutional Turbo code, or ii) 8 single parity check codes.

5.1 Turbo Mode

In the Turbo mode, all the elements in the Flex-SISO decoder are activated. For Turbo decoding, we use the Next Iteration Initialization (NII) sliding window algorithm as suggested in [1, 19].

Figure 19 Flexible SISO decoder architecture (BMC unit, γ stack, α stack, alpha/beta units with eight FFUs each, extrinsic-1 and extrinsic-2 units).

The NII approach avoids the calculation of training sequences as initialization values for the β state metrics; instead, the boundary metrics are initialized from the previous iteration. As a result, the decoding latency is smaller than that of the traditional sliding window algorithm, which requires the calculation of training sequences [25, 43], and thus only one β unit is required. Moreover, this solution is very suitable for high code-rate Turbo codes, which would require a very long training sequence to obtain reliable boundary state metrics. Note that this scheme requires an additional memory to store the boundary state metrics.

A dataflow graph for the NII sliding window algorithm is depicted in Fig. 20, where the X-axis represents the trellis flow and the Y-axis represents the decoding time, so that each box represents the processing of a block of L data in L time steps, where L is the sliding window size. In the decoding process, the α metrics are computed in the natural order, whereas the β metrics and the extrinsic LLRs (λe) are computed in the reverse order. By using multiple FFUs, the α and β units are able to compute the state metrics in parallel, leading to real-time decoding with a latency of L.

The decoder works as follows. The decoder uses the soft LLR value λi(u) and the old extrinsic value λe(u; old) to compute λt(u) based on Eq. 16. A branch metric calculation (BMC) unit is used to compute the branch metrics γ(u, p) based on Eq. 18, where u, p ∈ {0, 1}. The branch metrics are then buffered in a γ stack for the backward (β) metric calculation. The α and β metrics are computed using Eqs. 10 and 11. The boundary β metrics are initialized from an NII buffer (not shown in Fig. 19). A dispatcher unit is used to dispatch the data to the correct FFUs in the α/β unit. Each α/β unit has fully parallel FFUs (eight of them), so the eight-state convolutional trellis can be processed at a rate of one stage per clock cycle.

To compute the extrinsic LLR as defined in Eq. 9, we first add the β metrics to the extrinsic branch metrics γ^e(p), where γ^e(p) is retrieved from the γ stack, as γ^e(0) = 0, γ^e(1) = γ(0, 1) = λc(p). The extrinsic LLR calculation is separated into two phases, as shown in the right part of Fig. 19. In phase 1, the extrinsic-1 unit performs eight ACSA operations in parallel using eight FFUs. In phase 2, the extrinsic-2 unit performs 6 max*(a, b) operations and 1 subtraction. Finally, the soft LLR λo(u) is obtained by adding λe(u; new) to λt(u), where λt(u) is also retrieved from the γ stack, as λt(u) = γ(1, 0).

Figure 20 Data flow graph for Turbo decoding (sliding windows of length L; the β boundary metrics are initialized from the previous iteration via NII).
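The two-phase extrinsic calculation described above (eight ACSA-style additions in the extrinsic-1 unit, then a max* reduction and one subtraction in the extrinsic-2 unit) amounts to evaluating Eq. 9 state by state. A floating-point sketch follows; the trellis connectivity tables are illustrative placeholders, not the actual LTE trellis definition:

```c
#include <math.h>

static double max_star(double a, double b)       /* Eq. 14 */
{
    return fmax(a, b) + log1p(exp(-fabs(a - b)));
}

/* Extrinsic LLR of Eq. 9 for one trellis step of an 8-state code.
 * next_state[s][u] and parity[s][u] describe the branch taken from state s
 * on input bit u; ge[p] is the extrinsic branch metric (ge[0] = 0,
 * ge[1] = lambda_c(p)). */
double extrinsic_llr(const double alpha[8], const double beta_next[8],
                     const double ge[2],
                     const int next_state[8][2], const int parity[8][2])
{
    double m1 = -1e30, m0 = -1e30;               /* running max* results    */
    for (int s = 0; s < 8; s++) {
        /* phase 1: one add-compare path per state and input bit            */
        double b1 = alpha[s] + ge[parity[s][1]] + beta_next[next_state[s][1]];
        double b0 = alpha[s] + ge[parity[s][0]] + beta_next[next_state[s][0]];
        /* phase 2: max* reduction over the eight states                    */
        m1 = max_star(m1, b1);
        m0 = max_star(m0, b0);
    }
    return m1 - m0;                              /* final subtraction       */
}
```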

5.2 LDPC Mode

In the LDPC mode, a substantial subset (more than 90%) of the logic gates is reused from the Turbo mode. As shown in Fig. 21, three major functional units (the α unit, the β unit, and the extrinsic-1 unit) and the two stack memories are reused in the LDPC mode. The extrinsic-2 unit is de-activated in the LDPC mode. The decoder can process 8 single parity check codes in parallel because each of the α unit, β unit, and extrinsic-1 unit has eight parallel FFUs.

Figure 21 Flexible SISO decoder architecture in LDPC mode.

The dataflow graph of the LDPC decoding (c.f. Fig. 12) is very similar to that of the Turbo decoding (c.f. Fig. 20). The decoder works as follows. The decoder first computes λt(u) based on Eq. 5. In the LDPC mode, the branch metric γ is equal to λt(u). Prior to decoding, the α and β metrics are initialized to the maximum value. Assume the check node degree is L. In the first L cycles, the α unit recursively computes the α metrics in the forward direction and stores them in an α stack. In the next L cycles, the β unit recursively computes the β metrics in the backward direction. At the same time, the extrinsic-1 unit computes the extrinsic LLRs using the α and β metrics. While the β unit and the extrinsic-1 unit are working on the first data stream, the α unit can work on the second stream, which leads to a pipelined implementation.

5.3 Performance

The proposed Flex-SISO decoder has been synthesized on a TSMC 90 nm CMOS technology. Table 5 summarizes the area distribution of this decoder. The maximum clock frequency is 500 MHz and the synthesized area is 0.098 mm2. The Flex-SISO decoder is a basic building block in an LDPC decoder or a Turbo decoder, and can be reconfigured to process an eight-state trellis for a Turbo code, or eight check rows for an LDPC code. As the baseline design, a single Flex-SISO decoder can approximately support 30–40 Mbps (LTE) Turbo decoding, or 40–50 Mbps (802.16e or 802.11n) LDPC decoding. In a parallel processing environment, multiple SISO decoders can be used to increase the throughput.

Table 5 Flex-SISO decoder area distribution.
Unit                       Area (mm2)
α-unit                     0.014
β-unit                     0.014
Extrinsic-1 unit           0.014
Extrinsic-2 unit           0.004
α and γ stack memories     0.045
Control logic & others     0.007
Total                      0.098

Figure 22 Parallel LDPC/Turbo decoder architecture based on multiple Flex-SISO decoder cores (APP memory, Turbo parity memory, extrinsic memories, and two permuters).
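The LDPC-mode schedule of Section 5.2 — a forward pass that fills the α stack, followed by a backward pass that emits the extrinsic outputs while β is updated — can be summarized for one check row as follows. This is a floating-point sketch with illustrative names; the real decoder runs eight such rows per cycle and overlaps consecutive rows so the α unit is never idle:

```c
#include <math.h>

/* f(a, b) of Eq. 22, evaluated through the sign/magnitude form of Eq. 26. */
static double f_op(double a, double b)
{
    double s  = ((a < 0) ^ (b < 0)) ? -1.0 : 1.0;
    double aa = fabs(a), bb = fabs(b);
    double mn = (aa < bb) ? aa : bb;
    return s * (mn + log1p(exp(-(aa + bb))) - log1p(exp(-fabs(aa - bb))));
}

/* MAP processing of one single parity check of degree L (L <= 64 assumed).
 * gamma[k] = lambda_t(u_k); lambda_e[k] receives the extrinsic output. */
void spc_map(const double *gamma, double *lambda_e, int L)
{
    double alpha[64];
    const double INF = 1e30;

    alpha[0] = INF;                              /* alpha_0 = +inf, Eq. 23  */
    for (int k = 1; k < L; k++)                  /* forward pass (L cycles) */
        alpha[k] = f_op(alpha[k - 1], gamma[k - 1]);

    double beta = INF;                           /* final beta = +inf       */
    for (int k = L - 1; k >= 0; k--) {           /* backward pass (L cycles)*/
        lambda_e[k] = f_op(alpha[k], beta);      /* Eq. 25                  */
        beta = f_op(beta, gamma[k]);             /* Eq. 24                  */
    }
}
```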

Table 6 Performance of the proposed parallel decoder (3.2 mm2 core area, 500 MHz clock frequency, TSMC 90 nm technology).
Supported codes Code size (bit) Parallelism Quantization Max. iteration Max. throughput (Mbps) Latency
LDPC 802.16e 576–2,304 z = 24–96 6.2 15 600 1,590 cycles
LDPC 802.11n 648–1,944 z = 27–81 6.2 15 500 1,620 cycles
Turbo 3GPP-LTE 40–6,144 Sub-block = 1–12 6.2 6 450 6,822 cycles

6 Parallel Decoder Architecture Using Multiple Flex-SISO Decoder Cores

For high-throughput applications, it is necessary to use multiple SISO decoders working in parallel to increase the decoding speed. For parallel Turbo decoding, multiple SISO decoders can be employed by dividing a codeword block into several sub-blocks; each sub-block is then processed separately by a dedicated SISO decoder [7, 20, 30, 41, 42]. For LDPC decoding, decoder parallelism can be achieved by employing multiple check node processors [10, 14, 32, 40, 49].

Based on the Flex-SISO decoder core, we propose a parallel LDPC/Turbo decoder architecture, which is shown in Fig. 22. As depicted, the parallel decoder comprises P Flex-SISO decoder cores. In this architecture, there are three types of storage. The extrinsic memory (Ext-Mem) is used for storing the extrinsic LLR values produced by each SISO core. The APP memory (APP-Mem) is used to store the initial and updated LLR values; it is partitioned into multiple banks to allow parallel data transfer. The Turbo parity memory is used to store the channel LLR values for each parity bit in a Turbo codeword; this memory is not used for LDPC decoding (parity bits are treated as information bits for LDPC decoding). Two permuters are used to perform the permutation of the APP values back and forth.

As a case study, we have designed a high-throughput, flexible LDPC/Turbo decoder to support the following three codes: 1) the 802.16e WiMAX LDPC code, 2) the 802.11n WLAN LDPC code, and 3) the 3GPP-LTE Turbo code. Table 6 summarizes the performance and design parameters of this decoder. The number of Flex-SISO decoders is chosen to be 12.

For LDPC decoding, with 12 available Flex-SISO cores the decoder can process up to 12 × 8 = 96 check nodes simultaneously. Because the sub-matrix size z is between 24 and 96 for 802.16e LDPC codes, and between 27 and 81 for 802.11n, the proposed decoder always guarantees that all of the z check nodes within a layer can be processed in parallel.

For 3GPP-LTE Turbo decoding, the codeword can be partitioned into M sub-blocks for parallel processing. The LTE Turbo code uses a quadratic permutation polynomial (QPP) interleaver [36], which allows conflict-free memory access as long as M is a factor of the codeword length. There are 188 different codeword sizes defined in LTE. For LTE Turbo codes, all of the codewords can support a parallelism level of 8, and some of the codewords can support a parallelism level of 10 or 12. Because we have 12 Flex-SISO cores available, we dynamically allocate the maximum possible number of Flex-SISO cores (8 ≤ M ≤ 12) constrained by the QPP interleaver parallelism. As an example, for the maximum codeword size of 6,144, we can allocate all 12 Flex-SISO cores to work in parallel. It should be noted that the parallelism level has some impact on the error performance of the decoder due to the edge effects caused by the sub-block partitioning [17].
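The conflict-free property quoted above follows from the form of the QPP interleaver, π(i) = (f1·i + f2·i²) mod K, with f1 and f2 taken from the LTE tables for each block size K. The sketch below assumes that standard form and the usual bank mapping (bank = address / (K/M)); it illustrates the property and is not part of the decoder itself:

```c
#include <stdint.h>

/* pi(i) = (f1*i + f2*i^2) mod K, computed with 64-bit intermediates. */
static uint32_t qpp(uint32_t i, uint32_t f1, uint32_t f2, uint32_t K)
{
    uint64_t t = ((uint64_t)f2 * i) % K;             /* f2*i mod K          */
    return (uint32_t)(((uint64_t)f1 * i + t * i) % K);
}

/* Check that M decoders (M divides K) reading addresses pi(j + m*K/M) at a
 * common local offset j always hit M distinct banks of size K/M. */
static int qpp_conflict_free(uint32_t K, uint32_t M, uint32_t f1, uint32_t f2)
{
    uint32_t W = K / M;                              /* sub-block length    */
    for (uint32_t j = 0; j < W; j++) {
        uint32_t used = 0;                           /* bank bitmap, M <= 32 */
        for (uint32_t m = 0; m < M; m++) {
            uint32_t bank = qpp(j + m * W, f1, f2, K) / W;
            if (used & (1u << bank))
                return 0;                            /* bank conflict       */
            used |= 1u << bank;
        }
    }
    return 1;
}
```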

Table 7 Turbo decoder architecture comparison with existing solutions.

                      This work              [2]                        [34]                      [28]
Modes                 Turbo, LDPC            Viterbi, Turbo, LDPC       Turbo, LDPC               Viterbi, Turbo, LDPC, RS
Technology            90 nm                  65 nm                      130 nm                    90 nm
Clock frequency       500 MHz                400 MHz                    200 MHz                   NA
Core area             3.2 mm2                0.62 mm2                   NA                        NA
Throughput (LDPC)     600 Mbps (@15 iter.)   257 Mbps (@10 iter.)       11.2 Mbps (@10 iter.)     70 Mbps
Throughput (Turbo)    450 Mbps^a (@6 iter.)  18.6 Mbps^a (@5 iter.)     86.5 Mbps^b (@8 iter.)    14 Mbps^a

a Binary Turbo code
b Double-binary Turbo code

This parallel and flexible decoder has been implemented in Verilog HDL and synthesized on a TSMC 90 nm CMOS technology using Synopsys Design Compiler. The maximum clock frequency of this decoder is 500 MHz. The synthesized core area is 3.2 mm2, which includes all of the components in this decoder. Table 6 summarizes the features of this decoder. The decoder can be configured to support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Compared to a dedicated LDPC decoder solution [37], this flexible decoder has only about 15–20% area overhead when normalized to the same throughput target (with the same number of iterations). Compared to a dedicated Turbo decoder solution [30], our flexible decoder shows only about 10–20% area overhead when normalized to the same technology and the same throughput and code length.

7 Related Work and Architecture Comparison

Multi-mode Turbo decoders are an increasingly important component in mobile wireless devices. To support multi-mode decoding, ASIC/ASIP/MPSoC/SIMD architectures have recently been proposed [2, 28, 34]. In [2], a reconfigurable application-specific instruction-set processor (ASIP) architecture is presented for convolutional, Turbo, and LDPC code decoding. In [34], a multiprocessor system-on-chip (MPSoC) architecture is described for LDPC and Turbo code decoding. In [28], a SIMD-like processor architecture is proposed for Viterbi, Turbo, Reed-Solomon, and LDPC decoding. Table 7 shows an architecture comparison and tradeoff analysis of these decoders. Each approach has a different benefit in terms of flexibility. Our focus is to achieve the highest throughput for both LDPC and Turbo codes. As can be seen from the table, the proposed decoder can support very high throughput LDPC/Turbo decoding at a small silicon area cost.

8 Conclusion

In this work, we present a flexible decoder architecture to support LDPC and Turbo codes. We propose a dual-mode Flex-SISO decoder as a basic building block in LDPC and Turbo decoders. Our study has been focused on the Flex-SISO decoder architecture design and implementation. We unify the decoding process for LDPC and Turbo codes so that the same Flex-SISO decoder can be re-used for both cases, resulting in more than 80% resource sharing. To increase decoding throughput, we propose a parallel LDPC/Turbo decoder using multiple Flex-SISO cores. With a core area of 3.2 mm2, the decoder is able to sustain 600 Mbps 802.16e LDPC decoding, 500 Mbps 802.11n LDPC decoding, or 450 Mbps 3GPP LTE Turbo decoding. The proposed architecture can significantly reduce the cost of a multi-mode receiver.

Acknowledgements The authors would like to thank Nokia, Nokia Siemens Networks (NSN), Xilinx, Texas Instruments (TI), and the US National Science Foundation (under grants CCF-0541363, CNS-0551692, CNS-0619767, CNS-0923479, and EECS-0925942) for their support of this research.

References

1. Abbasfar, A., & Yao, K. (2003). An efficient and practical architecture for high speed turbo decoders. IEEE Vehicular Technology Conference, 1, 337–341.
2. Alles, M., Vogt, T., & Wehn, N. (2008). FlexiChaP: A reconfigurable ASIP for convolutional, turbo, and LDPC code decoding. In 2008 5th International symposium on turbo codes and related topics (pp. 84–89).
3. Bahl, L., Cocke, J., Jelinek, F., & Raviv, J. (1974). Optimal decoding of linear codes for minimizing symbol error rate. IEEE Transactions on Information Theory, IT-20, 284–287.
4. Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo-codes. In IEEE Int. conf. commun. (pp. 1064–1070).
5. Bickerstaff, M., Davis, L., Thomas, C., Garrett, D., & Nicol, C. (2003). A 24Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile wireless. In IEEE Int. solid-state circuit conf. (ISSCC).
6. Blanksby, A. J., & Howland, C. J. (2002). A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE Journal of Solid-State Circuits, 37, 404–412.
7. Bougard, B., Giulietti, A., Derudder, V., Weijers, J. W., Dupont, S., Hollevoet, L., Catthoor, F., et al. (2003). A scalable 8.7-nJ/bit 75.6-Mb/s parallel concatenated convolutional (turbo-) codec. In IEEE International solid-state circuit conference (ISSCC).
8. Bougard, B., Giulietti, A., Van der Perre, L., & Catthoor, F. (2002). A class of power efficient VLSI architectures for high speed turbo-decoding. In IEEE conf. global telecommunications (Vol. 1, pp. 549–553).
9. Brack, T., Alles, M., Kienle, F., & Wehn, N. (2006). A synthesizable IP core for WIMAX 802.16e LDPC code decoding. In IEEE 17th Int. symp. personal, indoor and mobile radio communications (pp. 1–5).
10. Brack, T., Alles, M., Lehnigk-Emden, T., Kienle, F., Wehn, N., L'Insalata, N., et al. (2007). Low complexity LDPC code decoders for next generation standards. In Design, automation, and test in Europe (pp. 331–336). New York: ACM.
11. Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M., & Hu, X. (2005). Reduced-complexity decoding of LDPC codes. IEEE Transactions on Communications, 53, 1288–1299.
12. Dai, Y., Yan, Z., & Chen, N. (2006). High-throughput turbo-sum-product decoding of QC LDPC codes. In 40th Annual conf. on info. sciences and syst. (pp. 839–844).
13. Gallager, R. (1963). Low-density parity-check codes. Cambridge: MIT Press.
14. Gunnam, K. K., Choi, G. S., Yeary, M. B., & Atiquzzaman, M. (2007). VLSI architectures for layered decoding for irregular LDPC codes of WiMax. In IEEE International Conference on Communications (ICC) (pp. 4542–4547).
15. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of binary block and convolutional codes. IEEE Transactions on Information Theory, 42(2), 429–445.

16. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of binary block and convolutional codes. IEEE Transactions on Information Theory, 42, 429–445.
17. He, Z., Fortier, P., & Roy, S. (2006). Highly-parallel decoding architectures for convolutional turbo codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(10), 1147–1151.
18. Hocevar, D. (2004). A reduced complexity decoder architecture via layered decoding of LDPC codes. In IEEE workshop on signal processing systems (SIPS) (pp. 107–112).
19. Dielissen, J., & Huisken, J. (2000). State vector reduction for initialization of sliding windows MAP. In 2nd International symposium on turbo codes and related topics.
20. Lee, S. J., Shanbhag, N., & Singer, A. (2005). Area-efficient high-throughput MAP decoder architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13, 921–933.
21. Lin, Y., Mahlke, S., Mudge, T., & Chakrabarti, C. (2006). Design and implementation of turbo decoders for software defined radio. In IEEE SIPS (pp. 22–27).
22. Lu, J., & Moura, J. (2003). Turbo like decoding of LDPC codes. In IEEE Int. conf. on magnetics (pp. DT-11).
23. MacKay, D. J. C. (1998). Turbo codes are low density parity check codes. Available online, http://www.inference.phy.cam.ac.uk/mackay/turbo-ldpc.pdf.
24. Mansour, M. M., & Shanbhag, N. R. (2003). High-throughput LDPC decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11, 976–996.
25. Masera, G., Piccinini, G., Roch, M., & Zamboni, M. (1999). VLSI architecture for turbo codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7, 369–379.
26. Masera, G., Quaglio, F., & Vacca, F. (2005). Finite precision implementation of LDPC decoders. In IEEE proc. commun. (Vol. 152, pp. 1098–1102).
27. Mohsenin, T., Truong, D., & Baas, B. (2009). Multi-split-row threshold decoding implementations for LDPC codes. In IEEE International symposium on circuits and systems (ISCAS'09) (pp. 2449–2452).
28. Niktash, A., Parizi, H., Kamalizad, A., & Bagherzadeh, N. (2008). RECFEC: A reconfigurable FEC processor for Viterbi, turbo, Reed-Solomon and LDPC coding. In IEEE Wireless communications and networking conference (WCNC) (pp. 605–610).
29. Oh, D., & Parhi, K. (2006). Low complexity implementations of sum-product algorithm for decoding low-density parity-check codes. In IEEE Workshop on signal processing systems (SIPS) (pp. 262–267).
30. Prescher, G., Gemmeke, T., & Noll, T. (2005). A parametrizable low-power high-throughput turbo-decoder. In IEEE Int. conf. acoustics, speech, and signal processing (Vol. 5, pp. 25–28).
31. Robertson, P., Villebrun, E., & Hoeher, P. (1995). A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain. In IEEE Int. conf. commun. (ICC) (pp. 1009–1013).
32. Rovini, M., Gentile, G., Rossi, F., & Fanucci, L. (2007). A scalable decoder architecture for IEEE 802.11n LDPC codes. In IEEE global telecommunications conference (pp. 3270–3274).
33. Salmela, P., Sorokin, H., & Takala, J. (2008). A programmable Max-Log-MAP turbo decoder implementation. Hindawi VLSI Design, 2008, 636–640.
34. Scarpellino, M., Singh, A., Boutillon, E., & Masera, G. (2008). Reconfigurable architecture for LDPC and turbo decoding: A NoC case study. In IEEE 10th International symposium on spread spectrum techniques and applications (pp. 671–676).
35. Shih, X. Y., Zhan, C. Z., Lin, C. H., & Wu, A. Y. (2008). An 8.29 mm2 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 µm CMOS process. IEEE Journal of Solid-State Circuits, 43, 672–683.
36. Sun, J., & Takeshita, O. (2005). Interleavers for turbo codes using permutation polynomials over integer rings. IEEE Transactions on Information Theory, 51, 101–119.
37. Sun, Y., & Cavallaro, J. R. (2008). A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards. In IEEE International SOC conference (pp. 367–370).
38. Sun, Y., & Cavallaro, J. R. (2008). Unified decoder architecture for LDPC/Turbo codes. In IEEE Workshop on Signal Processing Systems (SIPS) (pp. 13–18).
39. Sun, Y., Karkooti, M., & Cavallaro, J. R. (2006). High throughput, parallel, scalable LDPC encoder/decoder architecture for OFDM systems. In IEEE workshop on design, applications, integration and software (pp. 39–42).
40. Sun, Y., Karkooti, M., & Cavallaro, J. R. (2007). VLSI decoder architecture for high throughput, variable block-size and multi-rate LDPC codes. In IEEE International symposium on circuits and systems (ISCAS) (pp. 2104–2107).
41. Sun, Y., Zhu, Y., Goel, M., & Cavallaro, J. R. (2008). Configurable and scalable high throughput turbo decoder architecture for multiple 4G wireless standards. In IEEE International conference on application-specific systems, architectures and processors (ASAP) (pp. 209–214).
42. Thul, M. J., Gilbert, F., Vogt, T., Kreiselmaier, G., & Wehn, N. (2005). A scalable system architecture for high-throughput turbo-decoders. Journal of VLSI Signal Processing, 39, 63–77.
43. Viterbi, A. (1998). An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes. IEEE Journal on Selected Areas in Communications, 16, 260–264.
44. Wang, Z., Chi, Z., & Parhi, K. (2002). Area-efficient high-speed decoding schemes for turbo decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10, 902–912.
45. Wang, Z., & Cui, Z. (2007). Low-complexity high-speed decoder design for quasi-cyclic LDPC codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15, 104–114.
46. Zhang, J., & Fossorier, M. (2002). Shuffled belief propagation decoding. In Asilomar Conference on signals, systems and computers (Vol. 1, pp. 8–15).
47. Zhang, K., Huang, X., & Wang, Z. (2009). High-throughput layered decoder implementation for quasi-cyclic LDPC codes. IEEE Journal on Selected Areas in Communications, 27(6), 985–994.
48. Zhang, T., Wang, Z., & Parhi, K. (2001). On finite precision implementation of low density parity check codes decoder. In IEEE Int. symposium on circuits and systems (ISCAS) (Vol. 4, pp. 202–205).
49. Zhong, H., & Zhang, T. (2005). Block-LDPC: A practical LDPC coding system design approach. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 52(4), 766–775 (see also IEEE Transactions on Circuits and Systems I: Regular Papers).
50. Zhu, Y., & Chakrabarti, C. (2009). Architecture-aware LDPC code design for multiprocessor software defined radio systems. IEEE Transactions on Signal Processing, 57, 3679–3692.

Yang Sun received the B.S. degree in Testing Technology & Instrumentation in 2000 and the M.S. degree in Instrument Science & Technology in 2003, both from Zhejiang University, Hangzhou, China. From 2003 to 2004, he was with S3 Graphics Co. Ltd. as an ASIC design engineer, developing Graphics Processing Unit (GPU) cores for graphics chipsets. From 2004 to 2005, he was with Conexant Systems Inc. as an ASIC design engineer, developing video decoder cores for set-top box (STB) chipsets. During the summers of 2007 and 2008, he worked at the Texas Instruments R&D center as an intern, developing LDPC and Turbo error-correcting decoders. He is currently a PhD student in the Department of Electrical and Computer Engineering at Rice University, Houston, Texas. His research interests include parallel algorithms and VLSI architectures for wireless communication systems. He received the 2008 IEEE SoC Conference Best Paper Award, the 2008 IEEE Workshop on Signal Processing Systems Bob Owens Memorial Paper Award, and the 2009 ACM GLSVLSI Best Student Paper Award.

Joseph R. Cavallaro received the B.S. degree from the University of Pennsylvania, Philadelphia, PA, in 1981, the M.S. degree from Princeton University, Princeton, NJ, in 1982, and the Ph.D. degree from Cornell University, Ithaca, NY, in 1988, all in electrical engineering. From 1981 to 1983, he was with AT&T Bell Laboratories, Holmdel, NJ. In 1988, he joined the faculty of Rice University, Houston, TX, where he is currently a Professor of electrical and computer engineering. His research interests include computer arithmetic, VLSI design and microlithography, and DSP and VLSI architectures for applications in wireless communications. During the 1996–1997 academic year, he served at the National Science Foundation as Director of the Prototyping Tools and Methodology Program. He was a Nokia Foundation Fellow and a Visiting Professor at the University of Oulu, Finland, in 2005 and continues his affiliation there as an Adjunct Professor. He is currently the Associate Director of the Center for Multimedia Communication at Rice University. He is a Senior Member of the IEEE. He was Co-chair of the 2004 Signal Processing for Communications Symposium at the IEEE Global Communications Conference and General Co-chair of the 2004 IEEE 15th International Conference on Application-Specific Systems, Architectures and Processors (ASAP).
