Flexible LDPC/Turbo Decoder Design
DOI 10.1007/s11265-010-0477-6
Received: 21 November 2009 / Revised: 11 March 2010 / Accepted: 12 March 2010 / Published online: 9 April 2010
© Springer Science+Business Media, LLC 2010
Abstract Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo codes decoding. We view the LDPC code as a concatenation of n super-codes where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo codes decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm². The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Running at 500 MHz clock frequency, the decoder can sustain up to 600 Mbps LDPC decoding or 450 Mbps Turbo decoding.

Keywords SISO decoder · LDPC decoder · Turbo decoder · Error correcting codes · MAP algorithm · Reconfigurable architecture

Y. Sun (B) · J. R. Cavallaro
Department of Electrical and Computer Engineering, Rice University, 6100 Main Street, Houston, TX 77005, USA
e-mail: [email protected]
J. R. Cavallaro
e-mail: [email protected]

1 Introduction

Practical wireless communication channels are inherently “noisy” due to the impairments caused by channel distortions and multipath effect. Error correcting codes are widely used to increase the bandwidth and energy efficiency of wireless communication systems. As a core technology in wireless communications, forward error correction (FEC) coding has migrated from basic convolutional/block codes to more powerful Turbo codes and LDPC codes. Turbo codes, introduced by Berrou et al. in 1993 [4], have been employed in 3G and beyond 3G wireless systems, such as UMTS/WCDMA and 3GPP Long-Term Evolution (LTE) systems. As a candidate for 4G coding scheme, LDPC codes, which were introduced by Gallager in 1963 [13], have recently received significant attention in coding theory and have been adopted by some advanced wireless systems such as IEEE 802.16e WiMAX system and IEEE 802.11n WLAN system. In future 4G networks, internetworking and roaming between different networks would require a multi-standard FEC decoder. Since Turbo codes and LDPC codes are widely used in many different 3G/4G systems, it is important to design a configurable decoder to support multiple FEC coding schemes.
2 J Sign Process Syst (2011) 64:1–16
In the literature, many efficient LDPC decoder VLSI architectures have been studied [6, 9, 12, 14, 18, 24, 27, 29, 35, 37, 39, 45, 47]. Turbo decoder VLSI architectures have also been extensively investigated by many researchers [5, 8, 20, 21, 25, 30, 33, 41, 44]. However, designing a flexible decoder to support both LDPC and Turbo codes still remains very challenging. In this paper, we aim to provide an alternative to dedicated silicon that reduces the cost of supporting both LDPC and Turbo codes with a small additional overhead. We propose a flexible decoder architecture to meet the needs of a multi-standard FEC decoder.

From the theoretical point of view, there are some similarities between LDPC and Turbo codes. They can both be represented as codes on graphs which define the constraints satisfied by codewords. Both families of codes are decoded in an iterative manner by employing the sum-product algorithm or belief propagation algorithm. For example, MacKay has related these two codes by treating a Turbo code as a low-density parity-check code [23]. On the other hand, a few other researchers have tried to treat a LDPC code as a Turbo code and apply a turbo-like message passing algorithm to LDPC codes. For example, Mansour and Shanbhag [24] introduce an efficient turbo message passing algorithm for architecture-aware LDPC codes. Hocevar [18] proposes a layered decoding algorithm which treats the parity check matrix as horizontal layers and passes the soft information between layers to improve the performance. Zhu and Chakrabarti [50] looked at the super-code based LDPC construction and decoding. Zhang and Fossorier [46] suggest a shuffled belief propagation algorithm to achieve a faster decoding speed. Lu and Moura [22] propose to partition the Tanner graph into several trees and apply the turbo-like decoding algorithm in each tree for faster convergence rate. Dai et al. [12] introduce a turbo-sum-product hybrid decoding algorithm for quasi-cyclic (QC) LDPC codes by splitting the parity check matrix into two sub-matrices where the information is exchanged.

In our early work [38], we have proposed a super-code based decoding algorithm for LDPC codes. In this paper, we extend this algorithm and present a more generic message passing algorithm for LDPC and Turbo decodings, and then exploit the architecture commonalities between LDPC and Turbo decoders. We create a connection between LDPC and Turbo codes by applying a super-code based decoding algorithm, where a code is divided into multiple super-codes and then the decoding operation is performed by iteratively exchanging the soft information between super-codes. In the LDPC decoding, we treat a LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the maximum a posteriori (MAP) algorithm can be efficiently performed. In the Turbo decoding, we modify the traditional message passing flow so that the proposed super-code based decoding scheme works for Turbo codes as well.

Contributions of this paper are as follows. First, we introduce a flexible soft-input soft-output (Flex-SISO) module for LDPC and Turbo codes decoding. Second, we introduce an area-efficient flexible functional unit (FFU) for implementing the MAP algorithm in hardware. Third, we propose a flexible SISO decoder hardware architecture based on the FFU. Finally, we show how to enable parallel decoding by using multiple such Flex-SISO decoders.

The remainder of the paper is organized as follows. Section 2 reviews the super-code based decoding algorithm for LDPC codes. Section 3 presents a Flex-SISO module for LDPC/Turbo decoding. Section 4 introduces a flexible functional unit (FFU) for LDPC and Turbo decoding. Based on the FFU, Section 5 describes a dual-mode Flex-SISO decoder architecture. Section 6 presents a parallel decoder architecture using multiple Flex-SISO cores. Section 7 compares our flexible decoder with existing decoders in the literature. Finally, Section 8 concludes the paper.

2 Review of Super-code Based Decoding Algorithm for LDPC Codes

By definition, a Turbo code is a parallel concatenation of two super-codes, where each super-code is a constituent convolutional code. Naturally, Turbo decoding procedure can be partitioned into two phases where each phase corresponds to one super-code processing. Similarly, LDPC codes can also be partitioned into super-codes for efficient processing as previously mentioned in Section 1. Before proceeding with a discussion of the proposed flexible decoder architecture, it is desirable to review the super-code based LDPC decoding scheme in this section.

2.1 Trellis Structure for LDPC Codes

A binary LDPC code is a linear block code specified by a very sparse binary M × N parity check matrix:

$$H \cdot x^T = 0, \tag{1}$$

where x is a codeword (x ∈ C) and H can be viewed as a bipartite graph where each column and row in H represent a variable node and a check node, respectively. Each element of the parity check matrix is
Figure 1 Trellis representation for LDPC codes where a two-state trellis diagram is associated with each check node.

Figure 3 A block-structured parity check matrix, where each block row (or layer) defines a super-code. Each sub-matrix of the parity check matrix is either a zero matrix or a z × z cyclically shifted identity matrix.
Figure 7 Traditional Turbo decoding procedure using two SISO decoders, where the extrinsic LLR values are exchanged between two SISO decoders.

where $\lambda_i(u_k)$ is the soft input log likelihood ratio (LLR) and $\lambda_e(u_{m,k}; \text{old})$ is the old extrinsic value generated by this MAP processor in the previous iteration. Then the new extrinsic value can be computed as:

$$\lambda_e(u_{m,k}; \text{new}) = \underset{j:\, j \neq k}{\boxplus} \lambda_t(u_{m,j}) \tag{6}$$
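As a rough illustration of Eq. 6, the sketch below computes the new extrinsic values of one check node by combining all temporary messages except the k-th one. It assumes the usual box-plus rule for two LLRs, a ⊞ b = log((1 + e^(a+b)) / (e^a + e^b)), and it assumes that each temporary message is the soft input minus the old extrinsic value. The function names `boxplus` and `extrinsic_update` are illustrative and do not come from the paper.

```python
import math
from functools import reduce


def boxplus(a: float, b: float) -> float:
    """Box-plus of two LLRs, log((1 + e^(a+b)) / (e^a + e^b)).

    Evaluated in a numerically stable sign/minimum form with two
    correction terms.
    """
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    mag = min(abs(a), abs(b))
    mag += math.log1p(math.exp(-(abs(a) + abs(b))))
    mag -= math.log1p(math.exp(-abs(abs(a) - abs(b))))
    return sign * mag


def extrinsic_update(lam_i, lam_e_old):
    """Return the new extrinsic LLRs of one parity check (leave-one-out box-plus)."""
    # Temporary messages: soft input minus this check's old extrinsic value
    # (assumed form of the temporary message).
    lam_t = [li - le for li, le in zip(lam_i, lam_e_old)]
    lam_e_new = []
    for k in range(len(lam_t)):
        others = lam_t[:k] + lam_t[k + 1:]  # every message except the k-th
        lam_e_new.append(reduce(boxplus, others))
    return lam_e_new
```

The direct leave-one-out loop costs O(d²) box-plus operations for a check of degree d; it is shown only because it mirrors the index set j ≠ k in Eq. 6.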
In the decoding process, the SISO decoder computes the extrinsic LLR value at time k as follows:

$$\lambda_e(u_k) = \max^{*}_{u:\,u_k=1}\{\alpha_{k-1}(s_{k-1}) + \gamma_k^e(s_{k-1}, s_k) + \beta_k(s_k)\} - \max^{*}_{u:\,u_k=0}\{\alpha_{k-1}(s_{k-1}) + \gamma_k^e(s_{k-1}, s_k) + \beta_k(s_k)\}. \tag{9}$$

The α and β metrics are computed based on the forward and backward recursions:

$$\alpha_k(s_k) = \max^{*}_{s_{k-1}}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k)\} \tag{10}$$

$$\beta_k(s_k) = \max^{*}_{s_{k+1}}\{\beta_{k+1}(s_{k+1}) + \gamma_k(s_k, s_{k+1})\}, \tag{11}$$

where the branch metric $\gamma_k$ is computed as:

$$\gamma_k = u_k \cdot (\lambda_c(u_k) + \lambda_a(u_k)) + \sum_{i}^{n} p_k^{(i)} \cdot \lambda_c(p_k^{(i)}). \tag{12}$$

The extrinsic branch metric $\gamma_k^e$ in Eq. 9 is computed as:

$$\gamma_k^e = \sum_{i}^{n} p_k^{(i)} \cdot \lambda_c(p_k^{(i)}). \tag{13}$$

3.3.2 Modified Turbo Decoder Structure Using Flex-SISO Modules

In order to use the proposed Flex-SISO module for Turbo decoding, we modify the traditional Turbo decoder structure. Figure 8 shows the modified Turbo decoder structure based on the Flex-SISO modules. It should be noted that the modified Turbo decoding flow is mathematically equivalent to the original Turbo decoding flow, but uses a different message passing method. The modified data flow is as follows. In the first half iteration, Flex-SISO decoder 1 receives soft LLR value $\lambda_i^1(u_k)$ from Flex-SISO decoder 2 through de-interleaving ($\lambda_i^1(u_k)$ is initialized to channel value $\lambda_c(u_k)$ prior to decoding). Then it removes the old extrinsic value $\lambda_e^1(u_k; \text{old})$ from the soft input LLR $\lambda_i^1(u_k)$ to form a temporary message $\lambda_t^1(u_k)$ as follows (for brevity, we drop the superscript “1” in the following equations)

$$\lambda_t(u_k) = \lambda_i(u_k) - \lambda_e(u_k; \text{old}). \tag{16}$$

To relate to the traditional Turbo decoder structure, this temporary message is mathematically equal to the sum of the channel value $\lambda_c(u_k)$ and the a priori value $\lambda_a(u_k)$ in Fig. 7:
Figure 9 Turbo decoder architecture based on the Flex-SISO module.

unit to decode LDPC and Turbo codes with a small additional overhead.

4.1 MAP Functional Unit for Turbo Codes
Figure 12 A forward–backward decoding flow to compute the extrinsic LLRs for single parity check code. (Diagram labels: forward recursion $\alpha_{k+1} = f(\alpha_k, \gamma_k)$ with $\alpha_0 = +\infty$; backward recursion $\beta_k = f(\beta_{k+1}, \gamma_{k+1})$ with $\beta_3 = +\infty$; output $\lambda_k = f(\alpha_k, \beta_k)$.)

Table 1 LUT approximation for $g(x) = \log(1 + e^{-|x|})$.

Range of |x|        g(x)
|x| = 0             0.75
0 < |x| ≤ 0.75      0.5
0.75 < |x| ≤ 2      0.25
|x| > 2             0

random variables $u_0, u_1, \ldots, u_l$ the extrinsic LLR value for bit $u_k$ is computed as:

$$\lambda(u_k) = \underset{\sim\{u_k\}}{\boxplus} \lambda_i(u_i), \tag{21}$$

where the compact notation $\sim\{u_k\}$ represents the set of all the variables with $u_k$ excluded. For brevity, we define a function $f(a, b)$ to represent the operation $\lambda_i(u_1) \boxplus \lambda_i(u_2)$ as follows

$$f(a, b) = \log\frac{1 + e^{a} e^{b}}{e^{a} + e^{b}}, \tag{22}$$

where $a \triangleq \lambda_i(u_1)$ and $b \triangleq \lambda_i(u_2)$. Figure 12 shows a forward–backward decoding flow to implement Eq. 21. The forward (α) and backward (β) recursions are defined as:

$$\alpha_{k+1} = f(\alpha_k, \gamma_k) \tag{23}$$

$$\beta_k = f(\beta_{k+1}, \gamma_{k+1}), \tag{24}$$

where $\gamma_k = \lambda_i(u_k)$ and is referred to as the branch metric as an analogy to a Turbo decoder. The α and β metrics are initialized to +∞ in the beginning. Based on the α and β metrics, the extrinsic LLR for $u_k$ is computed as:

$$\lambda(u_k) = f(\alpha_k, \beta_k). \tag{25}$$

Figure 13 shows a MAP processor structure to decode the single parity check code. Three identical $f(a, b)$ units are used to compute α, β, and λ values. To relate to the top level LDPC decoder architecture as shown in Fig. 6, the inputs to this MAP processor are the temporary metrics $\lambda_t(u_{m,k})$, and the outputs from this MAP processor are the extrinsic metrics $\lambda_e(u_{m,k}; \text{new})$.

To compute Eq. 22 in hardware, we separate the operation into sign and magnitude calculations:

$$\operatorname{sign}(f(a, b)) = \operatorname{sign}(a)\,\operatorname{sign}(b),$$
$$|f(a, b)| = \min(|a|, |b|) + \log\left(1 + e^{-(|a|+|b|)}\right) - \log\left(1 + e^{-\left||a|-|b|\right|}\right). \tag{26}$$

Compared to the classical “tanh” function used in LDPC decoding, $\Psi(x) = -\log(\tanh(|x/2|))$, the $f(\cdot)$ function is numerically more robust and less sensitive to quantization noise. Due to its wide dynamic range (up to +∞), the $\Psi(x)$ function has a high complexity and is prone to quantization noise. Although many approximations have been proposed to improve the numerical accuracy of $\Psi(x)$ [26, 29, 48], it is still expensive to implement the $\Psi(x)$ function in hardware. However, the non-linear term in the $f(\cdot)$ function has a very small dynamic range:

$$0 < g(x) \triangleq \log(1 + e^{-|x|}) < 0.7,$$

thus the $f(\cdot)$ function is easier to implement in hardware by using a low complexity look-up table (LUT). To implement $g(x)$ in hardware, we propose to use a four-value LUT approximation which is shown in Table 1. For fixed point implementation, we propose to use Q.2 quantization scheme (Q total bits with 2 fractional bits). Table 2 shows the proposed LUT implementation for Q.2 quantization. It should be noted that $g(x)$ is the same as the non-linear term in the Turbo max∗(·) function (c.f. Eq. 14). Thus, the same look-up table configuration can be applied to the Turbo ACSA unit. In Section 4.4, we will show the decoding performance by using this look-up table.
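To make the sign/magnitude split of Eq. 26 and the four-value LUT of Table 1 concrete, the following floating-point sketch evaluates f(a, b) with the LUT in place of the exact correction terms, and reuses the same LUT for a max∗ operation. The form max∗(a, b) = max(a, b) + log(1 + e^(−|a−b|)) is the standard Jacobian-logarithm definition and is assumed here, since Eq. 14 is not reproduced in this section. The function names and the clamping of the magnitude at zero are illustrative choices, not taken from the paper, and the Q.2 fixed-point quantization is not modeled.

```python
import math


def g_lut(x: float) -> float:
    """Four-value approximation of g(x) = log(1 + e^-|x|), following Table 1."""
    ax = abs(x)
    if ax == 0:
        return 0.75
    if ax <= 0.75:
        return 0.5
    if ax <= 2:
        return 0.25
    return 0.0


def f_exact(a: float, b: float) -> float:
    """Eq. 22: log((1 + e^a e^b) / (e^a + e^b))."""
    return math.log((1 + math.exp(a + b)) / (math.exp(a) + math.exp(b)))


def f_lut(a: float, b: float) -> float:
    """Eq. 26 with both non-linear terms replaced by the LUT."""
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    mag = min(abs(a), abs(b)) + g_lut(abs(a) + abs(b)) - g_lut(abs(a) - abs(b))
    # Clamp at zero (an assumption): the coarse LUT can push the
    # magnitude slightly below zero for small inputs.
    return sign * max(mag, 0.0)


def max_star_lut(a: float, b: float) -> float:
    """max*(a, b) = max(a, b) + g(a - b), with g taken from the same LUT."""
    return max(a, b) + g_lut(a - b)
```

For example, `f_lut(3.0, -1.0)` evaluates to −0.75, which stays within about 0.15 of `f_exact(3.0, -1.0)`.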
Figure 14 depicts a circuit implementation for the LDPC $|f(a, b)|$ functional unit using two look-up tables “LUT-S” and “LUT-U”, where LUT-S and LUT-U implement $\log(1 + e^{-\left||a|-|b|\right|})$ and $\log(1 + e^{-(|a|+|b|)})$,

Figure 14 Circuit diagram for the LDPC $|f(a, b)|$ functional unit.
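The forward–backward flow of Eqs. 23–25 for a single parity check code can be sketched in a few lines. This is a floating-point illustration only: it uses the exact correction terms of Eq. 26 rather than the look-up tables, and the names `f` and `spc_extrinsic` are chosen here for the example.

```python
import math


def f(a: float, b: float) -> float:
    """Box-plus of two LLRs (Eq. 22), evaluated in sign/magnitude form (Eq. 26)."""
    if a == math.inf:
        return b  # f(+inf, b) = b, so +inf acts as the neutral initial value
    if b == math.inf:
        return a
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    mag = min(abs(a), abs(b))
    mag += math.log1p(math.exp(-(abs(a) + abs(b))))
    mag -= math.log1p(math.exp(-abs(abs(a) - abs(b))))
    return sign * mag


def spc_extrinsic(gammas):
    """Forward-backward extrinsic LLRs for one single parity check code."""
    n = len(gammas)
    alpha = [math.inf] * n  # alpha[0] stays at +inf
    beta = [math.inf] * n   # beta[n-1] stays at +inf
    for k in range(n - 1):
        alpha[k + 1] = f(alpha[k], gammas[k])        # Eq. 23
    for k in range(n - 2, -1, -1):
        beta[k] = f(beta[k + 1], gammas[k + 1])      # Eq. 24
    return [f(alpha[k], beta[k]) for k in range(n)]  # Eq. 25
```

Because α_k collects the inputs before position k and β_k those after it, each output excludes γ_k, and the whole check is processed with about 3n calls to f instead of the n(n − 2) needed by a direct leave-one-out evaluation.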
Figure 16 Simulation results for a rate 1/2, length 2304 WiMAX LDPC code.

Figure 17 Comparison of the convergence speed.

Figure 18 Simulation results for 3GPP-LTE Turbo codes with a variety of block sizes. (Plot legend: floating point and fixed point curves for N = 6144, 1024, 240, and 40; horizontal axis Eb/N0 [dB].)

decoder can be reconfigured to process: i) an eight-state convolutional Turbo code, or ii) 8 single parity check codes.

5.1 Turbo Mode

In the Turbo mode, all the elements in the Flex-SISO decoder will be activated. For Turbo decoding, we use the Next Iteration Initialization (NII) sliding window algorithm as suggested in [1, 19]. The NII approach can avoid the calculation of training sequences as initialization values for the β state metrics, instead the boundary metrics are initialized from the previous iteration. As a result, the decoding latency is smaller than the traditional sliding window algorithm which requires a calculation of training sequences [25, 43], and thus only one β unit is required. Moreover, this solution is very suitable for high code-rate Turbo codes, which require a very long training sequence to obtain reliable boundary state metrics. Note that this scheme would require an additional memory to store the boundary state metrics.

[Block diagram of the Flex-SISO decoder: BMC unit (γ), Alpha unit (α), Beta unit (β), Extrinsic-1 and Extrinsic-2 units, γ stack, α stack, dispatchers feeding FFU 1 to FFU 8, NII initialization.]

A dataflow graph for NII sliding window algorithm is depicted in Fig. 20, where the X-axis represents the trellis flow and the Y-axis represents the decoding time so that a box may represent the processing of a block of L data in L time steps, where L is the sliding window size. In the decoding process, the α metrics are computed in the natural order whereas the β metrics and the extrinsic LLR ($\lambda_e$) are computed in the reverse order. By using multiple FFUs, the α and β units are able to compute the state metrics in parallel, leading to a real time decoding with a latency of L.

The decoder works as follows. The decoder uses soft LLR value $\lambda_i(u)$ and old extrinsic value $\lambda_e(u; \text{old})$ to compute $\lambda_t(u)$ based on Eq. 16. A branch metric calculation (BMC) unit is used to compute the branch metrics $\gamma(u, p)$ based on Eq. 18, where $u, p \in \{0, 1\}$. Then the branch metrics are buffered in a γ stack for backward (β) metric calculation. The α and β metrics are computed using Eqs. 10 and 11. The boundary β metrics are initialized from an NII buffer (not shown in Fig. 19). A dispatcher unit is used to dispatch the data to the correct FFUs in the α/β unit. Each α/β unit has fully-parallel FFUs (eight of them), so the eight-state convolutional trellis can be processed at a rate of one-stage per clock cycle.

To compute the extrinsic LLR as defined in Eq. 9, we first add β metrics with the extrinsic branch metrics $\gamma^e(p)$, where $\gamma^e(p)$ is retrieved from the γ stack, as $\gamma^e(0) = 0$, $\gamma^e(1) = \gamma(0, 1) = \lambda_c(p)$. The extrinsic LLR calculation is separated into two phases which is shown
[Block diagram: Flex-SISO decoder (LDPC mode), with BMC unit (γ), Alpha unit (α), Beta unit (β), Extrinsic-1 unit, γ stack, α stack, dispatchers feeding FFU 1 to FFU 8, and select = 1.]
Table 6 Performance of the proposed parallel decoder (3.2 mm² core area, 500 MHz clock frequency, TSMC 90 nm technology).

Supported codes   Code size (bit)   Parallelism        Quantization   Max. iteration   Max. throughput (Mbps)   Latency
LDPC 802.16e      576–2,304         z = 24–96          6.2            15               600                      1,590 cycles
LDPC 802.11n      648–1,944         z = 27–81          6.2            15               500                      1,620 cycles
Turbo 3GPP-LTE    40–6,144          Sub-block = 1–12   6.2            6                450                      6,822 cycles
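The throughput and latency columns of Table 6 can be cross-checked with a short calculation. The sketch below assumes that throughput counts information bits per decoded block at the largest code size, and that the LDPC figures correspond to a rate-5/6 code; neither assumption is stated in the table, and they are chosen here only because they reproduce the listed values. The helper name `throughput_mbps` is illustrative.

```python
CLOCK_HZ = 500e6  # clock frequency stated in the Table 6 caption


def throughput_mbps(info_bits, latency_cycles, clock_hz=CLOCK_HZ):
    """Information bits delivered per second, in Mbps, for one block per latency."""
    return info_bits * clock_hz / latency_cycles / 1e6


# (label, assumed information bits per block, latency in cycles from Table 6)
cases = [
    ("LDPC 802.16e", 2304 * 5 / 6, 1590),  # rate 5/6 is an assumption
    ("LDPC 802.11n", 1944 * 5 / 6, 1620),  # rate 5/6 is an assumption
    ("Turbo 3GPP-LTE", 6144, 6822),
]

results = {label: throughput_mbps(bits, cycles) for label, bits, cycles in cases}
for label, mbps in results.items():
    print(f"{label}: {mbps:.1f} Mbps")
```

Under these assumptions the three results land close to the 600, 500, and 450 Mbps listed in the table.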
6 Parallel Decoder Architecture Using Multiple Flex-SISO Decoder Cores

For high throughput applications, it is necessary to use multiple SISO decoders working in parallel to increase the decoding speed. For parallel Turbo decoding, multiple SISO decoders can be employed by dividing a codeword block into several sub-blocks, each of which is processed separately by a dedicated SISO decoder [7, 20, 30, 41, 42]. For LDPC decoding, the decoder parallelism can be achieved by employing multiple check node processors [10, 14, 32, 40, 49].

Based on the Flex-SISO decoder core, we propose a parallel LDPC/Turbo decoder architecture which is shown in Fig. 22. As depicted, the parallel decoder comprises P Flex-SISO decoder cores. In this architecture, there are three types of storage. Extrinsic memory (Ext-Mem) is used for storing the extrinsic LLR values produced by each SISO core. APP memory (APP-Mem) is used to store the initial and updated LLR values. The APP memory is partitioned into multiple banks to allow parallel data transfer. Turbo parity memory is used to store the channel LLR values for each parity bit in a Turbo codeword. This memory is not used for LDPC decoding (parity bits are treated as information bits for LDPC decoding). Two permuters are used to perform the permutation of the APP values back and forth.

As a case study, we have designed a high-throughput, flexible LDPC/Turbo decoder to support the following three codes: 1) 802.16e WiMAX LDPC code, 2) 802.11n WLAN LDPC code, and 3) 3GPP-LTE Turbo code. Table 6 summarizes the performance and design parameters for this decoder. The number of the Flex-SISO decoders is chosen to be 12.

For LDPC decoding, with 12 available Flex-SISO cores the decoder can process up to 12 × 8 = 96 check nodes simultaneously. Because the sub-matrix size z is between 24 and 96 for 802.16e LDPC codes, and between 27 and 81 for 802.11n, the proposed decoder always guarantees that all of the z check nodes within a layer can be processed in parallel.

For 3GPP-LTE Turbo decoding, the codeword can be partitioned into M sub-blocks for parallel processing. LTE Turbo code uses a quadratic permutation polynomial (QPP) interleaver [36] so that it allows conflict free memory access as long as M is a factor of the codeword length. There are 188 different codeword sizes defined in LTE. For LTE Turbo codes, all of the codewords can support a parallelism level of 8, and some of the codewords can support a parallelism level of 10 or 12. Because we have 12 Flex-SISO cores available, we will dynamically allocate the maximum possible number of Flex-SISO cores (8 ≤ M ≤ 12) subject to the QPP interleaver parallelism constraint. As an example, for the maximum codeword size of 6144, we can allocate all of the 12 Flex-SISO cores to work in parallel. It should be noted that the parallelism level has some impact on the error performance of the decoder due to the edge effects caused by the sub-block partitioning [17].

This parallel and flexible decoder has been implemented in Verilog HDL and synthesized on a TSMC 90 nm CMOS technology using Synopsys Design Compiler. The maximum clock frequency of this decoder is 500 MHz. The synthesized core area is 3.2 mm², which includes all of the components in this decoder. Table 6 summarizes the features of this decoder. The decoder can be configured to support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Compared to a dedicated LDPC
decoder solution [37], this flexible decoder has only about 15–20% area overhead when normalized to the same throughput target (with the same number of iterations). Compared to a dedicated Turbo decoder solution [30], our flexible decoder shows only about 10–20% area overhead when normalized to the same technology and the same throughput and code length.

7 Related Work and Architecture Comparison

Multi-mode Turbo decoders are an increasingly important component in mobile wireless devices. To support multi-mode decoding, the ASIC/ASIP/MPSoC/SIMD architectures have been recently proposed [2, 28, 34]. In [2], a reconfigurable application-specific instruction-set processor (ASIP) architecture is presented for convolutional, Turbo, and LDPC code decoding. In [34], a multi processor system on chip (MPSoC) architecture is described for LDPC and Turbo code decoding. In [28], a SIMD-like processor architecture is proposed for Viterbi, Turbo, Reed-Solomon, and LDPC decoding. Table 7 shows the architecture comparison and tradeoff analysis of these decoders. Each approach has a different benefit in terms of flexibility. Our focus is to achieve the highest throughput for both LDPC and Turbo codes. As can be seen from the table, the proposed decoder can support very high throughput LDPC/Turbo decoding at a small silicon area cost.

8 Conclusion

In this work, we present a flexible decoder architecture to support LDPC and Turbo codes. We propose a dual-mode Flex-SISO decoder as a basic building block in LDPC and Turbo decoders. Our study has been focused on the Flex-SISO decoder architecture design and implementation. We unify the decoding process for LDPC and Turbo codes so that the same Flex-SISO decoder can be re-used for both cases resulting in more than 80% resource sharing. To increase decoding throughput, we propose a parallel LDPC/Turbo decoder using multiple Flex-SISO cores. With a core area of 3.2 mm², the decoder is able to sustain 600 Mbps 802.16e LDPC decoding, 500 Mbps 802.11n LDPC decoding, or 450 Mbps 3GPP LTE Turbo decoding. The proposed architecture can significantly reduce the cost of a multi-mode receiver.

Acknowledgements The authors would like to thank Nokia, Nokia Siemens Networks (NSN), Xilinx, Texas Instruments (TI), and US National Science Foundation (under grants CCF-0541363, CNS-0551692, CNS-0619767, CNS-0923479, and EECS-0925942) for their support of the research.

References

1. Abbasfar, A., & Yao, K. (2003). An efficient and practical architecture for high speed turbo decoders. IEEE Vehicular Technology Conference, 1, 337–341.
2. Alles, M., Vogt, T., & Wehn, N. (2008). FlexiChaP: A reconfigurable ASIP for convolutional, turbo, and LDPC code decoding. In 2008 5th International symposium on turbo codes and related topics (pp. 84–89).
3. Bahl, L., Cocke, J., Jelinek, F., & Raviv, J. (1974). Optimal decoding of linear codes for minimizing symbol error rate. IEEE Transactions on Information Theory IT-20, 284–287.
4. Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo-codes. In IEEE Int. conf. commun. (pp. 1064–1070).
5. Bickerstaff, M., Davis, L., Thomas, C., Garrett, D., & Nicol, C. (2003). A 24Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile wireless. In IEEE Int. solid-state circuit conf. (ISSCC).
6. Blanksby, A. J., & Howland, C. J. (2002). A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE Journal of Solid-State Circuits, 37, 404–412.
7. Bougard, B., Giulietti, A., Derudder, V., Weijers, J. W., Dupont, S., Hollevoet, L., Catthoor, F., et al. (2003). A scalable 8.7-nJ/bit 75.6-Mb/s parallel concatenated convolutional (turbo-) codec. In IEEE International solid-state circuit conference (ISSCC).
8. Bougard, B., Giulietti, A., Van der Perre, L., & Catthoor, F. (2002). A class of power efficient VLSI architectures for high speed turbo-decoding. In IEEE conf. global telecommunications (Vol. 1, pp. 549–553).
9. Brack, T., Alles, M., Kienle, F., & Wehn, N. (2006). A synthesizable IP core for WIMAX 802.16e LDPC code decoding. In IEEE 17th Int. symp. personal, indoor and mobile radio communications (pp. 1–5).
10. Brack, T., Alles, M., Lehnigk-Emden, T., Kienle, F., Wehn, N., L’Insalata, N., et al. (2007). Low complexity LDPC code decoders for next generation standards. In Design, automation, and test in Europe (pp. 331–336). New York: ACM.
11. Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M., & Hu, X. (2005). Reduced-complexity decoding of LDPC codes. IEEE Transactions on Communications, 53, 1288–1299.
12. Dai, Y., Yan, Z., & Chen, N. (2006). High-throughput turbo-sum-product decoding of QC LDPC codes. In 40th Annual conf. on info. sciences and syst. (Vol. 11, pp. 839– 8446).
13. Gallager, R. (1963). Low-density parity-check codes. Cambridge: MIT.
14. Gunnam, K. K., Choi, G. S., Yeary, M. B., & Atiquzzaman, M. (2007). VLSI architectures for layered decoding for irregular LDPC codes of WiMax. In IEEE International Conference on Communications (ICC) (pp. 4542–4547).
15. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of binary block and convolutional codes. IEEE Transactions on Information Theory, 42(2), 429–445.
16. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of binary block and convolutional codes. IEEE Transactions on Information Theory, 42, 429–445.
17. He, Z., Fortier, P., & Roy, S. (2006). Highly-parallel decoding architectures for convolutional turbo codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(10), 1147–1151.
18. Hocevar, D. (2004). A reduced complexity decoder architecture via layered decoding of LDPC codes. In IEEE workshop on signal processing systems (SIPS) (pp. 107–112).
19. Dielissen, J., & Huisken, J. (2000). State vector reduction for initialization of sliding windows MAP. In 2nd International symposium on turbo codes and related topics.
20. Lee, S. J., Shanbhag, N., & Singer, A. (2005). Area-efficient high-throughput MAP decoder architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13, 921–933.
21. Lin, Y., Mahlke, S., Mudge, T., & Chakrabarti, C. (2006). Design and implementation of turbo decoders for software defined radio. In IEEE SIPS (pp. 22–27).
22. Lu, J., & Moura, J. (2003). Turbo like decoding of LDPC codes. In IEEE Int. conf. on magnetics (pp. DT-11).
23. MacKay, D. J. C. (1998). Turbo codes are low density parity check codes. Available online, http://www.inference.phy.cam.ac.uk/mackay/turbo-ldpc.pdf.
24. Mansour, M. M., & Shanbhag, N. R. (2003). High-throughput LDPC decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11, 976–996.
25. Masera, G., Piccinini, G., Roch, M., & Zamboni, M. (1999). VLSI architecture for turbo codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7, 369–3797.
26. Masera, G., Quaglio, F., & Vacca, F. (2005). Finite precision implementation of LDPC decoders. In IEEE proc. commun. (Vol. 152, pp. 1098–1102).
27. Mohsenin, T., Truong, D., & Baas, B. (2009). Multi-split-row threshold decoding implementations for LDPC codes. In IEEE International symposium on circuits and systems (ISCAS’09) (pp. 2449–2452).
28. Niktash, A., Parizi, H., Kamalizad, A., & Bagherzadeh, N. (2008). RECFEC: A reconfigurable FEC processor for Viterbi, turbo, Reed-Solomon and LDPC coding. In IEEE Wireless communications and networking conference (WCNC) (pp. 605–610).
29. Oh, D., & Parhi, K. (2006). Low complexity implementations of sum-product algorithm for decoding low-density parity-check codes. In IEEE Workshop on signal processing systems (SIPS) (pp. 262–267).
30. Prescher, G., Gemmeke, T., & Noll, T. (2005). A parametrizable low-power high-throughput turbo-decoder. In IEEE Int. conf. acoustics, speech, and signal processing (Vol. 5, pp. 25–28).
31. Robertson, P., Villebrun, E., & Hoeher, P. (1995). A comparison of optimal and sub-optimal MAP decoding algorithm operating in the log domain. In IEEE Int. conf. commun.
A NoC case study. In IEEE 10th International symposium on spread spectrum techniques and applications (pp. 671–676).
35. Shih, X. Y., Zhan, C. Z., Lin, C. H., & Wu, A. Y. (2008). An 8.29 mm² 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 μm CMOS process. IEEE Journal of Solid-State Circuits, 43, 672–683.
36. Sun, J., & Takeshita, O. (2005). Interleavers for turbo codes using permutation polynomials over integer rings. IEEE Transactions on Information Theory, 51, 101–119.
37. Sun, Y., & Cavallaro, J. R. (2008). A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards. In IEEE International SOC conference (pp. 367–370).
38. Sun, Y., & Cavallaro, J. R. (2008). Unified decoder architecture For LDPC/Turbo codes. In IEEE Workshop on Signal Processing Systems (SIPS) (pp. 13–18).
39. Sun, Y., Karkooti, M., & Cavallaro, J. R. (2006). High throughput, parallel, scalable LDPC encoder/decoder architecture for OFDM systems. In IEEE workshop on design, applications, integration and software (pp. 39–42).
40. Sun, Y., Karkooti, M., & Cavallaro, J. R. (2007). VLSI decoder architecture for high throughput, variable block-size and multi-rate LDPC codes. In IEEE International symposium on circuits and systems (ISCAS) (pp. 2104–2107).
41. Sun, Y., Zhu, Y., Goel, M., & Cavallaro, J. R. (2008). Configurable and scalable high throughput turbo decoder architecture for multiple 4G wireless standards. In IEEE International conference on application-specific systems, architectures and processors (ASAP) (pp. 209–214).
42. Thul, M. J., Gilbert, F., Vogt, T., Kreiselmaier, G., & Wehn, N. (2005). A scalable system architecture for high-throughput turbo-decoders. Journal of VLSI Signal Processing, 39, 63–77.
43. Viterbi, A. (1998). An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes. IEEE Journal on Selected Areas in Communications, 16, 260–264.
44. Wang, Z., Chi, Z., & Parhi, K. (2002). Area-efficient high-speed decoding schemes for turbo decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10, 902–912.
45. Wang, Z., & Cui, Z. (2007). Low-complexity high-speed decoder design for quasi-cyclic LDPC codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15, 104–114.
46. Zhang, J., & Fossorier, M. (2002). Shuffled belief propagation decoding. In Asilomar Conference on signals, systems and computers (Vol. 1, pp. 8–15).
47. Zhang, K., Huang, X., & Wang, Z. (2009). High-throughput layered decoder implementation for quasi-cyclic LDPC codes. IEEE Journal on Selected Areas in Communications, 27(6), 985–994.
48. Zhang, T., Wang, Z., & Parhi, K. (2001). On finite precision implementation of low density parity check codes decoder. In IEEE Int. symposium on circuits and systems (ISCAS)
(ICC) (pp. 1009–1013). (Vol. 4, pp. 202–205).
32. Rovini, M., Gentile, G., Rossi, F., & Fanucci, L. (2007). A 49. Zhong, H., & Zhang, T. (2005). Block-LDPC: A practical
scalable decoder architecture for IEEE 802.11n LDPC codes. LDPC coding system design approach. IEEE Transactions
In IEEE global telecommunications conference (pp. 3270– on Circuits and Systems I: Fundamental Theory and Applica-
3274). tions, 52(4), 766–775 (see also IEEE Transactions on Circuits
33. Salmela, P., Sorokin, H., & Takala, J. (2008). A pro- and Systems I: Regular Papers).
grammable Max-Log-MAP turbo decoder implementation. 50. Zhu, Y., & Chakrabarti, C. (2009). Architecture-aware
Hindawi VLSI Design, 2008, 636–640. LDPC code design for multiprocessor software defined radio
34. Scarpellino, M., Singh, A., Boutillon, E., & Masera, G. (2008). systems. In IEEE transactions on signal processing (Vol. 57,
Reconfigurable architecture for LDPC and turbo decoding: pp. 3679–3692).
J Sign Process Syst (2011) 64:1–16
Yang Sun received the B.S. degree in Testing Technology & Instrumentation in 2000 and the M.S. degree in Instrument Science & Technology in 2003, from Zhejiang University, Hangzhou, China. From 2003 to 2004, he was with S3 Graphics Co. Ltd. as an ASIC design engineer, developing Graphics Processing Unit (GPU) cores for graphics chipsets. From 2004 to 2005, he was with Conexant Systems Inc. as an ASIC design engineer, developing video decoder cores for set-top box (STB) chipsets. During the summers of 2007 and 2008, he worked at the Texas Instruments R&D center as an intern, developing LDPC and Turbo error-correcting decoders.

He is currently a PhD student in the Department of Electrical and Computer Engineering at Rice University, Houston, Texas. His research interests include parallel algorithms and VLSI architectures for wireless communication systems. He received the 2008 IEEE SoC Conference Best Paper Award, the 2008 IEEE Workshop on Signal Processing Systems Bob Owens Memorial Paper Award, and the 2009 ACM GLSVLSI Best Student Paper Award.

Joseph R. Cavallaro received the B.S. degree from the University of Pennsylvania, Philadelphia, PA, in 1981, the M.S. degree from Princeton University, Princeton, NJ, in 1982, and the Ph.D. degree from Cornell University, Ithaca, NY, in 1988, all in electrical engineering. From 1981 to 1983, he was with AT&T Bell Laboratories, Holmdel, NJ. In 1988, he joined the faculty of Rice University, Houston, TX, where he is currently a Professor of electrical and computer engineering. His research interests include computer arithmetic, VLSI design and microlithography, and DSP and VLSI architectures for applications in wireless communications. During the 1996–1997 academic year, he served at the National Science Foundation as Director of the Prototyping Tools and Methodology Program. He was a Nokia Foundation Fellow and a Visiting Professor at the University of Oulu, Finland in 2005 and continues his affiliation there as an Adjunct Professor. He is currently the Associate Director of the Center for Multimedia Communication at Rice University. He is a Senior Member of the IEEE. He was Co-chair of the 2004 Signal Processing for Communications Symposium at the IEEE Global Communications Conference and General Co-chair of the 2004 IEEE 15th International Conference on Application-Specific Systems, Architectures and Processors (ASAP).