Finite Wordlength Effects in Discrete-Time
Systems
V. Rajbabu
[email protected]
DSP Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay
24 Jan 2024
1 / 35
Outline
Number Representations
Coefficient Quantization
Handling Quantization Effects
Today’s Lab
2 / 35
Finite Wordlength Effects
Compared to an analog implementation, discrete
implementation losses precision due to quantization
• A/D and D/A conversions
• Finite word-length arithmetic
Sources of error
• Quantization in A/D conversion - signal-to-noise ratio
(SNR) increases by 6 dB for each additional bit
• Rounding of multiple products
These errors have to be considered while designing DT
systems
3 / 35
Number Representations
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors
5 / 35
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors
Floating-point representation
• General purpose processors use this representation
x = 2E x̂M eg., in 64-bit, E = 11, 1 sign bit
52 bits for decimal part
E is the exponent, and x̂M ∈ (−1, +1) is the mantissa
5 / 35
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors
Floating-point representation
• General purpose processors use this representation
x = 2E x̂M eg., in 64-bit, E = 11, 1 sign bit
52 bits for decimal part
E is the exponent, and x̂M ∈ (−1, +1) is the mantissa
Floating-point representation
• supports wide range of values (with small number of bits)
• has complex hardware
• is simpler to design
5 / 35
Binary Number Systems
Basic number systems
• Signed magnitude
• One’s complement
• Two’s complement
Two’s complement
• Most commonly used in binary system
• Unique representation for zero
• Simple mathematical operations
• Subtraction performed using addition
6 / 35
Fractional Binary Numbers
• Designer keeps track of radix (decimal) point
Normalized Binary
Normalized M-bit binary numbers are written as:
x(B) = b0 .b1 b2 · · · bM−1
where b0 is the sign bit
Decimal representation
M−1
X
x(10) = −b0 + bi 2−i
i=1
7 / 35
Fractional Binary Numbers
• Designer keeps track of radix (decimal) point
Normalized Binary
Normalized M-bit binary numbers are written as:
x(B) = b0 .b1 b2 · · · bM−1
where b0 is the sign bit
Decimal representation
M−1
X
x(10) = −b0 + bi 2−i
i=1
Unscaled
real number - infinite precision
∞
bi 2−i
P
x(10) = Xm −b0 +
i=1
7 / 35
Number Circle
An easy way to visualize two’s complement representation
0.00 ≡ 0⋅ 2−2 = 0
1.11 ≡ −1⋅ 2−2 = −0.25 0.01 ≡ 1⋅ 2−2 = 0.25
1.10 ≡ −2⋅ 2−2 = −0.5 0.10 ≡ 2⋅ 2−2 = 0.5
1.01 ≡ −3⋅ 2−2 = −0.75 0.11 ≡ 3⋅ 2−2 = 0.75
1.00 ≡ −4⋅ 2−2 = −1
Figure: 3-bit Q2 numbers
8 / 35
Q-Format
Q-format is a formal mechanism to keep track of radix (fixed)
point
Q-Format: Q##
refers to a binary number with ## bits to the right of the radix
point
• Total word length depends on the system
• In DSPs, Q15 is a common format
• A 16-bit number in Q15 has 1 sign bit and 15 fractional bits
s.b0 b1 · · · b14
Alternate form: Q(I.F )
• I indicates number of bits to the left of radix point (for sign
and integer part of a number)
• F bits to the right of the radix point
9 / 35
Q-Format - Example
Convert the following numbers to their signed integer value in
Q15
Q-Format
0.5
−0.5
−1.0
1.0
10 / 35
Q-Format - Example
Convert the following numbers to their signed integer value in
Q15
Q-Format
0.5 = 16384
−0.5 = −16384
−1.0 = −32768
1.0 = out of range
≈ 32767 = 1 − 2−15
10 / 35
Q-Format Conversion
Q12 Number
15 11 0
S
IWL WL − 1 − IWL
−2IWL ≤ Range < 2IWL − 2−12
Let x be a fractional number that needs to be represented as a
B-bit (WL) signed integer, Qf format as xq
• For positive x, xq = round(x · 2f )
• For negative x, xq = −round(|x| · 2f )
11 / 35
Q-Format: Addition
To obtain: C = A + B, with Qc , Qa , Qb
• Require Qa and Qb to be equal
• Let Ma and Mb be size of registers for A and B
• Intermediate values
• Intermediate result size = max(Ma , Mb ) + 1
• Intermediate QI = Qa = Qb
• Final values
• Top Mc bits are used, lowest fractional bits are discarded
Qc = Qa − (Mc − max(Ma , Mb ) − 1)
= Qb − (Mc − max(Ma , Mb ) − 1)
Adding N numbers of length M
Final word length : ???
12 / 35
Q-Format: Addition
To obtain: C = A + B, with Qc , Qa , Qb
• Require Qa and Qb to be equal
• Let Ma and Mb be size of registers for A and B
• Intermediate values
• Intermediate result size = max(Ma , Mb ) + 1
• Intermediate QI = Qa = Qb
• Final values
• Top Mc bits are used, lowest fractional bits are discarded
Qc = Qa − (Mc − max(Ma , Mb ) − 1)
= Qb − (Mc − max(Ma , Mb ) − 1)
Adding N numbers of length M
Final word length : M + ⌈log2 N⌉
12 / 35
Q-Format: Multiplication
To obtain: C = A × B, with Qc , Qa , Qb
• Let Ma and Mb be size of registers for A and B
• Ma and Mb or Qa and Qb need not be equal
• Intermediate values
• Intermediate result size = Ma + Mb
• Intermediate QI = Qa + Qb
• Final values
• Top Mc bits are used and lowest fractional bits are
discarded
• Qc = (Qa + Qb ) − (Ma + Mb − Mc )
13 / 35
Fixed-point Arithmetic
Table: Fixed-point Arithmetic
Floating-point Fixed-point IWL of result
IX > IY IX < IY
X := Y X := (Y ≫ (IX − IY )) X := (Y ≪ (IY − IX )) IX
X +Y X + (Y ≫ (IX − IY )) (X ≫ (IY − IX )) + Y max(IX , IY ) + 1
X ∗Y X ∗Y X ∗Y IX + IY
a
IX , IY - Integer word length (IWL) of X and Y
b
Overflow needs to be avoided for valid results
14 / 35
Coefficient Quantization
Quantization
• Quantization - represents numerical values with finite
number of bits
• Significant in fixed-point representation
Quantization in signal processing
• data - round-off errors
• coefficients/parameters - changes system transfer function
16 / 35
Implementation of DT Filter
17 / 35
Quantization Errors
Possible errors in a quantized (fixed-point) system
• Input quantization
• Coefficient quantization
• Product quantization (round-off error, underflow)
• Overflow
18 / 35
Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
h[n] = [ 13 1
3
1
3]
x[n] = [ 15 2
5
3
5
4
5]
y [n] = [0.0667 0.2 0.4 0.6 0.4667 0.2667] (ideal)
19 / 35
Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
h[n] = [ 13 1
3
1
3]
x[n] = [ 15 2
5
3
5
4
5]
y [n] = [0.0667 0.2 0.4 0.6 0.4667 0.2667] (ideal)
Q-format representation: Q1.3 refers to 4 bits with the MSB
representing sign and three bits representing fractional part
19 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
20 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
20 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
• Say x = 0.2, we have Q{x}bin = 0 010
• Product Q{hbin } × Q{xbin } = 00 000110
20 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
• Say x = 0.2, we have Q{x}bin = 0 010
• Product Q{hbin } × Q{xbin } = 00 000110
• Use nearest rounded value hbin × xbin = 0 001
• Comparing to actual value of 0.2 × 0.3333 = 0.0667 we
have 0.125
20 / 35
Quantization Example
Quantized h[n] - Q(1.3)
h[0] h[1] h[2]
h[n] 0.3333 0.3333 0.3333
Q{h[n]} 0.375 0.375 0.375
Q{h[n]}bin 0.011 0.011 0.011
21 / 35
Quantization Example
Quantized h[n] - Q(1.3)
h[0] h[1] h[2]
h[n] 0.3333 0.3333 0.3333
Q{h[n]} 0.375 0.375 0.375
Q{h[n]}bin 0.011 0.011 0.011
Quantized x[n] - Q(1.3)
x[0] x[1] x[2] x[3]
x[n] 0.2 0.4 0.6 0.8
Q{x[n]} 0.25 0.375 0.625 0.75
Q{x[n]}bin 0.010 0.011 0.101 0.110
21 / 35
Quantization Example
Quantized h[n] - Q(1.3)
h[0] h[1] h[2]
h[n] 0.3333 0.3333 0.3333
Q{h[n]} 0.375 0.375 0.375
Q{h[n]}bin 0.011 0.011 0.011
Quantized x[n] - Q(1.3)
x[0] x[1] x[2] x[3]
x[n] 0.2 0.4 0.6 0.8
Q{x[n]} 0.25 0.375 0.625 0.75
Q{x[n]}bin 0.010 0.011 0.101 0.110
Quantized y[n] - Q(1.3)
y[0] y[1] y[2] y[3] y[4] y[5]
y[n] 0.0667 0.2 0.4 0.6 0.4667 0.2667
Q{y[n]} 0.125 0.25 0.375 0.625 0.5 0.25
Q{y [n]}actual 0.125 0.25 0.5 0.625 0.5 0.25
21 / 35
Coefficient Quantization Effects
In IIR filters the response can be sensitive to coefficient values
• Quantization of filter coefficients causes the roots of the
numerator and denominator polynomials of the z-transform
to move
• Changing the position of these roots in z-plane causes the
frequency response to change and may even cause the
filter to go unstable as poles are moved onto or outside the
unit circle
22 / 35
Coefficient Quantization Severity
Severity of the effect is effected by
• Tightly clustered roots
• Roots close to the unit circle
• Many roots (a long filter)
• Filter structure
• Roots close to the real axis (for DF implementation)
23 / 35
Coefficient Quantization Severity
To reduce severity of these effects
• Use smaller filter sections
• Select filters with less tightly clustered roots
• Use more bits of precision
• Try slight variations on the filter to find which provides the
best response under the quantization constraints
• Scale the coefficients (by choosing the Q#) to
advantageously trade-off significant bits and overflow
probability
24 / 35
Quantization Effects - Filter Sturcuture
• Direct form (I and II) structures - filter and signal
quantization errors can accumulate
25 / 35
Quantization Effects - Filter Sturcuture
• Direct form (I and II) structures - filter and signal
quantization errors can accumulate
Cascaded Second-order Sections
• SOSs require a much smaller range of coefficient values
• Filter coefficient errors in one SOS do not affect other
SOSs
• Scaling may be applied separately to each section to
reduce signal quantization effects
• Sets of poles and zeros may be matched to limit the
dynamic range at the output of each section
• SOSs may be ordered to minimize quantization effects
• Cascades of SOSs require more operations than a DF-II
filter
25 / 35
Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
• Order of adds matter - folded FIR filters add from the
outside in
26 / 35
Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
• Order of adds matter - folded FIR filters add from the
outside in
• FIR filters almost always have their smallest values at the
ends
26 / 35
Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
• Order of adds matter - folded FIR filters add from the
outside in
• FIR filters almost always have their smallest values at the
ends
• For long FIR filters, folded structures improve numerical
accuracy over Direct Form
26 / 35
Handling Quantization Effects
Quantization Effects
Finite word length effects
• Overflow errors
Can be avoided by appropriate scaling
• Round-off errors
Difficult to avoid - requires appropriate fixed-point arithmetic
28 / 35
Round-off Noise
Product of fixed-point numbers
• Product output requires more bits than inputs
• Truncation or rounding of result can lead to errors
• Extended precision registers help in reducing this error
Sum of fixed-point numbers
• Output sum requires one-bit more than inputs
• Truncation or rounding of result can lead to errors
• Not as severe in product
29 / 35
Scaling
Scaling
• Prevents overflow
• Provides a trade-off between SNR and overflow
Scaling in filter design/implementation
• Normalize inputs, coefficients to ±1
• Based on magnitude of frequency response
30 / 35
Approaches to Scaling
Absolute scaling
• Scale assuming worst-case inputs/data
• Guarantees no overflow
• Leads to less accurate results (more quantization error)
Dynamic scaling
• Monitor range of variables and scale if required
• Increases computation
31 / 35
Floating-point to Fixed-point
• Implement and verify floating-point algorithm
• Estimate minimum/maximum ( range) of variables
• Convert floating-point variables to fixed-point
• Decide on scaling, based on architecture (word length)
• Range of variables can help in fixing integer word length
(IWL)
• Replace floating-point arithmetic with fixed-point arithmetic
• Consider available accumulator and register word lengths
32 / 35
FFT Computation: Finite Word-length
• Similar effects (quantization, round-off and overflow) also
affect FFT computation
• Depends on FFT length used
• Use appropriate scaling at each stage of FFT (before a
butterfly computation)
33 / 35
Today’s Lab
Lab Exercises
• Evaluating y = ax + b
• Evaluating y [n] = h[n] ∗ x[n]
35 / 35
Lab Exercises
• Evaluating y = ax + b
• Evaluating y [n] = h[n] ∗ x[n]
• FIR quantization effects
filterDesigner
35 / 35