Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
6 views50 pages

Finite Precision

The document discusses finite wordlength effects in discrete-time systems, focusing on quantization errors during A/D and D/A conversions and finite word-length arithmetic. It covers various number representations, including fixed-point and floating-point formats, and illustrates quantization effects through examples of digital signal processing. Key concepts include coefficient quantization, handling quantization errors, and the implications for system design.

Uploaded by

Aditya Raj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views50 pages

Finite Precision

The document discusses finite wordlength effects in discrete-time systems, focusing on quantization errors during A/D and D/A conversions and finite word-length arithmetic. It covers various number representations, including fixed-point and floating-point formats, and illustrates quantization effects through examples of digital signal processing. Key concepts include coefficient quantization, handling quantization errors, and the implications for system design.

Uploaded by

Aditya Raj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 50

Finite Wordlength Effects in Discrete-Time

Systems

V. Rajbabu
[email protected]
DSP Lab

Department of Electrical Engineering


Indian Institute of Technology Bombay

24 Jan 2024

1 / 35
Outline

Number Representations

Coefficient Quantization

Handling Quantization Effects

Today’s Lab

2 / 35
Finite Wordlength Effects
Compared to an analog implementation, discrete
implementation losses precision due to quantization
• A/D and D/A conversions
• Finite word-length arithmetic
Sources of error
• Quantization in A/D conversion - signal-to-noise ratio
(SNR) increases by 6 dB for each additional bit
• Rounding of multiple products
These errors have to be considered while designing DT
systems

3 / 35
Number Representations
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors

5 / 35
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors
Floating-point representation
• General purpose processors use this representation

x = 2E x̂M eg., in 64-bit, E = 11, 1 sign bit


52 bits for decimal part

E is the exponent, and x̂M ∈ (−1, +1) is the mantissa

5 / 35
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors
Floating-point representation
• General purpose processors use this representation

x = 2E x̂M eg., in 64-bit, E = 11, 1 sign bit


52 bits for decimal part

E is the exponent, and x̂M ∈ (−1, +1) is the mantissa


Floating-point representation
• supports wide range of values (with small number of bits)
• has complex hardware
• is simpler to design
5 / 35
Binary Number Systems
Basic number systems
• Signed magnitude
• One’s complement
• Two’s complement

Two’s complement
• Most commonly used in binary system
• Unique representation for zero
• Simple mathematical operations
• Subtraction performed using addition

6 / 35
Fractional Binary Numbers
• Designer keeps track of radix (decimal) point

Normalized Binary
Normalized M-bit binary numbers are written as:

x(B) = b0 .b1 b2 · · · bM−1

where b0 is the sign bit


Decimal representation
M−1
X
x(10) = −b0 + bi 2−i
i=1

7 / 35
Fractional Binary Numbers
• Designer keeps track of radix (decimal) point

Normalized Binary
Normalized M-bit binary numbers are written as:

x(B) = b0 .b1 b2 · · · bM−1

where b0 is the sign bit


Decimal representation
M−1
X
x(10) = −b0 + bi 2−i
i=1

Unscaled 
real number - infinite precision


bi 2−i
P
x(10) = Xm −b0 +
i=1
7 / 35
Number Circle
An easy way to visualize two’s complement representation
0.00 ≡ 0⋅ 2−2 = 0

1.11 ≡ −1⋅ 2−2 = −0.25 0.01 ≡ 1⋅ 2−2 = 0.25

1.10 ≡ −2⋅ 2−2 = −0.5 0.10 ≡ 2⋅ 2−2 = 0.5

1.01 ≡ −3⋅ 2−2 = −0.75 0.11 ≡ 3⋅ 2−2 = 0.75

1.00 ≡ −4⋅ 2−2 = −1

Figure: 3-bit Q2 numbers

8 / 35
Q-Format
Q-format is a formal mechanism to keep track of radix (fixed)
point
Q-Format: Q##
refers to a binary number with ## bits to the right of the radix
point
• Total word length depends on the system
• In DSPs, Q15 is a common format
• A 16-bit number in Q15 has 1 sign bit and 15 fractional bits

s.b0 b1 · · · b14

Alternate form: Q(I.F )


• I indicates number of bits to the left of radix point (for sign
and integer part of a number)
• F bits to the right of the radix point
9 / 35
Q-Format - Example
Convert the following numbers to their signed integer value in
Q15
Q-Format

0.5
−0.5
−1.0
1.0

10 / 35
Q-Format - Example
Convert the following numbers to their signed integer value in
Q15
Q-Format

0.5 = 16384
−0.5 = −16384
−1.0 = −32768
1.0 = out of range
≈ 32767 = 1 − 2−15

10 / 35
Q-Format Conversion
Q12 Number
15 11 0
S

IWL WL − 1 − IWL
−2IWL ≤ Range < 2IWL − 2−12
Let x be a fractional number that needs to be represented as a
B-bit (WL) signed integer, Qf format as xq
• For positive x, xq = round(x · 2f )
• For negative x, xq = −round(|x| · 2f )

11 / 35
Q-Format: Addition
To obtain: C = A + B, with Qc , Qa , Qb
• Require Qa and Qb to be equal
• Let Ma and Mb be size of registers for A and B
• Intermediate values
• Intermediate result size = max(Ma , Mb ) + 1
• Intermediate QI = Qa = Qb
• Final values
• Top Mc bits are used, lowest fractional bits are discarded

Qc = Qa − (Mc − max(Ma , Mb ) − 1)
= Qb − (Mc − max(Ma , Mb ) − 1)

Adding N numbers of length M


Final word length : ???

12 / 35
Q-Format: Addition
To obtain: C = A + B, with Qc , Qa , Qb
• Require Qa and Qb to be equal
• Let Ma and Mb be size of registers for A and B
• Intermediate values
• Intermediate result size = max(Ma , Mb ) + 1
• Intermediate QI = Qa = Qb
• Final values
• Top Mc bits are used, lowest fractional bits are discarded

Qc = Qa − (Mc − max(Ma , Mb ) − 1)
= Qb − (Mc − max(Ma , Mb ) − 1)

Adding N numbers of length M


Final word length : M + ⌈log2 N⌉

12 / 35
Q-Format: Multiplication
To obtain: C = A × B, with Qc , Qa , Qb
• Let Ma and Mb be size of registers for A and B
• Ma and Mb or Qa and Qb need not be equal
• Intermediate values
• Intermediate result size = Ma + Mb
• Intermediate QI = Qa + Qb
• Final values
• Top Mc bits are used and lowest fractional bits are
discarded
• Qc = (Qa + Qb ) − (Ma + Mb − Mc )

13 / 35
Fixed-point Arithmetic

Table: Fixed-point Arithmetic

Floating-point Fixed-point IWL of result


IX > IY IX < IY
X := Y X := (Y ≫ (IX − IY )) X := (Y ≪ (IY − IX )) IX
X +Y X + (Y ≫ (IX − IY )) (X ≫ (IY − IX )) + Y max(IX , IY ) + 1
X ∗Y X ∗Y X ∗Y IX + IY
a
IX , IY - Integer word length (IWL) of X and Y
b
Overflow needs to be avoided for valid results

14 / 35
Coefficient Quantization
Quantization
• Quantization - represents numerical values with finite
number of bits
• Significant in fixed-point representation

Quantization in signal processing


• data - round-off errors
• coefficients/parameters - changes system transfer function

16 / 35
Implementation of DT Filter

17 / 35
Quantization Errors
Possible errors in a quantized (fixed-point) system
• Input quantization
• Coefficient quantization
• Product quantization (round-off error, underflow)
• Overflow

18 / 35
Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
h[n] = [ 13 1
3
1
3]
x[n] = [ 15 2
5
3
5
4
5]
y [n] = [0.0667 0.2 0.4 0.6 0.4667 0.2667] (ideal)

19 / 35
Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
h[n] = [ 13 1
3
1
3]
x[n] = [ 15 2
5
3
5
4
5]
y [n] = [0.0667 0.2 0.4 0.6 0.4667 0.2667] (ideal)

Q-format representation: Q1.3 refers to 4 bits with the MSB


representing sign and three bits representing fractional part

19 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375

20 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow

20 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
• Say x = 0.2, we have Q{x}bin = 0 010
• Product Q{hbin } × Q{xbin } = 00 000110

20 / 35
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf )
• Word length is 4 bits
• hq = round(h × 2f ) = round(2.66) = 3
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically output of product or sum will have register with more
bits (say in this case 8 bits) to avoid overflow
• Say x = 0.2, we have Q{x}bin = 0 010
• Product Q{hbin } × Q{xbin } = 00 000110
• Use nearest rounded value hbin × xbin = 0 001
• Comparing to actual value of 0.2 × 0.3333 = 0.0667 we
have 0.125

20 / 35
Quantization Example
Quantized h[n] - Q(1.3)
h[0] h[1] h[2]
h[n] 0.3333 0.3333 0.3333
Q{h[n]} 0.375 0.375 0.375
Q{h[n]}bin 0.011 0.011 0.011

21 / 35
Quantization Example
Quantized h[n] - Q(1.3)
h[0] h[1] h[2]
h[n] 0.3333 0.3333 0.3333
Q{h[n]} 0.375 0.375 0.375
Q{h[n]}bin 0.011 0.011 0.011

Quantized x[n] - Q(1.3)


x[0] x[1] x[2] x[3]
x[n] 0.2 0.4 0.6 0.8
Q{x[n]} 0.25 0.375 0.625 0.75
Q{x[n]}bin 0.010 0.011 0.101 0.110

21 / 35
Quantization Example
Quantized h[n] - Q(1.3)
h[0] h[1] h[2]
h[n] 0.3333 0.3333 0.3333
Q{h[n]} 0.375 0.375 0.375
Q{h[n]}bin 0.011 0.011 0.011

Quantized x[n] - Q(1.3)


x[0] x[1] x[2] x[3]
x[n] 0.2 0.4 0.6 0.8
Q{x[n]} 0.25 0.375 0.625 0.75
Q{x[n]}bin 0.010 0.011 0.101 0.110

Quantized y[n] - Q(1.3)


y[0] y[1] y[2] y[3] y[4] y[5]
y[n] 0.0667 0.2 0.4 0.6 0.4667 0.2667
Q{y[n]} 0.125 0.25 0.375 0.625 0.5 0.25
Q{y [n]}actual 0.125 0.25 0.5 0.625 0.5 0.25
21 / 35
Coefficient Quantization Effects
In IIR filters the response can be sensitive to coefficient values
• Quantization of filter coefficients causes the roots of the
numerator and denominator polynomials of the z-transform
to move
• Changing the position of these roots in z-plane causes the
frequency response to change and may even cause the
filter to go unstable as poles are moved onto or outside the
unit circle

22 / 35
Coefficient Quantization Severity
Severity of the effect is effected by
• Tightly clustered roots
• Roots close to the unit circle
• Many roots (a long filter)
• Filter structure
• Roots close to the real axis (for DF implementation)

23 / 35
Coefficient Quantization Severity
To reduce severity of these effects
• Use smaller filter sections
• Select filters with less tightly clustered roots
• Use more bits of precision
• Try slight variations on the filter to find which provides the
best response under the quantization constraints
• Scale the coefficients (by choosing the Q#) to
advantageously trade-off significant bits and overflow
probability

24 / 35
Quantization Effects - Filter Sturcuture
• Direct form (I and II) structures - filter and signal
quantization errors can accumulate

25 / 35
Quantization Effects - Filter Sturcuture
• Direct form (I and II) structures - filter and signal
quantization errors can accumulate

Cascaded Second-order Sections


• SOSs require a much smaller range of coefficient values
• Filter coefficient errors in one SOS do not affect other
SOSs
• Scaling may be applied separately to each section to
reduce signal quantization effects
• Sets of poles and zeros may be matched to limit the
dynamic range at the output of each section
• SOSs may be ordered to minimize quantization effects
• Cascades of SOSs require more operations than a DF-II
filter

25 / 35
Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
• Order of adds matter - folded FIR filters add from the
outside in

26 / 35
Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
• Order of adds matter - folded FIR filters add from the
outside in
• FIR filters almost always have their smallest values at the
ends

26 / 35
Quantization Effects - Filter Sturcuture
Folded FIR Filter Structure
• Order of adds matter - folded FIR filters add from the
outside in
• FIR filters almost always have their smallest values at the
ends
• For long FIR filters, folded structures improve numerical
accuracy over Direct Form

26 / 35
Handling Quantization Effects
Quantization Effects
Finite word length effects
• Overflow errors
Can be avoided by appropriate scaling

• Round-off errors
Difficult to avoid - requires appropriate fixed-point arithmetic

28 / 35
Round-off Noise
Product of fixed-point numbers
• Product output requires more bits than inputs
• Truncation or rounding of result can lead to errors
• Extended precision registers help in reducing this error

Sum of fixed-point numbers


• Output sum requires one-bit more than inputs
• Truncation or rounding of result can lead to errors
• Not as severe in product

29 / 35
Scaling
Scaling
• Prevents overflow
• Provides a trade-off between SNR and overflow

Scaling in filter design/implementation


• Normalize inputs, coefficients to ±1
• Based on magnitude of frequency response

30 / 35
Approaches to Scaling
Absolute scaling
• Scale assuming worst-case inputs/data
• Guarantees no overflow
• Leads to less accurate results (more quantization error)
Dynamic scaling
• Monitor range of variables and scale if required
• Increases computation

31 / 35
Floating-point to Fixed-point
• Implement and verify floating-point algorithm
• Estimate minimum/maximum ( range) of variables

• Convert floating-point variables to fixed-point


• Decide on scaling, based on architecture (word length)
• Range of variables can help in fixing integer word length
(IWL)

• Replace floating-point arithmetic with fixed-point arithmetic

• Consider available accumulator and register word lengths

32 / 35
FFT Computation: Finite Word-length
• Similar effects (quantization, round-off and overflow) also
affect FFT computation

• Depends on FFT length used

• Use appropriate scaling at each stage of FFT (before a


butterfly computation)

33 / 35
Today’s Lab
Lab Exercises
• Evaluating y = ax + b

• Evaluating y [n] = h[n] ∗ x[n]

35 / 35
Lab Exercises
• Evaluating y = ax + b

• Evaluating y [n] = h[n] ∗ x[n]

• FIR quantization effects


filterDesigner

35 / 35

You might also like