Finite Precision

The document discusses finite wordlength effects in discrete-time systems, focusing on quantization errors during A/D and D/A conversions and finite word-length arithmetic. It covers various number representations, including fixed-point and floating-point formats, and illustrates quantization effects through examples of digital signal processing. Key concepts include coefficient quantization, handling quantization errors, and the implications for system design.

Uploaded by

Aditya Raj

Finite Wordlength Effects in Discrete-Time Systems

V. Rajbabu
DSP Lab

Department of Electrical Engineering

Indian Institute of Technology Bombay

24 Jan 2024

Outline

Number Representations

Coefficient Quantization

Handling Quantization Effects

Today’s Lab

Finite Wordlength Effects
Compared to an analog implementation, a discrete
implementation loses precision due to quantization
• A/D and D/A conversions
• Finite word-length arithmetic
Sources of error
• Quantization in A/D conversion - signal-to-noise ratio
(SNR) increases by 6 dB for each additional bit
• Rounding of multiple products
These errors have to be considered while designing DT
systems
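The 6 dB-per-bit rule can be checked numerically. The following is a minimal sketch, assuming a near full-scale sine input and a uniform rounding quantizer over [-1, 1) with saturation; the function name `sqnr_db`, the test frequency, and the sample count are illustrative choices, not taken from the slides.

```python
import math

def sqnr_db(bits, num_samples=4096):
    """Estimate the signal-to-quantization-noise ratio (dB) of a sine wave."""
    step = 2.0 / (1 << bits)          # quantizer step for the range [-1, 1)
    top = 1.0 - step                  # largest representable value
    signal_power = 0.0
    noise_power = 0.0
    for n in range(num_samples):
        x = 0.999 * math.sin(2.0 * math.pi * 37.0 * n / num_samples + 0.1)
        xq = step * math.floor(x / step + 0.5)   # round to nearest level
        xq = max(-1.0, min(top, xq))             # saturate to the valid range
        signal_power += x * x
        noise_power += (x - xq) ** 2
    return 10.0 * math.log10(signal_power / noise_power)

# Average SNR gain per added bit between 8-bit and 12-bit quantization.
gain_per_bit = (sqnr_db(12) - sqnr_db(8)) / 4
```

Under these assumptions `gain_per_bit` should come out close to 6 dB.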

Number Representations
Number Representation
Based on finite word-length numeric representations, digital
signal processors (DSPs) can be
• Fixed-point processors
• Floating-point processors

Floating-point representation
• General purpose processors use this representation

$x = 2^{E}\,\hat{x}_M$

where E is the exponent and $\hat{x}_M \in (-1, +1)$ is the mantissa; e.g., a 64-bit format uses 1 sign bit, 11 bits for E, and 52 bits for the fractional part

Floating-point representation
• supports wide range of values (with small number of bits)
• has complex hardware
• is simpler to design with
Binary Number Systems
Basic number systems
• Signed magnitude
• One’s complement
• Two’s complement

Two’s complement
• Most commonly used in binary system
• Unique representation for zero
• Simple mathematical operations
• Subtraction performed using addition
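To illustrate how subtraction reduces to addition in two's complement, here is a small sketch for an arbitrary word length. The helper names (`to_twos_complement`, `from_twos_complement`, `subtract_via_addition`) are hypothetical and chosen only for this example.

```python
def to_twos_complement(value, bits):
    """Return the unsigned bit pattern that encodes a signed integer."""
    return value & ((1 << bits) - 1)

def from_twos_complement(code, bits):
    """Interpret an unsigned bit pattern as a signed two's complement integer."""
    if code & (1 << (bits - 1)):      # sign bit set: value is negative
        return code - (1 << bits)
    return code

def subtract_via_addition(a, b, bits):
    """Compute a - b as a + (~b + 1), keeping only the low `bits` bits."""
    mask = (1 << bits) - 1
    a_code = to_twos_complement(a, bits)
    b_code = to_twos_complement(b, bits)
    neg_b = (~b_code + 1) & mask      # two's complement negation of b
    return from_twos_complement((a_code + neg_b) & mask, bits)
```

For example, with 4-bit words, `subtract_via_addition(3, 5, 4)` adds 0011 and 1011 to give 1110, which reads back as -2.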

Fractional Binary Numbers
• Designer keeps track of radix (binary) point

Normalized Binary
Normalized M-bit binary numbers are written as:

$x_{(B)} = b_0 . b_1 b_2 \cdots b_{M-1}$

where b0 is the sign bit

Decimal representation

$x_{(10)} = -b_0 + \sum_{i=1}^{M-1} b_i 2^{-i}$

Unscaled real number - infinite precision

$x_{(10)} = X_m \left( -b_0 + \sum_{i=1}^{\infty} b_i 2^{-i} \right)$
Number Circle
An easy way to visualize two’s complement representation
• 0.00 ≡ $0 \cdot 2^{-2}$ = 0
• 0.01 ≡ $1 \cdot 2^{-2}$ = 0.25
• 0.10 ≡ $2 \cdot 2^{-2}$ = 0.5
• 0.11 ≡ $3 \cdot 2^{-2}$ = 0.75
• 1.00 ≡ $-4 \cdot 2^{-2}$ = −1
• 1.01 ≡ $-3 \cdot 2^{-2}$ = −0.75
• 1.10 ≡ $-2 \cdot 2^{-2}$ = −0.5
• 1.11 ≡ $-1 \cdot 2^{-2}$ = −0.25

Figure: 3-bit Q2 numbers
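The values on the number circle follow directly from the rule that the sign bit has weight −1 and each later bit b_i has weight 2^−i. A short sketch that evaluates this rule for every 3-bit code; the function name `fractional_value` and the string input format are assumptions made for illustration.

```python
def fractional_value(code):
    """Evaluate a normalized two's complement code such as '1.10'.

    The first bit is the sign bit (weight -1); bit i after it has weight 2**-i.
    """
    digits = code.replace(".", "")
    value = -float(int(digits[0]))
    for i, bit in enumerate(digits[1:], start=1):
        value += int(bit) * 2.0 ** (-i)
    return value

# All eight 3-bit Q2 codes, in the order they appear around the circle.
codes = ("0.00", "0.01", "0.10", "0.11", "1.00", "1.01", "1.10", "1.11")
number_circle = {code: fractional_value(code) for code in codes}
```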
Q-Format
Q-format is a formal mechanism to keep track of radix (fixed)
point
Q-Format: Q##
refers to a binary number with ## bits to the right of the radix
point
• Total word length depends on the system
• In DSPs, Q15 is a common format
• A 16-bit number in Q15 has 1 sign bit and 15 fractional bits

s.b0 b1 · · · b14

Alternate form: Q(I.F)

• I indicates number of bits to the left of radix point (for sign
and integer part of a number)
• F bits to the right of the radix point
Q-Format - Example
Convert the following numbers to their signed integer value in
Q15

0.5 = 16384
−0.5 = −16384
−1.0 = −32768
1.0 = out of range; the nearest representable value is 32767, i.e. $1 - 2^{-15}$
Q-Format Conversion
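Q12 number in a 16-bit word (WL = 16): bit 15 is the sign bit S, the next IWL bits are integer bits, and the remaining WL − 1 − IWL = 12 bits (bits 11 to 0) are fractional bits
$-2^{IWL} \le \text{Range} \le 2^{IWL} - 2^{-12}$
Let x be a fractional number that needs to be represented as a
B-bit (WL) signed integer, Qf format as xq
• For positive x, $x_q = \mathrm{round}(x \cdot 2^{f})$
• For negative x, $x_q = -\mathrm{round}(|x| \cdot 2^{f})$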
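The conversion rule can be sketched as below. Saturating out-of-range values to the largest or smallest code is an assumption made here (it matches treating 1.0 as 32767 in Q15); the names `to_q` and `from_q` are hypothetical.

```python
import math

def to_q(x, frac_bits, word_length=16):
    """Convert a real number to a signed integer in Q<frac_bits> format.

    Rounds the magnitude to the nearest integer, restores the sign,
    then saturates to the range of a `word_length`-bit signed word.
    """
    magnitude = math.floor(abs(x) * (1 << frac_bits) + 0.5)
    xq = magnitude if x >= 0 else -magnitude
    lowest = -(1 << (word_length - 1))
    highest = (1 << (word_length - 1)) - 1
    return max(lowest, min(highest, xq))

def from_q(xq, frac_bits):
    """Return the real value represented by a Q<frac_bits> integer."""
    return xq / (1 << frac_bits)
```

For instance, `to_q(0.5, 15)` gives 16384 and `to_q(1.0, 15)` saturates to 32767.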
Q-Format: Addition
To obtain: C = A + B, with Qc, Qa, Qb
• Require Qa and Qb to be equal
• Let Ma and Mb be size of registers for A and B
• Intermediate values
• Intermediate result size = max(Ma, Mb) + 1
• Intermediate QI = Qa = Qb
• Final values
• Top Mc bits are used, lowest fractional bits are discarded

$Q_c = Q_a - (\max(M_a, M_b) + 1 - M_c) = Q_b - (\max(M_a, M_b) + 1 - M_c)$

Adding N numbers of length M

Final word length: $M + \lceil \log_2 N \rceil$
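A minimal sketch of the addition rule, assuming both operands share the same Q format and register size and that discarding low bits is done by an arithmetic right shift (truncation). The function names are illustrative only.

```python
import math

def q_add(a, b, q_in, m_in, m_out):
    """Add two Q<q_in> integers held in m_in-bit registers.

    The exact sum needs m_in + 1 bits at the same Q. Keeping only the top
    m_out bits drops (m_in + 1 - m_out) fractional bits, which lowers Q
    by the same amount. Returns (result, q_out).
    """
    total = a + b
    discard = (m_in + 1) - m_out
    if discard > 0:
        return total >> discard, q_in - discard
    return total, q_in

def sum_word_length(m, n):
    """Word length needed to add n numbers of m bits without overflow."""
    return m + math.ceil(math.log2(n))
```

Adding 0.5 + 0.5 in Q15 (16384 + 16384) with a 16-bit result gives 16384 in Q14, which still represents 1.0.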
Q-Format: Multiplication
To obtain: C = A × B, with Qc , Qa , Qb
• Let Ma and Mb be size of registers for A and B
• Ma and Mb or Qa and Qb need not be equal
• Intermediate values
• Intermediate result size = Ma + Mb
• Intermediate QI = Qa + Qb
• Final values
• Top Mc bits are used and lowest fractional bits are
discarded
• Qc = (Qa + Qb ) − (Ma + Mb − Mc )
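The multiplication rule can be sketched the same way. This assumes the output register is no wider than the full product and that low bits are dropped by truncation; `q_mul` is a hypothetical name.

```python
def q_mul(a, b, q_a, q_b, m_a, m_b, m_out):
    """Multiply fixed-point integers a (Q<q_a>, m_a bits) and b (Q<q_b>, m_b bits).

    The full product has m_a + m_b bits in Q(q_a + q_b). Keeping the top
    m_out bits discards (m_a + m_b - m_out) fractional bits.
    Returns (result, q_out).
    """
    product = a * b
    discard = (m_a + m_b) - m_out
    return product >> discard, (q_a + q_b) - discard
```

For two 16-bit Q15 operands equal to 0.5, `q_mul(16384, 16384, 15, 15, 16, 16, 16)` returns 4096 in Q14, i.e. 0.25.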

Fixed-point Arithmetic

Table: Fixed-point Arithmetic

Floating-point | Fixed-point (IX > IY)   | Fixed-point (IX < IY)   | IWL of result
X := Y         | X := (Y ≫ (IX − IY))    | X := (Y ≪ (IY − IX))    | IX
X + Y          | X + (Y ≫ (IX − IY))     | (X ≫ (IY − IX)) + Y     | max(IX, IY) + 1
X ∗ Y          | X ∗ Y                   | X ∗ Y                   | IX + IY

a: IX, IY - Integer word length (IWL) of X and Y
b: Overflow needs to be avoided for valid results
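The addition row of the table can be sketched as follows, assuming both operands are stored in words of the same length, so that an operand with the smaller IWL has more fractional bits and must be shifted right before adding. The name `fx_add` and the returned tuple layout are assumptions for this example.

```python
def fx_add(x, ix, y, iy, word_length):
    """Add two fixed-point integers with integer word lengths ix and iy.

    The operand with the smaller IWL is shifted right so both share the
    format of the larger IWL. Returns (sum, result_iwl, frac_bits), where
    frac_bits is the number of fractional bits of the returned sum.
    """
    if ix > iy:
        y >>= (ix - iy)
    elif iy > ix:
        x >>= (iy - ix)
    frac_bits = word_length - 1 - max(ix, iy)
    return x + y, max(ix, iy) + 1, frac_bits
```

With 8-bit words, X = 1.5 (IWL 2, stored as 48) and Y = 0.25 (IWL 0, stored as 32) add to 56 with 5 fractional bits, i.e. 1.75.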
Coefficient Quantization
Quantization
• Quantization - represents numerical values with finite
number of bits
• Significant in fixed-point representation

Quantization in signal processing

• data - round-off errors
• coefficients/parameters - changes system transfer function

Implementation of DT Filter

Quantization Errors
Possible errors in a quantized (fixed-point) system
• Input quantization
• Coefficient quantization
• Product quantization (round-off error, underflow)
• Overflow

Quantization Example
Suppose you want to implement a filter using fractional
arithmetic (Q1.3)
$h[n] = [\tfrac{1}{3}\ \ \tfrac{1}{3}\ \ \tfrac{1}{3}]$
$x[n] = [\tfrac{1}{5}\ \ \tfrac{2}{5}\ \ \tfrac{3}{5}\ \ \tfrac{4}{5}]$
$y[n] = [0.0667\ \ 0.2\ \ 0.4\ \ 0.6\ \ 0.4667\ \ 0.2667]$ (ideal)

Q-format representation: Q1.3 refers to 4 bits with the MSB representing sign and three bits representing fractional part
Quantization Example
Consider h = 0.3333 represented in Q(1.3) or Q3 (Qf)
• Word length is 4 bits
• $h_q = \mathrm{round}(h \times 2^{f}) = \mathrm{round}(2.67) = 3$
• Q{h}bin = 0 011
• Equivalent decimal value 0.375
Typically the output of a product or sum will have a register with more
bits (say in this case 8 bits) to avoid overflow
• Say x = 0.2, we have Q{x}bin = 0 010
• Product Q{h}bin × Q{x}bin = 00 000110
Quantization Example
Quantized h[n] - Q(1.3)

              h[0]     h[1]     h[2]
h[n]          0.3333   0.3333   0.3333
Q{h[n]}       0.375    0.375    0.375
Q{h[n]}bin    0.011    0.011    0.011

Quantized x[n] - Q(1.3)

              x[0]     x[1]     x[2]     x[3]
x[n]          0.2      0.4      0.6      0.8
Q{x[n]}       0.25     0.375    0.625    0.75
Q{x[n]}bin    0.010    0.011    0.101    0.110

Quantized y[n] - Q(1.3)

                 y[0]     y[1]    y[2]    y[3]    y[4]     y[5]
y[n]             0.0667   0.2     0.4     0.6     0.4667   0.2667
Q{y[n]}          0.125    0.25    0.375   0.625   0.5      0.25
Q{y[n]}actual    0.125    0.25    0.5     0.625   0.5      0.25
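The tables above can be reproduced with integer arithmetic. This sketch assumes round-to-nearest quantization, products accumulated at full precision (Q6), and a single rounding of each accumulated sum back to Q3; that procedure is an inference that happens to match the "actual" row, and the helper names are hypothetical.

```python
import math

def round_half_up(v):
    """Round to the nearest integer, with ties going upward."""
    return math.floor(v + 0.5)

def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

F = 3                                   # fractional bits of Q(1.3)
h = [1 / 3, 1 / 3, 1 / 3]
x = [0.2, 0.4, 0.6, 0.8]

h_q = [round_half_up(c * 2 ** F) for c in h]   # coefficients as Q3 integers
x_q = [round_half_up(s * 2 ** F) for s in x]   # inputs as Q3 integers

acc = convolve(h_q, x_q)                       # exact sums of products, Q6
# Round each Q6 accumulator value back to Q3, then convert to a real number.
y_actual = [round_half_up(v / 2 ** F) / 2 ** F for v in acc]
# Quantizing the ideal (floating-point) output directly, for comparison.
y_ideal_q = [round_half_up(v * 2 ** F) / 2 ** F for v in convolve(h, x)]
```

The two result rows differ only at y[2], where input and coefficient quantization errors add up to move the output by one quantization step.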
Coefficient Quantization Effects
In IIR filters the response can be sensitive to coefficient values
• Quantization of filter coefficients causes the roots of the
numerator and denominator polynomials of the z-transform
to move
• Changing the position of these roots in z-plane causes the
frequency response to change and may even cause the
filter to go unstable as poles are moved onto or outside the
unit circle
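A small numerical sketch of pole movement, assuming a second-order denominator 1 + a1 z^-1 + a2 z^-2 with poles at radius 0.98 and angle π/8, and coefficients rounded to 3 fractional bits. The pole location, the bit count, and the helper names are illustrative assumptions, not values from the slides.

```python
import cmath
import math

def quantize(value, frac_bits):
    """Round a coefficient to the nearest multiple of 2**-frac_bits."""
    scaled = math.floor(abs(value) * (1 << frac_bits) + 0.5)
    return math.copysign(scaled, value) / (1 << frac_bits)

def poles(a1, a2):
    """Roots of z**2 + a1*z + a2, i.e. the poles of the section."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)

r, theta = 0.98, math.pi / 8
a1 = -2.0 * r * math.cos(theta)        # ideal coefficients for poles r*e^(±jθ)
a2 = r * r

ideal_radius = max(abs(p) for p in poles(a1, a2))

a1_q = quantize(a1, 3)                 # becomes -1.75
a2_q = quantize(a2, 3)                 # becomes 1.0
quantized_radius = max(abs(p) for p in poles(a1_q, a2_q))
```

In this case rounding a2 from 0.9604 to 1.0 moves the pole pair from radius 0.98 onto the unit circle, so the quantized section is no longer strictly stable.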

Coefficient Quantization Severity
Severity of the effect is affected by
• Tightly clustered roots
• Roots close to the unit circle
• Many roots (a long filter)
• Filter structure
• Roots close to the real axis (for direct-form implementation)

To reduce severity of these effects
• Use smaller filter sections
• Select filters with less tightly clustered roots
• Use more bits of precision
• Try slight variations on the filter to find which provides the
best response under the quantization constraints
• Scale the coefficients (by choosing the Q#) to trade off
significant bits against overflow probability

Quantization Effects - Filter Structure
• Direct form (I and II) structures - filter and signal
quantization errors can accumulate

Cascaded Second-order Sections

• SOSs require a much smaller range of coefficient values
• Filter coefficient errors in one SOS do not affect other
SOSs
• Scaling may be applied separately to each section to
reduce signal quantization effects
• Sets of poles and zeros may be matched to limit the
dynamic range at the output of each section
• SOSs may be ordered to minimize quantization effects
• Cascades of SOSs require more operations than a DF-II
filter

Quantization Effects - Filter Structure
Folded FIR Filter Structure
• Order of additions matters - folded FIR filters add from the
outside in
• FIR filters almost always have their smallest values at the
ends
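A minimal sketch of a folded (symmetric) FIR computation, assuming a linear-phase filter with h[k] = h[N−1−k]. Samples that share a coefficient are added first, starting from the two ends of the delay line and working inward. The function names are hypothetical.

```python
def fir_direct(h, window):
    """Plain sum of products over the delay-line contents."""
    return sum(c * s for c, s in zip(h, window))

def fir_folded(h, window):
    """Folded FIR output for symmetric h: pair up samples from the outside in.

    Each pair of samples sharing a coefficient is added before the multiply,
    which roughly halves the number of multiplications.
    """
    n = len(h)
    acc = 0
    for k in range(n // 2):                      # k = 0 is the outermost pair
        acc += h[k] * (window[k] + window[n - 1 - k])
    if n % 2:                                    # centre tap for odd lengths
        acc += h[n // 2] * window[n // 2]
    return acc
```

With integer data both forms give the same result; in limited-precision arithmetic, the accumulation order determines when small terms are added and so how much round-off they pick up.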
Handling Quantization Effects
Quantization Effects
Finite word length effects
• Overflow errors
Can be avoided by appropriate scaling

• Round-off errors
Difficult to avoid - requires appropriate fixed-point arithmetic

Round-off Noise
Product of fixed-point numbers
• Product output requires more bits than inputs
• Truncation or rounding of result can lead to errors
• Extended precision registers help in reducing this error

Sum of fixed-point numbers

• Output sum requires one-bit more than inputs
• Truncation or rounding of result can lead to errors
• Not as severe as in products
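The difference between truncation and rounding when a wide result is shortened can be sketched with integer shifts. This assumes two's complement integers and a shift of at least one bit; the function names are illustrative.

```python
def truncate(value, shift):
    """Drop the lowest `shift` bits (arithmetic shift, rounds toward -infinity)."""
    return value >> shift

def round_nearest(value, shift):
    """Add half an output LSB before shifting, so the result rounds to nearest."""
    return (value + (1 << (shift - 1))) >> shift
```

For example, the Q6 value 30 (0.46875) shortened to Q3 becomes 3 (0.375) with truncation but 4 (0.5) with rounding, so rounding gives the smaller error here.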

Scaling
Scaling
• Prevents overflow
• Provides a trade-off between SNR and overflow

Scaling in filter design/implementation

• Normalize inputs, coefficients to ±1
• Based on magnitude of frequency response

Approaches to Scaling
Absolute scaling
• Scale assuming worst-case inputs/data
• Guarantees no overflow
• Leads to less accurate results (more quantization error)
Dynamic scaling
• Monitor range of variables and scale if required
• Increases computation
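One common way to realize absolute (worst-case) scaling for an FIR filter is to bound the output by the sum of coefficient magnitudes, since |y[n]| ≤ max|x| · Σ|h[k]|. The slides do not name a specific rule, so this bound and the function name `absolute_scale_factor` are assumptions chosen for illustration.

```python
def absolute_scale_factor(h, x_max=1.0, y_max=1.0):
    """Scale factor that guarantees |y| <= y_max for any input with |x| <= x_max.

    Uses the worst-case gain sum(|h[k]|); no scaling is applied when the
    filter cannot overflow anyway.
    """
    worst_case_gain = sum(abs(c) for c in h)
    return min(1.0, y_max / (x_max * worst_case_gain))

h = [0.5, 1.0, 0.5]
scale = absolute_scale_factor(h)
h_scaled = [scale * c for c in h]
```

The price of the guarantee is that typical signals then use only part of the available range, which is the extra quantization error mentioned above.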

Floating-point to Fixed-point
• Implement and verify floating-point algorithm
• Estimate minimum/maximum (range) of variables

• Convert floating-point variables to fixed-point

• Decide on scaling, based on architecture (word length)
• Range of variables can help in fixing integer word length
(IWL)

• Replace floating-point arithmetic with fixed-point arithmetic

• Consider available accumulator and register word lengths
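The step of fixing the integer word length from the observed range can be sketched as below. It assumes the range is estimated from recorded floating-point values and that the sign bit is counted separately from the IWL; the function name `integer_word_length` is hypothetical.

```python
import math

def integer_word_length(samples):
    """Smallest IWL such that every sample magnitude is below 2**IWL."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return 0
    return max(0, math.floor(math.log2(peak)) + 1)
```

A variable that stays within ±0.7 needs no integer bits, while one that reaches 3.2 needs two. The rule is slightly conservative for a negative peak that is an exact power of two.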

FFT Computation: Finite Word-length
• Similar effects (quantization, round-off and overflow) also
affect FFT computation
• Depends on FFT length used
• Use appropriate scaling at each stage of FFT (before a
butterfly computation)
Today’s Lab
Lab Exercises
• Evaluating y = ax + b
• Evaluating y[n] = h[n] ∗ x[n]
• FIR quantization effects

filterDesigner
