Floating Point Arithmetic: A Comprehensive Guide
Floating point arithmetic is a fundamental concept in computer science,
enabling the approximate representation and manipulation of real numbers
in a fixed number of bits.
SUBMITTED BY:-
PRASHANT SHARMA
Representation of Floating Point Numbers
Scientific Notation
Floating point numbers are represented using a form of scientific notation. They consist of three components: a sign bit, a mantissa, and an exponent.
Precision and Range
The mantissa determines the precision of the number, while the exponent controls its range. This representation allows for a wide range of values, both very small and very large.
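The three components can be inspected directly. The following is a minimal sketch, assuming the float is an IEEE 754 double (as Python's `float` is on common platforms); the helper name `decompose` is hypothetical and chosen only for illustration.

```python
import struct

def decompose(x: float):
    """Split a double into its sign, stored exponent, and mantissa fields."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)    # 52 fraction bits after the implicit leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = decompose(-6.5)
# -6.5 = -1.625 * 2**2: sign bit 1, unbiased exponent 2, fraction 0.625
print(sign, exponent - 1023, mantissa / 2**52)  # 1 2 0.625
```

The mantissa field stores only the fraction, because the leading 1 of a normal number is implicit.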
Basic Floating Point Operations
Addition
Floating point addition involves aligning the exponents and adding the mantissas, potentially resulting in overflow or underflow.
Subtraction
Floating point subtraction is similar to addition, but with the sign of the subtrahend flipped. It can also lead to cancellation error.
Multiplication
Floating point multiplication involves multiplying the mantissas and adding the exponents. This operation can lead to rounding errors.
Division
Floating point division involves dividing the mantissas and subtracting the exponents. It is prone to potential division by zero errors.
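The mantissa and exponent rules for multiplication and division can be sketched with `math.frexp` and `math.ldexp` from the Python standard library. This is an illustration of the idea only, not how hardware implements these operations; the function names `multiply_via_parts` and `divide_via_parts` are hypothetical.

```python
import math

def multiply_via_parts(x: float, y: float) -> float:
    """Multiply the mantissas and add the exponents."""
    mx, ex = math.frexp(x)   # x == mx * 2**ex, with 0.5 <= |mx| < 1
    my, ey = math.frexp(y)
    return math.ldexp(mx * my, ex + ey)

def divide_via_parts(x: float, y: float) -> float:
    """Divide the mantissas and subtract the exponents."""
    mx, ex = math.frexp(x)
    my, ey = math.frexp(y)
    return math.ldexp(mx / my, ex - ey)

print(multiply_via_parts(6.0, 0.75))  # 4.5
print(divide_via_parts(6.0, 0.75))    # 8.0

# Addition aligns exponents first, so a small addend can be lost entirely
# when it falls below the last mantissa bit of the larger operand.
big = 2.0 ** 53
print(big + 1.0 == big)  # True
```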
Rounding and Precision Errors
1 Limited Precision
Floating point numbers have a finite precision, so most real numbers cannot be represented exactly and must be rounded to the nearest representable value.
2 Error Accumulation
Rounding errors can accumulate over multiple operations, potentially impacting the accuracy of the final result.
3 Catastrophic Cancellation
Subtracting two nearly equal floating point numbers can lead to a
significant loss of precision, known as catastrophic cancellation.
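These three effects can be observed in a few lines of Python. The sketch below assumes IEEE 754 doubles and uses an explicit loop for the running total, since the built-in `sum` may apply its own error compensation in recent Python versions.

```python
import math

# Limited precision: 0.1, 0.2 and 0.3 have no exact binary representation.
print(0.1 + 0.2 == 0.3)  # False

# Error accumulation: adding 0.1 ten times does not give exactly 1.0.
values = [0.1] * 10
naive_total = 0.0
for v in values:
    naive_total += v
print(naive_total == 1.0)         # False
print(math.fsum(values) == 1.0)   # True: fsum tracks the lost low-order bits

# Catastrophic cancellation: the leading digits cancel, leaving mostly
# the rounding error that was introduced when computing 1.0 + 1e-15.
cancelled = (1.0 + 1e-15) - 1.0
relative_error = abs(cancelled - 1e-15) / 1e-15
print(cancelled, relative_error)
```

The relative error of the last result is many orders of magnitude larger than the roughly 1e-16 relative precision of a single double operation.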
IEEE 754 Standard
1 Standardization
The IEEE 754 standard defines the representation and behavior of floating point numbers across different platforms and architectures.
2 Precision Formats
It specifies different precision formats, including single-precision (32 bits) and double-precision (64 bits).
3 Rounding Modes
The standard defines different rounding modes, allowing for control over how rounding is handled during operations.
4 Special Values
It defines special values, such as infinity, NaN (Not a Number), and denormal numbers, to handle exceptional situations.
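The difference between the single and double formats can be illustrated by round-tripping a double through 32 bits. This is a minimal sketch, assuming Python's `float` is an IEEE 754 double; the helper `to_single` is a hypothetical name. Python's built-in `float` does not expose the rounding mode, so only the default round-to-nearest behavior is shown.

```python
import struct
import sys

def to_single(x: float) -> float:
    """Round a double to single precision and convert it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(sys.float_info.mant_dig)       # 53 significant bits in a double
print(to_single(0.5) == 0.5)         # True: 0.5 is exact in both formats
print(to_single(0.1) == 0.1)         # False: the single format keeps fewer bits
print(to_single(1 + 2**-25) == 1.0)  # True: the increment is below single precision
```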
Denormal Numbers and Underflow
1 Denormal Numbers
Denormal numbers are used to represent values smaller than the smallest normal number. They have a reduced precision and are used to avoid abrupt underflow.
2 Underflow
Underflow occurs when the result of a calculation is too small to be represented as a normal floating point number. This can lead to loss of precision.
3 Gradual Underflow
Denormal numbers help to provide gradual underflow, reducing the impact of underflow on the accuracy of calculations.
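Gradual underflow can be observed near the bottom of the double range. The sketch below assumes IEEE 754 doubles on a platform that does not flush denormal results to zero, which is the usual case for CPython on mainstream hardware; the variable names are chosen only for illustration.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normal double
smallest_subnormal = 5e-324            # smallest positive denormal, 2**-1074

# Halving the smallest normal number gives a denormal, not zero.
half_min = smallest_normal / 2
print(0.0 < half_min < smallest_normal)  # True

# Below the smallest denormal the result finally underflows to zero.
print(smallest_subnormal / 2 == 0.0)     # True

# Reduced precision: a denormal has few mantissa bits left, so a
# divide-then-multiply round trip that is exact for normal numbers loses data.
tiny = 3 * smallest_subnormal
normal = 3.0
print((normal / 4) * 4 == normal)  # True
print((tiny / 4) * 4 == tiny)      # False
```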
Floating Point Exceptions and Special Values
1 Overflow
An overflow occurs when the result of a calculation is too large to be represented as a floating point number.
2 Division by Zero
Dividing a nonzero number by zero raises the divide-by-zero exception in floating point arithmetic. Under the default IEEE 754 handling, the result is a signed infinity.
3 NaN
NaN (Not a Number) is a special value used to represent undefined or invalid results, such as the result of dividing zero by zero.
4 Infinity
Infinity is a special value used to represent values that are larger than the maximum representable floating point number.
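The special values can be produced and tested from Python. This is a sketch assuming IEEE 754 doubles. Note that Python itself raises `ZeroDivisionError` for float division by zero rather than returning a special value, which is a language-level choice and not what IEEE 754 default handling prescribes.

```python
import math
import sys

# Overflow: a result beyond the largest finite double becomes infinity.
overflowed = sys.float_info.max * 10
print(overflowed == math.inf)   # True

# NaN: invalid operations such as inf - inf have no meaningful result.
invalid = math.inf - math.inf
print(math.isnan(invalid))      # True
print(invalid == invalid)       # False: NaN compares unequal even to itself

# Division by zero: Python surfaces this as an exception.
try:
    1.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True
print(raised)                   # True
```

Because NaN never compares equal to anything, `math.isnan` is the reliable way to test for it.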
Practical Considerations and Best Practices
Understanding the limitations of floating point arithmetic and following best practices can help mitigate errors and ensure
reliable results.
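The text does not list specific practices. One commonly cited example, shown here as an assumption rather than as the author's recommendation, is to compare floating point results with a tolerance instead of `==`, using `math.isclose` from the Python standard library.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because of rounding in the inputs and the sum.
print(total == 0.3)              # False

# A relative tolerance (default rel_tol=1e-09) absorbs the rounding error.
print(math.isclose(total, 0.3))  # True

# Near zero a relative tolerance is not enough, so an absolute one is needed.
print(math.isclose(1e-12, 0.0))                # False
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))  # True
```

The tolerance values are illustrative. Suitable values depend on the magnitude of the data and the accuracy the application needs.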