Floating Point Number Representation:
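When you have to represent very small or very large numbers, a fixed-point representation will not do.
With a fixed number of bits on each side of the binary point, such numbers either fall outside the
representable range or lose their significant digits.
Therefore, you will have to look at floating-point representations, where the binary point is assumed to
be floating.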
When you consider a decimal number 12.34 * 10**7, this can also be treated as 0.1234 * 10**9, where 0.1234 is
the fixed-point mantissa.
The other part represents the exponent value, and indicates that the actual position of the decimal point is
9 positions to the right of the indicated point in the fraction (a negative exponent would move it to the left).
Since the binary point can be moved to any position and the exponent value adjusted appropriately, it is
called a floating-point representation.
By convention, you generally go in for a normalized representation, wherein the binary point is placed
to the right of the first nonzero (significant) digit.
The base need not be specified explicitly and the sign, the significant digits and the signed exponent
constitute the representation.
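As a rough illustration of normalization, the following Python sketch splits a nonzero finite value into a sign, a significand and a signed exponent, with the significand scaled so that exactly one nonzero binary digit sits to the left of the binary point. The function name normalize is hypothetical, and base 2 is assumed.

```python
import math

def normalize(x):
    """Return (sign, significand, exponent) with 1 <= significand < 2.

    Assumes x is nonzero and finite; base 2 is implied, not stored.
    """
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    m, e = math.frexp(abs(x))      # abs(x) == m * 2**e with 0.5 <= m < 1
    # Move the binary point one place right so the leading digit is 1.
    return sign, m * 2.0, e - 1

print(normalize(6.5))        # (0, 1.625, 2)   since 6.5 = 1.625 * 2**2
print(normalize(-0.15625))   # (1, 1.25, -3)   since 0.15625 = 1.25 * 2**(-3)
```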
The IEEE (Institute of Electrical and Electronics Engineers) has produced a standard for floating point
arithmetic.
This standard specifies how single precision (32 bit) and double precision (64 bit) floating point
numbers are to be represented, as well as how arithmetic should be carried out on them.
The IEEE single precision floating point standard representation requires a 32 bit word, which may be
represented as numbered from 0 to 31, left to right.
The first bit is the sign bit, S, the next eight bits are the exponent bits, E', and the final 23 bits are the
fraction, F.
Instead of the signed exponent E, the value stored is an unsigned integer E' = E + 127, called the
excess-127 format. For normalized numbers, E' is in the range 0 < E' < 255; the end values 0 and 255
are reserved for the special cases listed below.
S E'E'E'E'E'E'E'E' FFFFFFFFFFFFFFFFFFFFFFF
0 1             8  9                    31
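The three fields can be pulled out of a 32-bit word with shifts and masks. The following is a minimal Python sketch, assuming the standard struct module is used to obtain the single-precision bit pattern; the function name float32_fields is hypothetical.

```python
import struct

def float32_fields(x):
    """Split x (rounded to single precision) into the integers (S, E', F)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                 # bit 0 in the numbering above (leftmost)
    e_biased = (bits >> 23) & 0xFF # bits 1 to 8, the excess-127 exponent E'
    f = bits & 0x7FFFFF            # bits 9 to 31, the fraction F
    return s, e_biased, f

s, e, f = float32_fields(6.5)
print(s, format(e, "08b"), format(f, "023b"))
# 0 10000001 10100000000000000000000
```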
The value V represented by the word may be determined as follows:
• If E' = 255 and F is nonzero, then V = NaN ("Not a number")
• If E' = 255 and F is zero and S is 1, then V = -Infinity
• If E' = 255 and F is zero and S is 0, then V = Infinity
• If 0 < E' < 255 then V = (-1)**S * 2**(E'-127) * (1.F) where "1.F" is intended to represent the binary
number created by prefixing F with an implicit leading 1 and a binary point.
• If E' = 0 and F is nonzero, then V = (-1)**S * 2**(-126) * (0.F). These are "unnormalized" values.
• If E' = 0 and F is zero and S is 1, then V = -0
• If E' = 0 and F is zero and S is 0, then V = 0
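The rules above translate almost line for line into code. The sketch below is one possible Python rendering, not a reference implementation; the name decode_float32 is hypothetical, and S, E' and F are assumed to be passed as plain integers.

```python
import math

def decode_float32(s, e_biased, f):
    """Value of a single-precision word from S (0 or 1), E' (0..255), F (0..2**23 - 1)."""
    sign = -1.0 if s else 1.0
    if e_biased == 255:
        # F nonzero gives NaN, F zero gives +/- Infinity.
        return math.nan if f != 0 else sign * math.inf
    if e_biased == 0:
        # Zero or unnormalized: no implicit leading 1, exponent fixed at -126.
        return sign * (f / 2**23) * 2.0**-126
    # Normalized: implicit leading 1, i.e. the significand is 1.F.
    return sign * (1 + f / 2**23) * 2.0**(e_biased - 127)

print(decode_float32(0, 129, 0b101 << 20))   # 6.5
print(decode_float32(1, 255, 0))             # -inf
```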
For example (significands such as 1.101 and 0.1 are written in binary):
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN
0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) (unnormalized value)
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149)
(unnormalized value; smallest positive value)
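Bit patterns like the ones listed above can be checked against a machine's own single-precision arithmetic. The following Python sketch assumes the standard struct module; the helper name from_bit_string is hypothetical.

```python
import struct

def from_bit_string(pattern):
    """Interpret a pattern such as '0 10000001 1010...0' (32 bits, spaces ignored) as a float."""
    bits = int(pattern.replace(" ", ""), 2)
    # Write the 32 bits as an unsigned integer, then reread them as a float.
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

print(from_bit_string("0 10000001 10100000000000000000000"))                # 6.5
print(from_bit_string("0 00000000 00000000000000000000001") == 2.0**-149)   # True
```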
Double Precision Numbers:
The IEEE double precision floating point standard representation requires a 64-bit word, which may be
represented as numbered from 0 to 63, left to right.
The first bit is the sign bit, S, the next eleven bits are the excess-1023 exponent bits, E', and the final 52
bits are the fraction, F:
S E'E'E'E'E'E'E'E'E'E'E' FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1                   11 12                                                63
The value V represented by the word may be determined as follows:
• If E' = 2047 and F is nonzero, then V = NaN ("Not a number")
• If E' = 2047 and F is zero and S is 1, then V = -Infinity
• If E' = 2047 and F is zero and S is 0, then V = Infinity
• If 0 < E' < 2047 then V = (-1)**S * 2**(E'-1023) * (1.F) where "1.F" is intended to represent
the binary number created by prefixing F with an implicit leading 1 and a binary point.
• If E' = 0 and F is nonzero, then V = (-1)**S * 2**(-1022) * (0.F). These are "unnormalized" values.
• If E' = 0 and F is zero and S is 1, then V = -0
• If E' = 0 and F is zero and S is 0, then V = 0
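The double precision rules can be exercised in the same way as the single precision ones. The sketch below, in Python, extracts S, E' and F from a 64-bit word and then rebuilds the value from the rules; the names float64_fields and decode_float64 are hypothetical, and Python's float is assumed to be an IEEE double, as it is on common platforms.

```python
import struct

def float64_fields(x):
    """Split a double into the integers (S, E', F)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                    # bit 0 (leftmost)
    e_biased = (bits >> 52) & 0x7FF   # bits 1 to 11, the excess-1023 exponent E'
    f = bits & ((1 << 52) - 1)        # bits 12 to 63, the fraction F
    return s, e_biased, f

def decode_float64(s, e_biased, f):
    """Rebuild the value from S, E' and F using the rules listed above."""
    sign = -1.0 if s else 1.0
    if e_biased == 2047:
        return float("nan") if f != 0 else sign * float("inf")
    if e_biased == 0:
        # Zero or unnormalized: significand is 0.F, exponent fixed at -1022.
        return sign * (f / 2**52) * 2.0**-1022
    # Normalized: significand is 1.F.
    return sign * (1 + f / 2**52) * 2.0**(e_biased - 1023)

print(float64_fields(6.5))                     # (0, 1025, 2814749767106560)
print(decode_float64(*float64_fields(-0.1)))   # -0.1
```

Round-tripping a value through float64_fields and decode_float64 should return the original number for every finite double, which gives a quick check that the rules have been transcribed correctly.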