Floating Point Representation
Oyewo Maroof Olayinka
Department of Electrical/Electronic Engineering,
Federal University of Petroleum Resources, Effurun
8th August, 2019
[email protected]
Abstract
The term floating point refers to the fact that a number's radix point (decimal point, or, more
commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the
significant digits of the number. This position is indicated as the exponent component, and thus the
floating-point representation can be thought of as a kind of scientific notation. (Wikipedia, 2019)
In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real
numbers as an approximation so as to support a trade-off between range and precision. For this
reason, floating-point computation is often found in systems which include very small and very large
real numbers, which require fast processing times. (Wikipedia, 2019)
In this paper, we examine the floating point representation of binary numbers and its applications in the field of computing.
Overview
Decimal Floating Point Representation
A floating-point number (or real number) can represent a very large (1.23×10^88) or a very small (1.23×10^-88) value. It could also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as illustrated below. (C. Hock-Chuan, 2014. [Online]: https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html)
[Figure: the ranges of values representable in floating point, from ntu.edu.sg]
A floating-point number usually consists of a fractional representation of the actual number (the fraction, F), which is derived by shifting the radix point from the end (after the LSB) to just right of the MSB, together with an exponent, e, of a certain radix, r (usually 10 for the decimal system). The exponent, or index of the radix, indicates the number of times the radix point has to be shifted to the right (for a number > 1) or to the left (for a number < 1) to obtain the number in its real form. This can be represented as: F × r^e
The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2; the binary number 1010.1011B can be normalized as 1.0101011B×2^3. (C. Hock-Chuan, 2014. [Online]: https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html)
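To make the normalization step concrete, here is a minimal Python sketch; normalize_decimal is a hypothetical helper name, not from any of the cited sources:

import math

def normalize_decimal(x):
    # Return (fraction, exponent) such that x == fraction * 10**exponent,
    # with a single non-zero digit before the radix point.
    if x == 0:
        return 0.0, 0
    e = math.floor(math.log10(abs(x)))  # how far the radix point must shift
    return x / 10**e, e

print(normalize_decimal(123.4567))  # approximately (1.234567, 2), i.e. 1.234567 × 10^2
print(normalize_decimal(55.66))     # approximately (5.566, 1), i.e. 5.566 × 10^1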
IEEE 754: floating point in modern computers
We may do the same in binary, and this forms the foundation of our floating point numbers. Here it is not a decimal point we are moving but a binary point, and because it moves it is referred to as floating. What we will look at below is what is referred to as the IEEE 754 Standard for representing floating point numbers. The standard specifies the number of bits used for each section (exponent, mantissa and sign) and the order in which they are represented.
This first standard is followed by almost all modern machines. It was revised in 2008. IBM
mainframes support IBM's own hexadecimal floating point format and IEEE 754-2008 decimal
floating point in addition to the IEEE 754 binary format.
The standard specifies the following formats for floating point numbers:
Single precision, which uses 32 bits and has the following layout:
1 bit for the sign of the number. 0 means positive and 1 means negative.
8 bits for the exponent.
23 bits for the mantissa.
e.g. 1 10110110 11010011001010011101011
(sign | exponent | mantissa)
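To see this layout in practice, the following Python sketch extracts the three fields from a 32-bit float using the standard library's struct module (float32_fields is a hypothetical helper name):

import struct

def float32_fields(x):
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign     = bits >> 31            # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

print(float32_fields(-26.0))  # (1, 131, 5242880); exponent 131 = 127 + 4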
Double precision, which uses 64 bits and has the following layout:
1 bit for the sign of the number. 0 means positive and 1 means negative.
11 bits for the exponent.
52 bits for the mantissa.
e.g. 0 00011100010 0100001000000000000001110100000110000000000000000000
Double precision has more bits, allowing much larger and much smaller numbers to be represented. As the mantissa is also larger, the degree of accuracy is increased (remember that many fractions cannot be accurately represented in binary). While double precision floating point numbers have these advantages, they also require more processing power. With increases in CPU processing power and the move to 64-bit computing, many programming languages and software simply default to double precision. (R. Chadwick, 2019. [Online]: https://ryanstutorials.net/binary-tutorial/binary-floating-point.php)
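A small Python sketch of this precision difference, rounding a value through 32 bits and comparing it with the 64-bit default (round_to_float32 is a hypothetical helper name):

import struct

def round_to_float32(x):
    # Pack as a 32-bit float and unpack again, discarding extra precision.
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 0.1 has no exact binary representation; single precision keeps
# far fewer correct digits than double precision.
print(f"{round_to_float32(0.1):.20f}")  # 0.10000000149011611938
print(f"{0.1:.20f}")                    # 0.10000000000000000555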
The sign bit is the first bit of the binary representation: '1' implies a negative number and '0' implies a positive number. Example: 11000001110100000000000000000000 is a negative number.
The exponent is determined by the next 8 bits of the binary representation. For 32-bit floating point representation the offset is 127; it is known as the bias, and is given by 2^(k-1) - 1, where k is the number of bits in the exponent field.
There are 3 exponent bits in the 8-bit representation and 8 exponent bits in the 32-bit representation.
Thus
bias = 3 for 8-bit conversion (2^(3-1) - 1 = 4 - 1 = 3)
bias = 127 for 32-bit conversion (2^(8-1) - 1 = 128 - 1 = 127)
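The bias formula can be checked with a few lines of Python:

def bias(k):
    # Exponent bias for a k-bit exponent field: 2^(k-1) - 1.
    return 2 ** (k - 1) - 1

print(bias(3))   # 3    (8-bit mini-float with a 3-bit exponent)
print(bias(8))   # 127  (IEEE 754 single precision)
print(bias(11))  # 1023 (IEEE 754 double precision)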
Example: 01000001110100000000000000000000
10000011 = (131)10
131 - 127 = 4
Hence the exponent of 2 will be 4, i.e. 2^4 = 16.
The mantissa is calculated from the remaining 23 bits of the binary representation. It consists of an implicit leading '1' and a fractional part, which is determined by weighting each mantissa bit by successive negative powers of 2:
fraction = b1×(1/2) + b2×(1/4) + b3×(1/8) + …
Example:
01000001110100000000000000000000
The fractional part of the mantissa is given by:
1×(1/2) + 0×(1/4) + 1×(1/8) + 0×(1/16) + … = 0.625
Thus the mantissa will be 1 + 0.625 = 1.625
The decimal number is hence given as: sign × exponent × mantissa = (+1)×(16)×(1.625) = 26. (For the negative pattern 11000001110100000000000000000000 in the sign-bit example above, the result would be (-1)×(16)×(1.625) = -26.)
(GeeksforGeeks, 2019. [Online]: https://www.geeksforgeeks.org/floating-point-representation-digital-logic/)
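Putting the three fields together, here is a short Python sketch that decodes the worked example above by hand:

bits = 0b01000001110100000000000000000000

sign     = bits >> 31                   # 0, so the number is positive
exponent = ((bits >> 23) & 0xFF) - 127  # 131 - 127 = 4
fraction = (bits & 0x7FFFFF) / 2**23    # 0.625
mantissa = 1 + fraction                 # implicit leading 1 gives 1.625

value = (-1) ** sign * 2**exponent * mantissa
print(value)  # 26.0 (flipping the sign bit would give -26.0)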
Decimal to Floating Point Conversion
To convert a decimal number to binary floating point representation:
1. Convert the absolute value of the decimal number to a binary integer plus a binary fraction.
2. Normalize the number in binary scientific notation to obtain m and e.
3. Set s=0 for a positive number and s=1 for a negative number.
To convert 22.625 to binary floating point:
1. Convert decimal 22 to binary 10110. Convert decimal 0.625 to binary 0.101. Combine integer
and fraction to obtain binary 10110.101.
2. Normalize binary 10110.101 to obtain 1.0110101 × 2^4. Thus, m = (1.0110101)2 and e = 4 = (100)2.
3. The number is positive, so s = 0. (M. Roth, 1996. [Online]: https://www.cs.uaf.edu/2004/fall/cs301/notes/node49.html)
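This conversion can be checked against the bit pattern the Python standard library produces for 22.625:

import struct

(bits,) = struct.unpack(">I", struct.pack(">f", 22.625))
print(f"{bits:032b}")
# 01000001101101010000000000000000
# i.e. s = 0, exponent = 10000011 (131 - 127 = 4),
# mantissa = 01101010000000000000000 -> 1.0110101 × 2^4 = 10110.101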
Conclusion
While floating-point representation can't represent all numbers precisely, it does give us a guaranteed number of significant digits. For an 8-bit representation, we get a single bit of precision, which is pretty limited. Using a 16-bit representation, we get more mantissa and exponent bits, which increase precision and range of numbers respectively; even so, this can only represent 2^16 distinct real numbers.
Nearly all computers today follow the IEEE 754 standard (32-bit and 64-bit formats) for representing
floating-point numbers. This standard is similar to the 8-bit and 16-bit formats we've explored
already, but the standard deals with longer bit lengths to gain more precision and range; and it
incorporates two special cases to deal with very small and very large numbers. (C. Burch, 2011)
The IEEE 754 standard is, for now, the de facto standard for the floating point representation of binary numbers in modern computer systems, until an even more efficient standard is developed in the future. That is something exciting to look forward to.