
Floating Point Representation

Oyewo Maroof Olayinka

Department of Electrical/Electronic Engineering,

Federal University of Petroleum Resources, Effurun

8th August, 2019

[email protected]

Abstract

The term floating point refers to the fact that a number's radix point (decimal point, or, more

commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the

significant digits of the number. This position is indicated as the exponent component, and thus the

floating-point representation can be thought of as a kind of scientific notation. (Wikipedia, 2019)

In computing, floating-point arithmetic (FP) is arithmetic using formulaic representation of real

numbers as an approximation so as to support a trade-off between range and precision. For this

reason, floating-point computation is often found in systems which include very small and very large

real numbers, which require fast processing times. (Wikipedia, 2019)

In this paper, we will examine the floating-point representation of binary numbers and its applications in the field of computing.

Overview
Decimal Floating Point Representation

A floating-point number (or real number) can represent a very large value (1.23×10^88) or a very small value (1.23×10^-88). It can also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as illustrated: (C. Hock-Chuan, 2014. [Online]: https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html)

[Image from www.ntu.edu.sg: illustration of the range of floating-point values]

A floating-point number usually consists of a fractional representation of the actual number (the fraction, F) – derived by shifting the decimal point from the end (after the LSB) to just right of the MSB – and an exponent, e, of a certain radix, r (usually 10 for the decimal system). The exponent, or index of the radix, indicates the number of times the point must be shifted to the right (for a number > 1) or to the left (for a number < 1) to recover the number in its real form. This can be represented as: F × r^e

The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2; the binary number 1010.1011B can be normalized as 1.0101011B×2^3. (C. Hock-Chuan, 2014. [Online]: https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html)
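As an illustrative sketch of this normalization step (the "normalize" helper below is hypothetical, not a library function), it can be expressed in Python:

```python
import math

def normalize(x, radix=2):
    """Return (f, e) such that x == f * radix**e with 1 <= |f| < radix."""
    if x == 0:
        return 0.0, 0
    e = math.floor(math.log(abs(x), radix))  # how far the point must shift
    return x / radix**e, e

# decimal: 123.4567 normalizes to 1.234567 x 10^2
f, e = normalize(123.4567, radix=10)
# binary: 10.6875 (1010.1011B) normalizes to 1.0101011B x 2^3
g, k = normalize(10.6875, radix=2)
```

Here 1.0101011B equals 1.3359375 in decimal, so the binary example returns (1.3359375, 3).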

IEEE 754: floating point in modern computers

We may do the same in binary, and this forms the foundation of our floating point number. Here it is not a decimal point we are moving but a binary point, and because it moves it is referred to as floating. What we will look at below is what is referred to as the IEEE 754 Standard for representing floating point numbers. The standard specifies the number of bits used for each section (exponent, mantissa and sign) and the order in which they are represented.

This first standard is followed by almost all modern machines. It was revised in 2008. IBM

mainframes support IBM's own hexadecimal floating point format and IEEE 754-2008 decimal

floating point in addition to the IEEE 754 binary format.

The standard specifies the following formats for floating point numbers:

Single precision, which uses 32 bits and has the following layout:

 1 bit for the sign of the number. 0 means positive and 1 means negative.

 8 bits for the exponent.

 23 bits for the mantissa.

e.g. 1 10110110 11010011001010011101011

(sign | exponent | mantissa)

Double precision, which uses 64 bits and has the following layout:

 1 bit for the sign of the number. 0 means positive and 1 means negative.

 11 bits for the exponent.

 52 bits for the mantissa.

e.g. 0 00011100010 0100001000000000000001110100000110000000000000000000

Double precision has more bits, allowing for much larger and much smaller numbers to be

represented. As the mantissa is also larger, the degree of accuracy is also increased (remember that

many fractions cannot be accurately represented in binary). Whilst double precision floating point

numbers have these advantages, they also require more processing power. With increases in CPU

processing power and the move to 64 bit computing a lot of programming languages and software

just default to double precision. (R. Chadwick, 2019. [Online]: https://ryanstutorials.net/binary-tutorial/binary-floating-point.php)
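These field layouts can be inspected directly in Python using the standard struct module; the "fields32" helper below is a hypothetical illustration of the 1/8/23-bit split for single precision:

```python
import struct

def fields32(x):
    """Split a float, packed as IEEE 754 single precision, into its three fields."""
    bits = int.from_bytes(struct.pack('>f', x), 'big')
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF         # 23-bit fraction field
    return sign, exponent, mantissa
```

For instance, fields32(-26.0) yields sign 1, biased exponent 131, and fraction bits 0x500000.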

The sign bit is the first bit of the binary representation: '1' implies a negative number and '0' implies a positive number. Example: 11000001110100000000000000000000 represents a negative number.

The exponent is determined by the next 8 bits of the binary representation. 127 is the unique number for 32-bit floating point representation; it is known as the bias. It is determined by 2^(k-1) - 1, where 'k' is the number of bits in the exponent field.

There are 3 exponent bits in 8-bit representation and 8 exponent bits in 32-bit representation.

Thus

bias = 3 for 8-bit conversion (2^(3-1) - 1 = 4 - 1 = 3)

bias = 127 for 32-bit conversion (2^(8-1) - 1 = 128 - 1 = 127)
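As a minimal check of the bias formula (the "bias" helper is a hypothetical one-liner):

```python
def bias(k):
    """Exponent bias for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1

# bias(3) gives 3 for the 8-bit toy format, bias(8) gives 127 for single
# precision, and bias(11) gives 1023 for double precision.
```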

Example: 01000001110100000000000000000000

The exponent bits are 10000011 = 131 in decimal.

131 - 127 = 4

Hence the exponent of 2 will be 4, i.e. 2^4 = 16.

The mantissa is calculated from the remaining 23 bits of the binary representation. It consists of an implicit leading '1' plus a fractional part, which is determined by weighting each mantissa bit by successive negative powers of two: b1×(1/2) + b2×(1/4) + b3×(1/8) + …

Example:

01000001110100000000000000000000

The fractional part of the mantissa is given by:

1×(1/2) + 0×(1/4) + 1×(1/8) + 0×(1/16) + … = 0.625

Thus the mantissa will be 1 + 0.625 = 1.625.

The decimal number is hence given as: (-1)^sign × 2^(exponent-127) × mantissa = (+1) × 16 × 1.625 = 26. (Note that this pattern's sign bit is 0; the earlier pattern 11000001110100000000000000000000, whose sign bit is 1, decodes to -26.)

(GeeksforGeeks, 2019. [Online]: https://www.geeksforgeeks.org/floating-point-representation-digital-logic/)
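Putting the sign, exponent and mantissa steps together, the decoding can be cross-checked in Python against the struct module's own interpretation of the same bit pattern (an illustrative sketch, not the source's code):

```python
import struct

bits = 0b01000001110100000000000000000000  # the 32-bit pattern from the example
s = bits >> 31                    # sign bit: 0 (positive)
e = (bits >> 23) & 0xFF           # biased exponent field: 131
frac = bits & 0x7FFFFF            # 23 fraction bits
mantissa = 1 + frac / 2**23       # implicit leading 1 gives 1.625
value = (-1)**s * 2**(e - 127) * mantissa
# cross-check against the machine decoding of the same bit pattern
assert value == struct.unpack('>f', bits.to_bytes(4, 'big'))[0] == 26.0
```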

Decimal to Floating Point Conversion

To convert a decimal number to binary floating point representation:

1. Convert the absolute value of the decimal number to a binary integer plus a binary fraction.

2. Normalize the number in binary scientific notation to obtain m and e.

3. Set s=0 for a positive number and s=1 for a negative number.

To convert 22.625 to binary floating point:

1. Convert decimal 22 to binary 10110. Convert decimal 0.625 to binary 0.101. Combine integer

and fraction to obtain binary 10110.101.

2. Normalize binary 10110.101 to obtain 1.0110101 × 2^4. Thus, m = 1.0110101B and e = 4 = 100B.

3. The number is positive, so s = 0. (M. Roth, 1996. [Online]: https://www.cs.uaf.edu/2004/fall/cs301/notes/node49.html)
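The three steps above can be verified in Python by assembling the fields by hand and comparing the result with struct.pack (an illustrative sketch; the variable names are assumptions):

```python
import struct

# 22.625 = 10110.101B, which normalizes to 1.0110101B x 2^4
s = 0                                 # step 3: positive, so sign bit 0
e = 4 + 127                           # biased exponent: 131 = 10000011B
frac = 0b01101010000000000000000      # 23 fraction bits of 1.0110101B
bits = (s << 31) | (e << 23) | frac
# the hand-assembled word matches the machine encoding of 22.625
assert bits.to_bytes(4, 'big') == struct.pack('>f', 22.625)
```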

Conclusion

While floating-point representation can’t represent all numbers precisely, it does give us a guaranteed number of significant digits. For an 8-bit representation, we get a single bit of precision, which is pretty limited. Using a 16-bit representation, we get more mantissa and exponent bits to increase precision and range of numbers respectively; again, this can only represent 2^16 distinct real numbers.

Nearly all computers today follow the IEEE 754 standard (32-bit and 64-bit formats) for representing

floating-point numbers. This standard is similar to the 8-bit and 16-bit formats we've explored

already, but the standard deals with longer bit lengths to gain more precision and range; and it

incorporates two special cases to deal with very small and very large numbers. (C. Burch, 2011)

The IEEE 754 standard is, for now, the de facto standard for floating-point representation of binary numbers in modern computer systems, until an even more efficient standard is developed in the future. This is something exciting to look forward to.
