COMPUTER ORGANIZATION &
ASSEMBLY LANGUAGE (CS-215T)
Lecture by:
Dr. Abdul Hameed
Assistant Professor
CSD, SSUET
Batch
2020F
Week 7
Floating-Point Representation, Conversion
& Floating-Point Arithmetic
Computer Science & Information Technology Department
Sir Syed University of Engg. & Tech.
FLOATING-POINT REPRESENTATION
Why the name “Floating-Point”?
Consider the number 32
It is called a fixed-point number
That is, the decimal point is fixed at the end
32 => 32.0
Remember Scientific Notation
32 can be written as 3.2 x 10^1
Now it is written as a real number with a fractional part
And the decimal point floats with respect to the power of 10
32.0 = 3.2 x 10^1 = 0.32 x 10^2 = 0.032 x 10^3
Or 32.0 = 32.0 x 10^0 = 320.0 x 10^-1 = 3200.0 x 10^-2
FLOATING-POINT REPRESENTATION
Why the name “Floating-Point”?
3.2 x 10^1 => 32.0 x 10^(1-1)
Moving the decimal point to the right results in
subtracting 1 from the power of 10
3.2 x 10^1 => 0.32 x 10^(1+1)
Moving the decimal point to the left results in
adding 1 to the power of 10
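The two rules above can be sketched in a few lines of Python. This is only an illustration: the helper names (shift_right, shift_left, value) are hypothetical, and Decimal is used so the decimal digits stay exact.

```python
from decimal import Decimal

def shift_right(significand: Decimal, exponent: int):
    """Move the decimal point one place right: subtract 1 from the power of 10."""
    return significand * 10, exponent - 1

def shift_left(significand: Decimal, exponent: int):
    """Move the decimal point one place left: add 1 to the power of 10."""
    return significand / 10, exponent + 1

def value(significand: Decimal, exponent: int) -> Decimal:
    """The number represented by significand x 10^exponent."""
    return significand * Decimal(10) ** exponent

s, e = Decimal("3.2"), 1
print(shift_right(s, e), value(*shift_right(s, e)))
print(shift_left(s, e), value(*shift_left(s, e)))
```

In both cases the represented value stays 32; only the position of the point and the exponent change.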
FLOATING-POINT REPRESENTATION
So far we know that
in floating-point there are three elements
a number, (Significand)
the base value, (Base)
and a power to the base, (Exponent)
S x B^E
The number is stored in memory with three fields:
Sign: (plus or minus)
Significand S
Exponent E
The base B is implicit and need not be stored
FLOATING-POINT REPRESENTATION
Typical 32-Bit Floating-Point Format
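A minimal sketch of splitting a 32-bit pattern into its three fields. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 significand bits); the function name split_fields is hypothetical.

```python
import struct

def split_fields(x: float):
    """Return (sign, exponent_field, significand_field) of x stored as a 32-bit float.

    Assumed layout: bit 31 = sign, bits 30..23 = biased exponent,
    bits 22..0 = significand (fraction) field.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent_field = (bits >> 23) & 0xFF
    significand_field = bits & 0x7FFFFF
    return sign, exponent_field, significand_field

print(split_fields(504.0))
print(split_fields(-1.0))
```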
FLOATING-POINT REPRESENTATION
Biased-Exponent?
Bias is a fixed value
Bias is used to get the true value of exponent
For this, bias is subtracted from the exponent field
Typical value is (2^(k-1) - 1)
Here, k = number of bits in binary exponent
If k=8, then bias is 127
True exponent values are in the range
-127 to +128
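The bias formula and the resulting range can be checked with a short sketch (the function names are hypothetical):

```python
def bias(k: int) -> int:
    """Bias for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1

def true_exponent_range(k: int):
    """Smallest and largest true exponent when every field value 0 .. 2^k - 1 is used."""
    b = bias(k)
    return 0 - b, (2 ** k - 1) - b

print(bias(8))                  # 127
print(true_exponent_range(8))   # (-127, 128)
```

Note that IEEE 754 reserves the all-zeros and all-ones exponent fields for special values, so its normalized 32-bit numbers use true exponents from -126 to +127.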
FLOATING-POINT’S BENEFITS
Fixed-Point has limitations
Cannot represent very large numbers
Cannot represent fractions
Recall Scientific Notation,
We can get around these limitations
A very large number can be represented as
976,000,000,000,000 = 9.76 x 10^14
A very small fraction can be represented as
0.0000000000000967 = 9.67 x 10^-14
IEEE STANDARD FOR FLOATING-POINT
Floating-Point representation is defined in
IEEE Standard 754,
adopted in 1985,
revised in 2008
IEEE 754-2008 covers both decimal and binary
floating-point representations.
Three basic binary formats have bit lengths of
32 bits with exponent of 8 bits
64 bits with exponent of 11 bits
128 bits with exponent of 15 bits
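As a rough sketch, the remaining parameters of each format follow from the total length and the exponent width, assuming one sign bit and the bias formula 2^(k-1) - 1. The function name format_parameters is hypothetical.

```python
def format_parameters(total_bits: int, exponent_bits: int):
    """Return (significand_bits, bias) for a binary floating-point format."""
    significand_bits = total_bits - 1 - exponent_bits  # 1 bit is the sign
    bias = 2 ** (exponent_bits - 1) - 1
    return significand_bits, bias

for total_bits, exponent_bits in [(32, 8), (64, 11), (128, 15)]:
    significand_bits, bias = format_parameters(total_bits, exponent_bits)
    print(f"{total_bits:>3} bits: exponent {exponent_bits}, "
          f"significand {significand_bits}, bias {bias}")
```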
IEEE STANDARD FOR
BINARY FLOATING-POINT
Three basic binary formats have bit lengths of
32 bits with exponent of 8 bits
IEEE STANDARD FOR
BINARY FLOATING-POINT
Three basic binary formats have bit lengths of
64 bits with exponent of 11 bits
IEEE STANDARD FOR
BINARY FLOATING-POINT
Three basic binary formats have bit lengths of
128 bits with exponent of 15 bits
FLOATING-POINT CONVERSION
PRINCIPLE
Conversion #1
IEEE 754 Conversion (32-Bits to Float value)
Divide the 32 bits into three fields: sign (1 bit), exponent (8 bits), significand (23 bits)
Convert the exponent field
from unsigned binary to unsigned decimal
subtract 127, call it ‘E’
Convert the significand field to a number
between 1 and 1.999... (1 + fraction), call it ‘S’
Float value = (sign) S x 2^E
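The steps above can be sketched as follows. This is a simplified illustration for normalized numbers only: the reserved exponent fields 0 and 255 (zero, subnormals, infinity, NaN) are not handled, and the function name decode_binary32 is hypothetical.

```python
def decode_binary32(bits: int) -> float:
    """Decode a 32-bit pattern (given as an integer) by following the slide's steps."""
    # Step 1: divide the 32 bits into three fields
    sign = (bits >> 31) & 1
    exponent_field = (bits >> 23) & 0xFF
    fraction_field = bits & 0x7FFFFF
    # Step 2: unsigned exponent field minus the bias 127 gives E
    e = exponent_field - 127
    # Step 3: significand S = 1 + fraction, a value in [1, 2)
    s = 1 + fraction_field / 2 ** 23
    # Step 4: float value = (sign) S x 2^E
    value = s * 2.0 ** e
    return -value if sign else value

print(decode_binary32(0x3F800000))  # 1.0
```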
FLOATING-POINT CONVERSION
PRINCIPLE
Conversion #1: Example
IEEE 754 Conversion (32-Bits to Float value)
Bits = 43FC0000 (hexadecimal)
Binary = 0100 0011 1111 1100 0000 0000 0000 0000
Sign = 0 (+ve)
Exponent field = 10000111 = 135
E = 135 - 127 = 8
Significand field = 11111000000000000000000
S = 1 + .5 + .25 + .125 + .0625 + .03125 = 1.96875
Float value = +1.96875 x 2^8 = +1.96875 x 256
= 504.0
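The hand calculation can be cross-checked with Python's standard struct module, which interprets raw bytes as an IEEE 754 32-bit float. The helper name hex_to_float32 is hypothetical.

```python
import struct

def hex_to_float32(hex_word: str) -> float:
    """Interpret an 8-digit hex word as a big-endian IEEE 754 32-bit float."""
    (value,) = struct.unpack(">f", bytes.fromhex(hex_word))
    return value

print(hex_to_float32("43FC0000"))  # 504.0
```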
FLOATING-POINT CONVERSION
PRINCIPLE
Conversion #2
IEEE 754 Conversion (Float value to 32-Bits)
Let f be the float value (non-zero)
Determine the largest power of 2 not greater than |f|,
call its exponent ‘p’, such that |f| = (|f| / 2^p) x 2^p
S = |f| / 2^p; subtract 1 (the leading 1 is implicit)
convert the remaining value to binary
with each bit position a negative power of 2
Also, E = p; add 127 and convert to binary
If f is negative, sign-bit = 1, else sign-bit = 0
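A minimal sketch of this procedure, under stated assumptions: f is non-zero and within the normalized 32-bit range, and any fraction bits beyond the 23rd are simply truncated rather than rounded as IEEE 754 requires. The function name encode_binary32 is hypothetical. math.frexp is used to find p because it returns an exact power-of-2 decomposition.

```python
import math

def encode_binary32(f: float) -> int:
    """Encode a non-zero float as a 32-bit pattern (normalized numbers only)."""
    if f == 0:
        raise ValueError("zero is a special case not covered by this sketch")
    sign = 1 if f < 0 else 0
    magnitude = abs(f)
    # frexp gives magnitude = m * 2^e with 0.5 <= m < 1, so p = e - 1 and S = 2m
    m, e = math.frexp(magnitude)
    p = e - 1
    s = m * 2                                   # S = |f| / 2^p, in [1, 2)
    fraction_field = int((s - 1) * 2 ** 23)     # drop the implicit leading 1
    exponent_field = p + 127                    # add the bias
    return (sign << 31) | (exponent_field << 23) | fraction_field

print(f"{encode_binary32(1208.0):08X}")  # 44970000
```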
FLOATING-POINT CONVERSION
PRINCIPLE
Conversion #2: Example
IEEE 754 Conversion (Float value to 32-Bits)
f = 1208.0 = (1208 / 1024) x 1024
= 1.1796875 x 2^10
S = 1.1796875; subtracting 1 gives 0.1796875
Converting 0.1796875 to binary
by successively subtracting negative powers of 2
0.1796875 - 0.125 (2^-3) = 0.0546875
0.0546875 - 0.03125 (2^-5) = 0.0234375
0.0234375 - 0.015625 (2^-6) = 0.0078125
0.0078125 - 0.0078125 (2^-7) = 0
Significand field = 00101110000000000000000 (23 bits)
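The subtraction method can be sketched as a loop over the negative powers of 2. The function name fraction_to_bits is hypothetical, and the 23-bit default width is an assumption matching the 32-bit format.

```python
def fraction_to_bits(fraction: float, width: int = 23) -> str:
    """Convert a fraction in [0, 1) to a bit string by subtracting negative powers of 2."""
    bits = []
    for position in range(1, width + 1):
        weight = 2.0 ** -position      # 0.5, 0.25, 0.125, ...
        if fraction >= weight:
            bits.append("1")           # this power of 2 fits: subtract it
            fraction -= weight
        else:
            bits.append("0")
    return "".join(bits)

print(fraction_to_bits(0.1796875))  # 00101110000000000000000
```

The 1 bits land at positions 3, 5, 6 and 7, matching the four subtractions in the example.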
FLOATING-POINT CONVERSION
PRINCIPLE
Conversion #2: Example (continued)
IEEE 754 Conversion (Float value to 32-Bits)
f = 1208.0 = (1208 / 1024) x 1024
= 1.1796875 x 2^10
E = 10, add 127,
E = 137
Converting E to unsigned binary
E = 10001001
Complete 32-bit value
= 0 10001001 00101110000000000000000
= 0100 0100 1001 0111 0000 0000 0000 0000
= 44970000 (hexadecimal)
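As a cross-check of the completed example, the standard struct module can pack 1208.0 into 32 bits and show the field groups. The helper name float32_fields is hypothetical.

```python
import struct

def float32_fields(f: float) -> str:
    """Return 'sign exponent significand = HEX' for f packed as a 32-bit float."""
    word = struct.pack(">f", f)
    bits = format(int.from_bytes(word, "big"), "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]} = {word.hex().upper()}"

print(float32_fields(1208.0))
# 0 10001001 00101110000000000000000 = 44970000
```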
Computer Science & Information Technology Department
Sir Syed University of Engg. & Tech. Lecture by: Mr. Shakir Karim
BINARY EXPONENT TO DECIMAL
CONVERSION TABLE
FLOATING-POINT ARITHMETIC
A floating-point operation may produce
Exponent Overflow
A positive exponent exceeds
the maximum possible exponent value.
may be designated as +∞ or -∞
Exponent Underflow
A negative exponent is less than
the minimum possible exponent value.
the number is too small to be represented
may be reported as 0
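Both conditions can be observed with ordinary Python floats. This sketch assumes Python floats are 64-bit IEEE 754 doubles (true for CPython on common platforms), so the limits are those of the 64-bit format rather than the 32-bit one.

```python
import math

overflow = 1e308 * 10.0        # true exponent would exceed the maximum
neg_overflow = -1e308 * 10.0   # same, with a negative sign
underflow = 1e-200 * 1e-200    # true exponent would be below the minimum

print(overflow, neg_overflow, underflow)  # inf -inf 0.0
print(math.isinf(overflow), underflow == 0.0)
```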
FLOATING-POINT ARITHMETIC
OPERATIONS
FLOATING-POINT ARITHMETIC:
EXAMPLES
FLOATING-POINT ARITHMETIC
Addition and Subtraction:
More complex than Multiplication and Division
Due to the need for alignment
Four basic phases of the addition/subtraction algorithm
1. Check for Zeros
2. Align the Significands
3. Add or Subtract the Significands
4. Normalize the result
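The four phases can be sketched on a simplified model. The representation here is an assumption made for illustration, not a hardware format: each number is a pair (significand, exponent) meaning significand x 2^exponent, with a signed integer significand whose magnitude is normalized to exactly 24 bits. Bits shifted out during alignment are truncated, and exponent overflow/underflow is not checked. All names are hypothetical.

```python
def _shift_right(s: int, n: int) -> int:
    """Shift the magnitude of s right by n bits, keeping the sign (truncation)."""
    return -((-s) >> n) if s < 0 else s >> n

def fp_add(a, b, width: int = 24):
    """Add two (significand, exponent) pairs using the four basic phases."""
    (sa, ea), (sb, eb) = a, b
    # Phase 1: check for zeros
    if sa == 0:
        return b
    if sb == 0:
        return a
    # Phase 2: align the significands (shift the smaller-exponent one right)
    if ea > eb:
        sb, eb = _shift_right(sb, ea - eb), ea
    elif eb > ea:
        sa, ea = _shift_right(sa, eb - ea), eb
    # Phase 3: add the significands (subtraction = adding a negated significand)
    s, e = sa + sb, ea
    if s == 0:
        return 0, 0
    # Phase 4: normalize the result back to `width` significant bits
    while abs(s) >= 1 << width:          # significand overflow
        s, e = _shift_right(s, 1), e + 1
    while abs(s) < 1 << (width - 1):     # leading zeros
        s, e = s << 1, e - 1
    return s, e

def to_float(x) -> float:
    """Value of a (significand, exponent) pair."""
    return x[0] * 2.0 ** x[1]

one_and_half = (0xC00000, -23)   # 1.5
quarter = (0x800000, -25)        # 0.25
print(to_float(fp_add(one_and_half, quarter)))  # 1.75
```

Alignment is what makes addition and subtraction more involved than multiplication and division: the exponents must match before the significands can be combined.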