Carnegie Mellon
Floating Point Numbers
N. Navet - Computing Infrastructure 1 / Lecture 2
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
IEEE Floating Point standard
IEEE 754 Standard
▪ Established in 1985 as a uniform standard for floating point arithmetic
▪ Before that, many proprietary formats existed, leading to non-portable
applications
▪ In the mid-1970s, Intel hired prof. Kahan (Berkeley) to devise a floating
point coprocessor (8087) for the 8086 processor → this work was later
re-used in the IEEE standard
▪ Nowadays, IEEE 754 is supported in HW by virtually all CPUs that have a
floating point unit (otherwise it can be implemented in SW)
Driven by numerical concerns
▪ Good standards for rounding, overflow, underflow
▪ Hard to make fast in hardware
▪ Numerical analysts predominated over hardware designers in defining
the standard
Principles of floating point numbers
Basis for supporting (an approximation of) arithmetic with real
numbers
A floating point number is a rational number (i.e., a quotient of two
integers)
Real numbers that cannot be represented as floating points are
approximated, leading to numerical imprecision (real numbers
form a continuum, floating points do not → rounding to the
nearest expressible value is needed)
A floating point is a number of the form significand · base^exponent,
where significand, exponent and base are all integers, e.g. in base
10, 5.367 = 5367 · 10^-3
"Floating point" because the point can "float": it can be placed
anywhere relative to the significant digits of the number (depending
on the value of the exponent), e.g. 536.7 · 10^-2 = 5367 · 10^-3
Principles of floating point numbers
As there is more than one way to represent a number, we need
a single standardized representation
Familiar base-10 (normalized) scientific notation used in
physics, math and engineering: n = f · 10^e where
▪ f is the fraction (aka mantissa or significand) with one non-zero decimal
digit before the decimal point
▪ e is a positive or negative integer called the exponent
[Figure: examples written in normalized scientific notation]
Range is determined by the number of digits of the exponent
Precision by the number of digits in the fraction
In computers, the base is 2; the floating-point representation
encodes rational numbers of the form V = x × 2^y
Tiny Floating Point Example #1
Base 10
Signed 3-digit significand that can be either 0, or (0.1 ≤ f < 1) or (−1 < f ≤ −0.1)
Signed 2-digit exponent (what are the min and max exponents?)
Range over nearly 200 orders of magnitude: −0.999 · 10^99 to +0.999 · 10^99
The separation between expressible numbers is not constant: e.g., the
separation between +0.998 × 10^99 and +0.999 × 10^99 is much larger than the
separation between +0.998 × 10^0 and +0.999 × 10^0
But the relative error introduced by rounding is about the same (i.e., the
separation between a number and its successor, expressed as a percentage of
that number, is approximately the same over the whole range)
How to increase the accuracy of representation ?
How to increase the range of expressible numbers ?
Course reading – “Structured Computer Organization”:
Appendix B: floating point numbers
Example #1: the real line is divided up into seven regions
1. Large negative numbers, less than −0.999 × 10^99
2. Negative numbers between −0.999 × 10^99 and −0.100 × 10^−99
3. Small negative numbers, between −0.100 × 10^−99 and zero
4. Zero
5. Small positive numbers, between 0 and 0.100 × 10^−99
6. Positive numbers between 0.100 × 10^−99 and 0.999 × 10^99
7. Large positive numbers, greater than 0.999 × 10^99
It is not possible to express any number in regions 1, 3, 5 and 7:
e.g., 10^60 × 10^60 = 10^120 → positive overflow
[Number line: expressible values span −0.999 · 10^99 to −0.1 · 10^−99 and 0.1 · 10^−99 to 0.999 · 10^99]
Nb: underflow is less serious than overflow since 0
is usually a satisfactory approximation in regions 3 and 5
Normalized numbers and hidden bits
The "normalized" format represents all numbers except those
close to 0, which are represented with the "denormalized" format
(seen later in the lecture)
312.25 can be represented with the integer 31225 as the
significand and 10^-2 as the power term, but in many other ways too
Its normalized scientific notation in base 10 is 3.1225 × 10^2, that
is, with one non-zero decimal digit before the decimal point
Same principle for the normalized form in base 2: 1.xxx × 2^y
As the most significant bit is always a 1, it is not necessary to
store it → this is the hidden bit
IEEE 754 double precision: the size of the significand is 52 bits not
including the hidden bit, 53 bits with it
Floating Point Representation – normalized numbers
The IEEE 754 standard represents FP numbers having the following form:
(–1)^s · M · 2^E
▪ Sign bit s determines whether the number is negative or positive
▪ Significand M is (except in special cases) a fractional binary number in the range
[1.0, 2.0) (the interval starts at 1 because of the leading 1: 1.xxxx…x × 2^E)
▪ Exponent E weights the value by a power of two
How to express 0?
Encoding of a FP number is done over 3 fields:
▪ the most significant bit is the sign bit s
▪ the exp field encodes E (but is not equal to E)
▪ the frac field encodes M (but is not equal to M)
s exp frac
Precision options
As a programmer, you can expect a precision of 7 decimal digits in
single precision and 15 in double precision. Except for good reasons,
you should always use double precision numbers.
Single precision: 32 bits
s exp frac
1 8-bits 23-bits
Double precision: 64 bits
s exp frac
1 11-bits 52-bits
Extended precision: 80 bits (not supported by all CPUs and
compilers) – out of the scope of the course
s exp frac
1 15-bits 64-bits
3 types of floating point encodings
Determined by the value of the exponent – here we consider
single precision numbers, that is with an exponent of 8 bits
Denormalized numbers are a "sub-format" within the IEEE 754 floating-point format
Not A Number (NaN): a value that is undefined
examples: 0/0, √−5
Visualization: Floating Point Encodings
[Number line: −∞ | −Normalized | −Denorm | −0 +0 | +Denorm | +Normalized | +∞;
values beyond the normalized ranges cannot be represented]
Denormalized encoding is for 0 and
numbers that are very close to 0
Case 1: "Normalized" Values v = (–1)^s · M · 2^E
Most common case: when the bit pattern of exp is ≠ 000…0 and ≠
111…1 (i.e., ≠ 255 for single precision and ≠ 2047 for double)
Exponent coded as a biased value: E = Exp – Bias
▪ Exp: unsigned value of the exp field of the floating point number
▪ Bias = 2^(k−1) − 1, where k is the number of exponent bits
▪ Single precision: bias = 127 (Exp: 1…254, E: −126…127)
▪ Double precision: bias = 1023 (Exp: 1…2046, E: −1022…1023)
Significand coded with implied leading 1: M = 1.xxx…x₂
▪ xxx…x: bits of the frac field
▪ Minimum when frac = 000…0 (M = 1.0)
▪ Maximum when frac = 111…1 (M = 2.0 – ε)
▪ Get extra leading bit for "free" (hidden bit)
Beyond the lecture's scope: thanks to the bias, the exp field can be
encoded as unsigned (as it is positive) and not in two's complement,
which allows for faster comparison of FP numbers
Normalized Encoding : example
v = (–1)^s · M · 2^E
in single precision, E = Exp – Bias
Value: float F = 15213.0;
▪ 15213₁₀ = 11101101101101.0₂ × 2^0
= 1.1101101101101₂ × 2^13
5 steps: a) (unsigned) binary form b) normalized form c) encode significand
d) encode exponent e) sign bit
Significand
M = 1.1101101101101₂
frac field (23 bits) = 11011011011010000000000₂
Exponent (single precision)
E = 13
Bias = 127
Exp field (8 bits) = 140 = 10001100₂
Result (bit 31 … bit 0):
0 10001100 11011011011010000000000
s exp frac
v = (–1)^s · M · 2^E
Example #2 E = Exp – Bias
http://www.binaryconvert.com/convert_float.html
1) Write 4.0 as v = (–1)^s · M · 2^E: 4 = (–1)^0 · 1.0 · 2^2
2) Encode 4.0 as a floating point number (single precision)
Example #2 (answer)
4 = (–1)^0 · 1.0 · 2^2, encoded over 32 bits = 4 bytes
[Answer figure: the resulting single-precision bit pattern, bit 31 … bit 0]
v = (–1)^s · M · 2^E
Example #3 E = Exp – Bias
Encode 4.75 as a floating point number
in single precision format
v = (–1)^s · M · 2^E
Example #4 E = Exp – Bias
Encode 1.0 in IEEE 754
single precision format
1 = (–1)^0 · (1+0) · 2^0
How would 1.0 be encoded without the BIAS?
Case 2: Denormalized numbers v = (–1)^s · M · 2^E
E = 1 – Bias
exp = 000…0 indicates a denormalized number
Purpose: represent 0 and numbers very close to 0 that normalized
numbers cannot represent
Exponent value is constant : E = 1 – Bias (i.e., E = -126 in single
precision or E=-1022 in double precision)
Significand coded with implied leading 0: M = 0.xxx…x₂
▪ xxx…x: bits of frac
Cases
▪ exp = 000…0, frac = 000…0
▪ Represents the value zero
▪ Two distinct values: +0 and –0 (all bits are zero, except possibly the sign bit)
▪ exp = 000…0, frac ≠ 000…0
▪ Numbers are equi-spaced in that range as the exponent is constant
Why can't 0 be represented with the normalized encoding?
v = (–1)^s · M · 2^E
Example #5 E = −126
a) Encode the smallest strictly positive denormalized number in
single precision floating point b) Express this value as a power of 2
= (–1)^0 · 2^−23 · 2^−126 = 2^−149
v = (–1)^s · M · 2^E
Example #6 E = −126
Single precision floating point: what is the encoding of the largest
positive denormalized number in binary?
= (–1)^0 · (2^−1 + 2^−2 + … + 2^−22 + 2^−23) · 2^−126
= 2^−126 · (1 − 2^−23)
Case 3: Special Values
Condition: exp = 111…1
Case: exp = 111…1, frac = 000…0
▪ Represents the value ∞ (infinity)
▪ Can be used as an operand and behaves according to the usual
mathematical rules for ∞
▪ As expected, both positive and negative ∞ exist
▪ E.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞
Case: exp = 111…1, frac ≠ 000…0
▪ Not-a-Number (NaN)
▪ Represents cases where no numeric value can be determined
▪ E.g., sqrt(–1), ∞ − ∞, ∞ × 0
IEEE 754: a recap
[Recap table: exp ≠ 000…0 and ≠ 111…1 → normalized; exp = 000…0 → denormalized;
exp = 111…1 → ∞ (frac = 0) or NaN (frac ≠ 0)]
Floating Point Zero is the same as Integer Zero
▪ All bits = 0
Supplementary material
Outside the scope of the course
Tiny Floating Point Example #2
s exp frac
1 4-bits 3-bits
8-bit Floating Point Representation
▪ the sign bit is in the most significant bit
▪ the next four bits are the exponent, with a bias of 7
▪ the last three bits are the frac
v = (–1)^s · M · 2^E
Normalized: E = Exp – Bias
Denormalized: E = 1 – Bias
Same general form as the IEEE format
▪ normalized, denormalized
▪ representation of 0, NaN, infinity
a) What is the smallest strictly positive normalized number, and what
is the largest?
b) List all positive denormalized numbers
v = (–1)^s · M · 2^E
Range (Positive Only)   Normalized: E = Exp – Bias
                        Denormalized: E = 1 – Bias
s exp frac   E   Value
Denormalized numbers:
0 0000 001  -6  1/8 * 1/64 = 1/512   (closest to zero)
0 0000 010  -6  2/8 * 1/64 = 2/512
…
0 0000 110  -6  6/8 * 1/64 = 6/512
0 0000 111  -6  7/8 * 1/64 = 7/512   (largest denorm)
Normalized numbers:
0 0001 000  -6  8/8 * 1/64 = 8/512   (smallest norm)
0 0001 001  -6  9/8 * 1/64 = 9/512
…
0 0110 110  -1  14/8 * 1/2 = 14/16
0 0110 111  -1  15/8 * 1/2 = 15/16   (closest to 1 below)
0 0111 000   0   8/8 * 1   = 1
0 0111 001   0   9/8 * 1   = 9/8     (closest to 1 above)
0 0111 010   0  10/8 * 1   = 10/8
…
0 1110 110   7  14/8 * 128 = 224
0 1110 111   7  15/8 * 128 = 240     (largest norm)
0 1111 000  n/a  inf
Tiny Floating Point Example #3
6-bit IEEE-like format
▪ e = 3 exponent bits
▪ f = 2 fraction bits
▪ Bias is 2^(3−1) − 1 = 3
s exp frac
1 3-bits 2-bits
Notice how the distribution gets denser toward zero.
[Number line from −15 to +15: Denormalized (8 values near zero), Normalized, Infinity]
Distribution of Values (close-up view)
6-bit IEEE-like format
▪ e = 3 exponent bits
▪ f = 2 fraction bits
▪ Bias is 3
s exp frac
1 3-bits 2-bits
[Number line from −1 to +1: Denormalized, Normalized, Infinity]