Computer Organization (計算機組織)
Arithmetic for Computers – Part 2
Kun-Chih (Jimmy) Chen 陳坤志
[email protected] Institute of Electronics,
National Yang Ming Chiao Tung University
NYCU EE / IEE
Floating Point: Motivation
❖ What can be represented in n bits?
Unsigned: 0 to 2^n − 1
2's Complement: −2^(n−1) to 2^(n−1) − 1
1's Complement: −2^(n−1) + 1 to 2^(n−1) − 1
Excess M: −M to 2^n − M − 1
❖ But, what about ...
❖ very large numbers? 1,987,987,987,987,987,987,987,987,987
❖ very small numbers? 0.0000000000000000000000054088
❖ rationals? 2/3
❖ irrationals? √2
❖ transcendentals? e, π
❖Types float and double in C
P2
Floating Point: Example
❖ Floating Point
❖ A = 31.48
➢ 3 → 3 × 10^1
➢ 1 → 1 × 10^0
➢ 4 → 4 × 10^−1
➢ 8 → 8 × 10^−2
❖ Scientific notation
❖ A = 3.148 × 10^1
➢ 3 → 3 × 10^0 × 10^1
➢ 1 → 1 × 10^−1 × 10^1
➢ 4 → 4 × 10^−2 × 10^1
➢ 8 → 8 × 10^−3 × 10^1
P3
Scientific Notation: Decimal
3.5_ten × 10^−9
(3.5 is the significand, i.e., the fraction or mantissa; 10 is the radix or base; −9 is the exponent; "." is the decimal point)
❖ Normalized form: no leading 0s
(exactly one digit to left of decimal point)
❖ Alternatives to represent 0.0000000035
❖ Normalized: 3.5 × 10^−9
❖ Not normalized: 0.35 × 10^−8, 35.0 × 10^−10
P4
Scientific Notation: Binary
1.001_two × 2^−9
(1.001 is the significand, i.e., the fraction or mantissa; 2 is the radix or base; −9 is the exponent; "." is the binary point)
❖ Computer arithmetic that supports it is called floating point, because
the binary point is not fixed, as it is for integers
❖ Normalized form: no leading 0s
(exactly one digit to left of binary point)
❖ Scientific notation
❖ Normalized: 1.001_two × 2^−9
❖ Not normalized: 0.1001_two × 2^−8, 10.01_two × 2^−10
P5
Floating Point Standard
❖ Defined by IEEE Std 754-1985
❖ Developed in response to divergence of representations
❖ Portability issues for scientific code
❖ Now almost universally adopted
❖ Two representations
❖ Single precision (32-bit)
❖ Double precision (64-bit)
P6
FP Representation
❖ Normal format: ±1.xxxxxxxxxx_two × 2^(yyyy_two)
❖ Want to put it into multiple words: 32 bits for single-precision and 64
bits for double-precision
❖ A simple single-precision representation:
bit 31 | bits 30-23 | bits 22-0
S | Exponent | Fraction
1 bit | 8 bits | 23 bits
S represents sign
Exponent represents y's
Fraction represents x’s
❖ Represents numbers as small as ≈2.0 × 10^−38 and as large as ≈2.0 × 10^38
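To make the field layout concrete, here is a small C sketch we added (not part of the original slides; the helper name fp_fields is ours) that unpacks the three fields of a single-precision value:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: print the sign, exponent, and fraction fields. */
static void fp_fields(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the 32 bits */
    uint32_t s    = bits >> 31;            /* bit 31 */
    uint32_t exp  = (bits >> 23) & 0xFF;   /* bits 30-23 */
    uint32_t frac = bits & 0x7FFFFF;       /* bits 22-0 */
    printf("%g: S=%u Exponent=%u Fraction=0x%06X\n", f, s, exp, frac);
}

int main(void) {
    fp_fields(1.0f);     /* S=0 Exponent=127 Fraction=0x000000 */
    fp_fields(-0.75f);   /* S=1 Exponent=126 Fraction=0x400000 */
    return 0;
}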
P7
Double Precision Representation
❖ Next multiple of word size (64 bits)
Word 1: S (bit 31, 1 bit) | Exponent (bits 30-20, 11 bits) | Fraction (bits 19-0, 20 bits)
Word 2: Fraction, continued (32 bits)
❖ Double precision (vs. single precision)
❖ Represents numbers almost as small as ≈2.0 × 10^−308 and almost as large as ≈2.0 × 10^308
❖ But the primary advantage is greater accuracy, due to the larger fraction
P8
IEEE 754 Standard (1/4)
❖ Described for single precision; double precision (DP) is similar
❖ Sign bit:
1 means negative
0 means positive
❖ Fraction:
❖ To pack more bits, leading 1 implicit for normalized numbers (hidden leading
1 bit)
❖ 1 + 23 bits single, 1 + 52 bits double
❖ always true: 0 ≤ Fraction < 1
(for normalized numbers)
❖ Significand is Fraction with the “1.” restored
❖ Note: 0 has no leading 1, so reserve exponent value 0 just for number 0
P9
IEEE 754 Standard (2/4)
❖ Exponent:
❖ Need to represent positive and negative exponents
❖ Also want to compare FP numbers as if they were integers, to help in
value comparisons
❖ How about using 2's complement to represent?
Ex: 1.0 × 2^−1 versus 1.0 × 2^+1 (1/2 versus 2)
1/2: 0 1111 1111 000 0000 0000 0000 0000 0000
2:   0 0000 0001 000 0000 0000 0000 0000 0000
If we use integer comparison on these two words, we will conclude that 1/2 > 2!!!
P10
IEEE 754 Standard (3/4)
❖ Instead, let notation 0000 0000 be most negative, and 1111 1111
most positive
❖ Called biased notation, where bias is the number subtracted to get
the real number
❖ IEEE 754 uses bias of 127 for single precision:
Subtract 127 from Exponent field to get actual value for exponent
❖ 1023 is bias for double precision
1/2: 0 0111 1110 000 0000 0000 0000 0000 0000 (126 − 127 = −1)
2:   0 1000 0000 000 0000 0000 0000 0000 0000 (128 − 127 = +1)
We can use integer comparison for floating point
comparison.
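A quick C check of this property (our own illustration): with biased exponents, comparing the raw bit patterns of two positive floats as unsigned integers orders them the same way as comparing the values.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float half = 0.5f, two = 2.0f;   /* 1.0 x 2^-1 and 1.0 x 2^+1 */
    uint32_t h, t;
    memcpy(&h, &half, sizeof h);
    memcpy(&t, &two, sizeof t);
    printf("bits(0.5)=0x%08X, bits(2.0)=0x%08X\n", h, t); /* 0x3F000000, 0x40000000 */
    printf("integer comparison says 0.5 < 2.0: %d\n", h < t); /* 1 (true) */
    return 0;
}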
P11
Biased (Excess) Notation
❖ Biased 7 (excess-7): stored value = true value + 7
unsigned  bits   value
0         0000   −7
1         0001   −6
2         0010   −5
3         0011   −4
4         0100   −3
5         0101   −2
6         0110   −1
7         0111   0
8         1000   +1
9         1001   +2
10        1010   +3
11        1011   +4
12        1100   +5
13        1101   +6
14        1110   +7
15        1111   +8
P12
IEEE 754 Standard (4/4)
❖ Summary (single precision):
bit 31 | bits 30-23 | bits 22-0
S | Exponent | Fraction
1 bit | 8 bits | 23 bits
(−1)^S × (1.Fraction) × 2^(Exponent − 127)
❖ Double precision is the same, except with an exponent bias of 1023
P13
Example 1: FP to Decimal
0 0110 1000 101 0101 0100 0011 0100 0010
❖ Sign: 0 => positive
❖ Exponent:
❖ 0110 1000_two = 104_ten
❖ Bias adjustment: 104 − 127 = −23
❖ Fraction:
❖ 1 + 2^−1 + 2^−3 + 2^−5 + 2^−7 + 2^−9 + 2^−14 + 2^−15 + 2^−17 + 2^−22 = 1.0 + 0.666115
❖ Represents: 1.666115_ten × 2^−23 ≈ 1.986 × 10^−7
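The conversion can be replayed in C by installing the bit pattern directly; a sketch we added (0x34554342 is simply the 32 bits above written in hex):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0x34554342;   /* 0 01101000 10101010100001101000010 */
    float f;
    memcpy(&f, &bits, sizeof f);  /* reinterpret the bits as a float */
    printf("%.6e\n", f);          /* prints roughly 1.986146e-07 */
    return 0;
}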
P14
Example 2: Decimal to FP
❖ Number = −0.75
= −0.11_two × 2^0 (scientific notation)
= −1.1_two × 2^−1 (normalized scientific notation)
❖ Sign: negative => 1
❖ Exponent:
❖ Bias adjustment: -1 +127 = 126
❖ 126ten = 0111 1110two
1 0111 1110 100 0000 0000 0000 0000 0000
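Going the other direction in C (again our own check): printing the bits of −0.75 in hex should reproduce the pattern above, 0xBF400000.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -0.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("0x%08X\n", bits);   /* 0xBF400000 = 1 01111110 100...0 */
    return 0;
}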
P15
Example 3: Decimal to FP
❖ A more difficult case: representing 1/3?
= 0.33333…_ten = 0.0101010101…_two × 2^0
= 1.0101010101…_two × 2^−2
❖ Sign: 0
❖ Exponent = −2 + 127 = 125_ten = 0111 1101_two
❖ Fraction = 0101010101…
0 0111 1101 0101 0101 0101 0101 0101 010
P16
Double-Precision Range
❖ Exponents 0000…00 and 1111…11 reserved
❖ Smallest value
❖ Exponent: 00000000001 ⇒ actual exponent = 1 − 1023 = −1022
❖ Fraction: 000…00 ⇒ significand = 1.0
❖ ±1.0 × 2–1022 ≈ ±2.2 × 10–308
❖ Largest value
❖ Exponent: 11111111110 ⇒ actual exponent = 2046 − 1023 = +1023
❖ Fraction: 111…11 ⇒ significand ≈ 2.0
❖ ±2.0 × 2+1023 ≈ ±1.8 × 10+308
P17
Floating-Point Precision
❖ Relative precision
❖ all fraction bits are significant
❖ Single: approx 2^−23
➢ Equivalent to 23 × log10(2) ≈ 23 × 0.3 ≈ 6 decimal digits of precision
❖ Double: approx 2^−52
➢ Equivalent to 52 × log10(2) ≈ 52 × 0.3 ≈ 16 decimal digits of precision
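A rough C illustration we added of the ~6 vs. ~16 digit rule: print 1/3 in both precisions and see where the repeating 3s stop being correct.

#include <stdio.h>

int main(void) {
    float  f = 1.0f / 3.0f;
    double d = 1.0 / 3.0;
    printf("float : %.20f\n", f);   /* correct to about 7 significant digits */
    printf("double: %.20f\n", d);   /* correct to about 16 significant digits */
    return 0;
}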
P18
Zero and Special Numbers
❖ What have we defined so far? (single precision)
Exponent | Fraction | Object
0        | 0        | ???
0        | nonzero  | ???
1-254    | anything | ± floating-point
255      | 0        | ???
255      | nonzero  | ???
P19
Representation for 0
❖ Represent 0?
❖ Exponent: all zeroes
❖ Fraction: all zeroes, too
❖ What about sign?
❖ +0: 0 00000000 00000000000000000000000
❖ -0: 1 00000000 00000000000000000000000
❖ Why two zeroes?
❖ Helps in some limit comparisons
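The two zeroes are distinct bit patterns yet compare equal; a small C check we added (signbit is from <math.h>):

#include <stdio.h>
#include <math.h>

int main(void) {
    float pz = 0.0f, nz = -0.0f;
    printf("pz == nz: %d\n", pz == nz);                 /* 1: equal as values */
    printf("sign bits: %d %d\n", signbit(pz) != 0, signbit(nz) != 0); /* 0 1 */
    printf("1/pz = %f, 1/nz = %f\n", 1.0f/pz, 1.0f/nz); /* +inf vs -inf: the limit case */
    return 0;
}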
P20
Special Numbers
❖ What have we defined so far? (single precision)
Exponent | Fraction | Object
0        | 0        | ± 0
0        | nonzero  | ???
1-254    | anything | ± floating-point
255      | 0        | ???
255      | nonzero  | ???
❖ Range:
Smallest normalized: 1.0 × 2^−126 ≈ 1.2 × 10^−38
What if a result is too small? (> 0 but < 1.2 × 10^−38 ⇒ underflow!)
Largest: 1.11…1_two × 2^127 = (2 − 2^−23) × 2^127 ≈ 3.4 × 10^38
What if a result is too large? (> 3.4 × 10^38 ⇒ overflow!)
P21
Range of Single-Precision Floating-Point Numbers
[Number line: overflow region below −1.11…11_two × 2^127 (toward −∞); representable negative numbers from −1.11…11_two × 2^127 up to −1.0 × 2^−126; underflow region around 0; representable positive numbers from +1.0 × 2^−126 up to +1.11…11_two × 2^127; overflow region above (toward +∞)]
P22
Gradual Underflow
❖ Represent denormalized numbers (denorms)
❖ Exponent: all zeroes
❖ Fraction: nonzero
❖ Allows a number to degrade in significance until it becomes 0 (gradual underflow)
❖ The smallest normalized number
➢ 1.0000 0000 0000 0000 0000 000_two × 2^−126
❖ The smallest denormalized number
➢ 0.0000 0000 0000 0000 0000 001_two × 2^−126 = 2^−149
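In C, the smallest positive denormal is the bit pattern 0x00000001; a sketch we added to make both values concrete (FLT_MIN from <float.h> is the smallest normalized single-precision value):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <float.h>

int main(void) {
    uint32_t one = 0x00000001;     /* exponent field 0, fraction 000...001 */
    float denorm;
    memcpy(&denorm, &one, sizeof denorm);
    printf("smallest normalized  : %e\n", (double)FLT_MIN); /* ~1.18e-38 = 2^-126 */
    printf("smallest denormalized: %e\n", (double)denorm);  /* ~1.40e-45 = 2^-149 */
    return 0;
}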
P23
Special Numbers
❖ What have we defined so far? (single precision)
Exponent | Fraction | Object
0        | 0        | ± 0
0        | nonzero  | denorm
1-254    | anything | ± floating-point
255      | 0        | ???
255      | nonzero  | ???
P24
Representation for +/- Infinity
❖ In FP, divide by zero should produce +/- infinity, not overflow
❖ Why?
❖ OK to do further computations with infinity
Ex: X/0 > Y may be a valid comparison
❖ IEEE 754 represents +/- infinity
❖ Most positive exponent reserved for infinity
❖ Fractions all zeroes
S 1111 1111 0000 0000 0000 0000 0000 000
P25
Special Numbers (cont’d)
❖ What have we defined so far? (single-precision)
Exponent | Fraction | Object
0        | 0        | ± 0
0        | nonzero  | denorm
1-254    | anything | ± floating-point
255      | 0        | ± infinity
255      | nonzero  | ???
P26
Representation for Not a Number
❖ What do I get if I calculate sqrt(-4.0) or 0/0?
❖ If infinity is not an error, these should not be either
❖ They are called Not a Number (NaN)
❖ Exponent = 255, fraction nonzero
❖ Why is this useful?
❖ The hope is that NaNs help with debugging
❖ They contaminate: op(NaN, X) = NaN
❖ It is OK to compute with NaN as long as the result is not used
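A quick C demonstration we added of how infinities and NaNs behave (isinf/isnan are from <math.h>; IEEE behavior assumed):

#include <stdio.h>
#include <math.h>

int main(void) {
    volatile float zero = 0.0f;   /* volatile keeps the compiler from folding x/0 */
    float inf  = 1.0f / zero;     /* divide by zero -> +infinity */
    float qnan = zero / zero;     /* 0/0 -> NaN */
    printf("inf : %f (isinf=%d)\n", inf, isinf(inf) != 0);
    printf("nan : %f (isnan=%d)\n", qnan, isnan(qnan) != 0);
    printf("NaN contaminates: %f\n", qnan + 1.0f);  /* still NaN */
    printf("NaN == NaN: %d\n", qnan == qnan);       /* 0: NaN compares unequal */
    return 0;
}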
P27
Special Numbers (cont’d)
❖ What have we defined so far? (single-precision)
Exponent | Fraction | Object
0        | 0        | ± 0
0        | nonzero  | denorm
1-254    | anything | ± floating-point
255      | 0        | ± infinity
255      | nonzero  | NaN
P28
Decimal Addition
❖ A = 3.71345 × 10^2, B = 1.32 × 10^−4; perform A + B
   3.71345    × 10^2
+  0.00000132 × 10^2
=  3.71345132 × 10^2
(right-shift B's significand by 2 − (−4) = 6 digit positions)
❖ A = 3.71345 × 10^2
❖ B = 1.32 × 10^−4 = 0.00000132 × 10^2
❖ A + B = (3.71345 + 0.00000132) × 10^2
P29
Floating-Point Addition
Basic addition algorithm:
(1) Align binary points: compute Ye − Xe
❖ right-shift the smaller number, say Xm, that many positions to form Xm × 2^(Xe−Ye)
(2) Add mantissas: compute Xm × 2^(Xe−Ye) + Ym
(3) Normalize & check for over/underflow if necessary:
❖ left-shift the result and decrement the result exponent, or
❖ right-shift the result and increment the result exponent
❖ check for overflow or underflow during the shift
(4) Round the mantissa and renormalize if necessary
(a C sketch of these steps follows below)
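Below is a minimal C sketch we added of steps 1-3 for two positive, normalized single-precision inputs; step 4 is reduced to truncation, and all special cases (signs, zeros, denorms, overflow) are ignored.

#include <stdint.h>
#include <string.h>

/* A sketch, not a full FP adder: positive normalized inputs only. */
static float fp_add_sketch(float a, float b) {
    uint32_t xa, xb;
    memcpy(&xa, &a, sizeof xa);
    memcpy(&xb, &b, sizeof xb);
    int ea = (xa >> 23) & 0xFF, eb = (xb >> 23) & 0xFF;
    uint64_t ma = (xa & 0x7FFFFF) | 0x800000;   /* restore the hidden 1 */
    uint64_t mb = (xb & 0x7FFFFF) | 0x800000;
    /* Step 1: align binary points - right-shift the smaller number */
    int d = ea - eb;
    if (d < 0) { ma >>= (-d > 24 ? 24 : -d); ea = eb; }
    else       { mb >>= ( d > 24 ? 24 :  d); }
    /* Step 2: add the mantissas */
    uint64_t m = ma + mb;
    /* Step 3: normalize (a sum of positives can overflow by at most 1 bit) */
    if (m & 0x1000000) { m >>= 1; ea++; }
    /* Step 4 would round here; this sketch truncates instead */
    uint32_t r = ((uint32_t)ea << 23) | ((uint32_t)m & 0x7FFFFF);
    float out;
    memcpy(&out, &r, sizeof out);
    return out;
}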
P30
Floating-Point Addition Example
❖ Now consider a 4-digit binary example
❖ 1.000_two × 2^−1 + (−1.110_two × 2^−2) (i.e., 0.5 + (−0.4375))
❖ 1. Align binary points
❖ Shift the number with the smaller exponent
❖ 1.000_two × 2^−1 + (−0.111_two × 2^−1)
❖ 2. Add mantissas
❖ 1.000_two × 2^−1 + (−0.111_two × 2^−1) = 0.001_two × 2^−1
❖ 3. Normalize result & check for over/underflow
❖ 1.000_two × 2^−4, with no over/underflow
❖ 4. Round and renormalize if necessary
❖ 1.000_two × 2^−4 (no change) = 0.0625
P31
Floating-Point Addition
P32
[FP adder datapath diagram: a small ALU compares the two exponents to form their difference (step 1), which controls right-shifting of the smaller significand; the big ALU adds the aligned significands (step 2); shift-left/right and increment/decrement logic normalizes the result (step 3); rounding hardware rounds it (step 4), producing the final sign, exponent, and significand]
P33
FP Adder Hardware
❖ Much more complex than integer adder
❖ Doing it in one clock cycle would take too long
❖ Much longer than integer operations
❖ Slower clock would penalize all instructions
❖ FP adder usually takes several cycles
❖ Can be pipelined
P34
Decimal Multiplication
❖ A = 3.12 × 10^2, B = 1.5 × 10^−4; perform A × B
   3.12 × 10^2
×  1.5  × 10^−4
=  4.68 × 10^−2
❖ A = 3.12 × 10^2
❖ B = 1.5 × 10^−4
❖ A × B = (3.12 × 1.5) × 10^(2 + (−4))
P35
Floating-Point Multiplication
Basic multiplication algorithm
(1) Add the exponents of the operands to get the exponent of the product;
the doubly biased exponent must be corrected:
Xe = 7, Ye = −3; in excess-8: Xe = 1111 = 15 = 7 + 8, Ye = 0101 = 5 = −3 + 8
Sum = 10100 = 20 = 4 + 8 + 8 (the true exponent 4 plus the bias twice)
⇒ need an extra subtraction step of the bias amount
(2) Multiply the operand mantissas
(3) Normalize the product & check for overflow or underflow during the shift
(4) Round the mantissa and renormalize if necessary
(5) Set the sign of the product
(a C sketch of these steps follows below)
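A matching C sketch we added of steps 1-3 and 5 for normalized single-precision inputs (rounding again replaced by truncation); note the single bias subtraction when the exponents are added:

#include <stdint.h>
#include <string.h>

/* A sketch, not a full FP multiplier: normalized inputs, no specials. */
static float fp_mul_sketch(float a, float b) {
    uint32_t xa, xb;
    memcpy(&xa, &a, sizeof xa);
    memcpy(&xb, &b, sizeof xb);
    uint32_t sign = (xa ^ xb) & 0x80000000u;                 /* step 5: sign */
    int e = ((xa >> 23) & 0xFF) + ((xb >> 23) & 0xFF) - 127; /* step 1: fix double bias */
    uint64_t ma = (xa & 0x7FFFFF) | 0x800000;
    uint64_t mb = (xb & 0x7FFFFF) | 0x800000;
    uint64_t m = (ma * mb) >> 23;          /* step 2: 24x24-bit product, keep top bits */
    if (m & 0x1000000) { m >>= 1; e++; }   /* step 3: normalize if product >= 2.0 */
    /* step 4 would round here; this sketch truncates */
    uint32_t r = sign | ((uint32_t)e << 23) | ((uint32_t)m & 0x7FFFFF);
    float out;
    memcpy(&out, &r, sizeof out);
    return out;
}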
P36
Floating-Point Multiplication
P37
Floating-Point Multiplication Example
❖ Now consider a 4-digit binary example
❖ 1.000_two × 2^−1 × (−1.110_two × 2^−2) (i.e., 0.5 × −0.4375)
1. Add exponents
❖ Unbiased: −1 + (−2) = −3
❖ Biased: (−1 + 127) + (−2 + 127) = −3 + 254; subtracting the extra bias: −3 + 254 − 127 = −3 + 127
2. Multiply operand mantissas
❖ 1.000_two × 1.110_two = 1.110000_two ⇒ 1.110_two × 2^−3
3. Normalize result & check for over/underflow
❖ 1.110_two × 2^−3 (no change), with no over/underflow
4. Round and renormalize if necessary
❖ 1.110_two × 2^−3 (no change)
5. Determine sign:
❖ −1.110_two × 2^−3 = −0.21875
P38
FP Arithmetic Hardware
❖ FP multiplier is of similar complexity to FP adder
❖ But uses a multiplier for significands instead of an adder
❖ FP arithmetic hardware usually does
❖ Addition, subtraction, multiplication, division, reciprocal, square root
❖ FP ↔ integer conversion
❖ Operations usually take several cycles
❖ Can be pipelined
P39
P40
FP Instructions in RISC-V
❖ Separate FP registers: f0, …, f31
❖ Each is double-precision wide
❖ Single-precision values are stored in the lower 32 bits
❖ FP instructions operate only on FP registers
❖ Programs generally don’t do integer ops on FP data, or vice versa
❖ More registers with minimal code-size impact
❖ FP load and store instructions
❖ flw, fld
❖ fsw, fsd
P41
FP Instructions in RISC-V
❖ Single-precision arithmetic
❖ fadd.s, fsub.s, fmul.s, fdiv.s, fsqrt.s
➢ e.g., fadd.s f2, f4, f6
❖ Double-precision arithmetic
❖ fadd.d, fsub.d, fmul.d, fdiv.d, fsqrt.d
➢ e.g., fadd.d f2, f4, f6
❖ Single- and double-precision comparison
❖ feq.s, flt.s, fle.s
❖ feq.d, flt.d, fle.d
❖ Result is 0 or 1 in integer destination register
➢ Use beq, bne to branch on comparison result
❖ (Unlike ISAs with FP condition codes and a "branch on FP condition" instruction such as ARM's B.cond, RISC-V branches simply test the integer comparison result)
P42
FP Instructions in RISC-V
P43
FP Example: °F to °C
❖ C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
❖ fahr in f10, result in f10, literals in global memory space
❖ Compiled RISC-V code:
f2c:
flw f0,const5(x3) // f0 = 5.0f
flw f1,const9(x3) // f1 = 9.0f
fdiv.s f0, f0, f1 // f0 = 5.0f / 9.0f
flw f1,const32(x3) // f1 = 32.0f
fsub.s f10,f10,f1 // f10 = fahr - 32.0f
fmul.s f10,f0,f10 // f10 = (5.0f/9.0f) * (fahr - 32.0f)
jalr x0,0(x1) // return
We assume the compiler places the three floating-point constants in memory within easy reach of register x3.
P44
Accurate Arithmetic
❖ IEEE Std 754 specifies additional rounding control
❖ Extra bits of precision (guard, round, sticky)
❖ Choice of rounding modes
❖ Allows programmer to fine-tune numerical behavior of a computation
❖ Not all FP units implement all options
❖ Most programming languages and FP libraries just use defaults
❖ Trade-off between hardware complexity, performance, and market
requirements
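As a small illustration we added of rounding-mode control through the standard C <fenv.h> interface (support varies by platform and compiler flags):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON     /* we change the FP environment at run time */

int main(void) {
    volatile float one = 1.0f, three = 3.0f;  /* volatile blocks constant folding */
    fesetround(FE_DOWNWARD);
    float lo = one / three;      /* quotient rounded toward -infinity */
    fesetround(FE_UPWARD);
    float hi = one / three;      /* quotient rounded toward +infinity */
    fesetround(FE_TONEAREST);    /* restore the default mode */
    printf("down: %.9f\nup:   %.9f\n", lo, hi);  /* differ in the last ulp */
    return 0;
}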
P45
Subword Parallelism
❖ Graphics and audio applications can take advantage of performing
simultaneous operations on short vectors
❖ Example: 128-bit adder:
➢ Sixteen 8-bit adds
➢ Eight 16-bit adds
➢ Four 32-bit adds
❖ Also called data-level parallelism, vector parallelism, or Single
Instruction, Multiple Data (SIMD)
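The same idea scaled down to one 32-bit word, sketched in C (a classic "SWAR" trick; our own example, not from the slides): a single 32-bit add performs four independent 8-bit adds by keeping carries from crossing lane boundaries.

#include <stdint.h>

/* Four lane-wise 8-bit adds in one 32-bit add: mask off each lane's MSB so
   carries cannot ripple into the next lane, then patch the MSBs back in. */
static uint32_t add4x8(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* low 7 bits per lane */
    return low ^ ((a ^ b) & 0x80808080u);  /* lane MSB = a7 ^ b7 ^ carry into bit 7 */
}
/* Example: add4x8(0x01FF0203, 0x01010101) == 0x02000304
   (each byte adds independently; the 0xFF lane wraps modulo 256). */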
P46
Final 64-bit RISC-V ALU
ALUop | Function
0000 | and
0001 | or
0010 | add
0110 | subtract
0111 | set-on-less-than
1100 | nor
P47
ALU Control and Function
[1-bit ALU cell diagram: inputs a and b pass through Ainvert/Binvert muxes; a result mux selects among AND, OR, the adder output (with CarryIn and CarryOut), and the slt input; the ALUop control bits drive these muxes]
ALU Control (ALUop) | Function
0000 | and
0001 | or
0010 | add
0110 | subtract
0111 | set-on-less-than
1100 | nor
P48
Ripple Carry Adder
❖ Carry Ripple from lower-bit to the higher-bit
Example: 0010 1010 + 0001 0101 with Cin = 1
carries: 00111111 (Cin = 1)
           00101010
         + 00010101
         = 01000000
Each bit's carry-out becomes the next higher bit's carry-in.
❖ Ripple computation dominates the run time
❖ Higher-bit ALU must wait for carry from lower-bit ALU
❖ Run time complexity: O(n)
P49
Problems with Ripple Carry Adder
❖ Carry bit may have to propagate from LSB to MSB => worst case
delay: N-stage delay
[Diagram: four 1-bit ALUs chained; A0/B0 with CarryIn0 produce Result0 and CarryOut0, which feeds CarryIn1, and so on up to Result3 and CarryOut3]
Design trick: look for parallelism and throw hardware at it
P50
Remove the Dependency
❖ Ripple carry adder
[8-bit ripple-carry adder diagram: full adders for bits a0/b0 through a7/b7, with each Cout feeding the next Cin, from Cin at bit 0 to Cout at bit 7]
❖ Carry lookahead adder
❖ No carry bit propagation from LSB to MSB
[8-bit carry-lookahead adder diagram: a carry-computation circuit derives every carry directly from all the ai, bi inputs and Cin, so no carry ripples through the adder chain]
P51
4-bit Carry-Lookahead Adder (CLA)
❖ A ripple-carry adder takes a long time to determine the carry bits
❖ A carry-lookahead adder (CLA) improves speed by reducing the time required to determine the carry bits
[4-bit CLA diagram: four 1-bit full adders (inputs A0/B0 … A3/B3, outputs S0 … S3) emit per-bit P and G signals; a 4-bit carry-lookahead logic (CLL) block computes C1-C3 and C4 from P0-P3, G0-G3, and C0, and also produces group PG and GG outputs]
P52
Carry-Lookahead Adder
Full adder: S = A ⊕ B ⊕ Cin; Cout = (A ⊕ B) · Cin + A · B
❖ Ci+1 = (Ai · Bi) + (Ai ⊕ Bi) · Ci = Gi + Pi · Ci
❖ Generate: Gi = Ai · Bi
❖ Propagate: Pi = Ai ⊕ Bi
❖ C1 = G0 + P0 · C0
C2 = G1 + P1 · C1 = G1 + P1 · (G0 + P0 · C0) = G1 + P1 · G0 + P1 · P0 · C0
C3 = G2 + P2 · G1 + P2 · P1 · G0+ P2 · P1 · P0 · C0
C4 = G3 + P3 · G2 + P3 · P2 · G1+ P3 · P2 · P1 · G0 + P3 · P2 · P1 · P0 · C0
❖ Only need A, B and C0 to calculate the carry bit
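The carry equations can be checked in software; a small C model we added that mirrors the four equations above directly (rather than rippling):

/* 4-bit CLA model: all carries computed from g, p, and c0, as on this slide. */
static unsigned cla4(unsigned a, unsigned b, unsigned c0, unsigned *c4out) {
    unsigned g = a & b;   /* Gi = Ai . Bi   (generate)  */
    unsigned p = a ^ b;   /* Pi = Ai xor Bi (propagate) */
    unsigned g0 = g & 1, g1 = (g >> 1) & 1, g2 = (g >> 2) & 1, g3 = (g >> 3) & 1;
    unsigned p0 = p & 1, p1 = (p >> 1) & 1, p2 = (p >> 2) & 1, p3 = (p >> 3) & 1;
    unsigned c1 = g0 | (p0 & c0);
    unsigned c2 = g1 | (p1 & g0) | (p1 & p0 & c0);
    unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c0);
    unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
                     | (p3 & p2 & p1 & p0 & c0);
    *c4out = c4;
    unsigned carries = (c3 << 3) | (c2 << 2) | (c1 << 1) | c0;
    return (p ^ carries) & 0xF;   /* Si = Ai xor Bi xor Ci */
}
/* e.g., cla4(0xF, 0x1, 0, &c) returns 0x0 with c == 1 (15 + 1 = 16). */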
P53
16-bit CLA
[Two-level 16-bit CLA diagram: four first-tier 4-bit CLL blocks take per-bit g, p signals and produce group P, G signals; a second-tier CLL combines the group P, G's with C0]
• As before, the per-bit p's and g's are generated in parallel in 1 gate delay
• Without an input carry, the first-tier CLLs cannot generate the C's; instead they generate P, G (group propagate and group generate) in 2 gate delays
P ⇒ the group will propagate the input carry: P = p0 · p1 · p2 · p3
G ⇒ the group will generate an output carry: G = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0
• The second-tier CLL takes the P, G's from the first-tier CLLs, plus C0, and generates the "seed C's" for the first-tier CLLs in 2 gate delays (note that the logic generating seed C's from P, G's is exactly the same as generating C's from p, g's!)
• With the seed C's as inputs, the first-tier CLLs use them and the p, g's to generate all C's in 2 gate delays
• With all C's in place, the S's are calculated in 3 gate delays, due to the XOR gates
P54
Pi, Gi Generation in a 16-bit CLA
❖ Propagate (P) → 1 gate delay
❖ P0 = p3 · p2 · p1 · p0
❖ P1 = p7 · p6 · p5 · p4
❖ P2 = p11 · p10 · p9 · p8
❖ P3 = p15 · p14 · p13 · p12
❖ Generate (G) → 2 gate delays
❖ G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
❖ G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
❖ G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
❖ G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)
❖ Carry (C) → 2 gate delays
❖ C1 = G0 + c0 · P0
❖ C2 = G1 + G0 · P1 + c0 · P0 · P1
❖ C3 = G2 + G1 · P2 + G0 · P1 · P2 + c0 · P0 · P1 · P2
❖ C4 = G3 + G2 · P3 + G1 · P2 · P3 + G0 · P1 · P2 · P3 + c0 · P0 · P1 · P2 · P3
Therefore, 1 + 2 + 2 + 3 = 8 gate delays in total to finish the whole 16-bit addition!!
P55
16-bit Carry-Lookahead Adder
❖ A 16-bit carry-lookahead adder is composed of four 4-bit carry-lookahead adders
P56
Who Cares About FP Accuracy?
❖ Important for scientific code
❖ But for everyday consumer use?
➢"My bank balance is out by 0.0002¢!"
❖ The Intel Pentium FDIV (floating-point division) bug, 1994
❖ Recall cost: US$475M
❖ The market expects accuracy
❖ See Colwell, The Pentium Chronicles
[Photo: a 66 MHz Intel Pentium]
P57