9/18/2014
Advanced Computer Architecture
Dr. Umer Farooq
[email protected]
Lecture overview
Previous lecture
Hardware renaissance era
Performance improvement in processors
Computer classes
Instruction Set Architecture overview
Today's lecture
ISA overview (contd.)
Technology, power and energy, cost trends
IC dependability
Amdahl's law
Processor performance equations
ISA: Logical operations
Logical ops       C operators   Java operators   MIPS instr
Shift Left        <<            <<               sll
Shift Right       >>            >>>              srl
Bit-by-bit AND    &             &                and, andi
Bit-by-bit OR     |             |                or, ori
Bit-by-bit NOT    ~             ~                nor
ISA: Logical operations
sll $t2, $s0, 4
$s0 = 0000 0000 0000 0000 0000 0000 0000 1001
$t2 = 0000 0000 0000 0000 0000 0000 1001 0000
and $t0, $t1, $t2
$t2 = 0000 0000 0000 0000 0000 1101 0000 0000
$t1 = 0000 0000 0000 0000 0011 1100 0000 0000
$t0 = 0000 0000 0000 0000 0000 1100 0000 0000
or $t0, $t1, $t2
$t0 = 0000 0000 0000 0000 0011 1101 0000 0000
nor $t0, $t1, $t3 (if $t3 is all 0s)
$t0 = 1111 1111 1111 1111 1100 0011 1111 1111
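The bit patterns above can be checked with the C operators from the previous slide. The following is a minimal sketch, not part of the original slides: the helper names (mips_sll, mips_nor, and so on) are hypothetical, and each one simply applies the matching C operator to a 32-bit value. Shift amounts are assumed to be less than 32.

```c
#include <stdint.h>

/* Hypothetical helpers: each mirrors one MIPS logical instruction
   on a 32-bit register value. Shift amounts must be below 32. */
uint32_t mips_sll(uint32_t value, unsigned shamt) { return value << shamt; }
uint32_t mips_srl(uint32_t value, unsigned shamt) { return value >> shamt; }
uint32_t mips_and(uint32_t a, uint32_t b)         { return a & b; }
uint32_t mips_or(uint32_t a, uint32_t b)          { return a | b; }

/* MIPS has no NOT instruction; nor with an all-zero operand gives NOT. */
uint32_t mips_nor(uint32_t a, uint32_t b)         { return ~(a | b); }
```

With $t1 = 0x00003C00 and $t2 = 0x00000D00 (the values on the slide), mips_and gives 0x00000C00, mips_or gives 0x00003D00, and mips_nor($t1, 0) gives 0xFFFFC3FF.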
ISA: Control instructions
Control instructions involve decision making: if a certain condition is satisfied, do one thing; otherwise, do something else.
MIPS assembly language includes two decision-making instructions (conditional branches):
beq register1, register2, L1    # go to L1 if the contents of register1 and register2 are equal
bne register1, register2, L1    # go to L1 if the contents of register1 and register2 are not equal
ISA: Control instructions
Convert to assembly:
if (i == j)
f = g+h;
else
f = g-h;
Variables f through j correspond to $s0 through $s4
Assembly code:
      bne $s3, $s4, Else    # go to Else if i != j
      add $s0, $s1, $s2     # f = g + h
      j   Exit              # skip the else part
Else: sub $s0, $s1, $s2     # f = g - h
Exit:
Example
Convert to assembly:
while (save[i] == k)
i += 1;
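i and k are in $s3 and $s5, and the base of the array save[] is in $s6
Loop: sll  $t1, $s3, 2       # $t1 = i * 4
      add  $t1, $t1, $s6     # $t1 = address of save[i]
      lw   $t0, 0($t1)       # $t0 = save[i]
      bne  $t0, $s5, Exit    # leave the loop if save[i] != k
      addi $s3, $s3, 1       # i = i + 1
      j    Loop              # unconditional branch back to Loop
Exit: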
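The sll/add pair at the top of the loop computes the byte address of save[i]: the index is multiplied by 4 (the word size) and added to the base address. A minimal C sketch of that calculation follows. The function name word_address is hypothetical, and 32-bit addresses with 4-byte array elements are assumed.

```c
#include <stdint.h>

/* Byte address of element i in an array of 4-byte words starting at base.
   Mirrors: sll $t1, $s3, 2  followed by  add $t1, $t1, $s6 */
uint32_t word_address(uint32_t base, uint32_t i)
{
    uint32_t offset = i << 2;   /* shifting left by 2 multiplies i by 4 */
    return base + offset;
}
```

For example, with a base address of 1000, element 3 lives at byte address 1012.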
ISA: More control instructions
Checking for equality is important, but what if we want to check whether one variable is less than or greater than another?
slt  $t0, $s3, $s4    # $t0 = 1 if $s3 < $s4, otherwise $t0 = 0
slti $t0, $s2, 10     # $t0 = 1 if $s2 < 10, otherwise $t0 = 0
Summary of MIPS assembly language instructions
Summary of MIPS machine language instructions
Procedures
Each procedure (function, subroutine) maintains a scratchpad of register values. When another procedure is called (the callee), the new procedure takes over the scratchpad, so values may have to be saved so that we can safely return to the caller.
1. Parameters (arguments) are placed where the callee can see them
2. Control is transferred to the callee
3. Storage resources are acquired for the callee
4. The procedure is executed
5. The result value is placed where the caller can access it
6. Control is returned to the caller
Registers
MIPS follows this convention when allocating its 32 registers for procedure calling:
$a0 - $a3: four argument registers in which to pass parameters
$v0 - $v1: two value registers in which to return values
$ra: one return address register, used to return to the point of origin
Jump and link
A special register (storage that is not part of the register file) maintains the address of the instruction currently being executed. This is the program counter (PC).
A procedure call is executed by invoking the jump-and-link (jal) instruction. The current PC (actually, PC+4) is saved in the register $ra and we jump to the procedure's address (the PC is set to this address accordingly).
jal NewProcedureAddress
Since jal may overwrite a relevant value in $ra, that value must be saved somewhere (in memory?) before invoking the jal instruction.
How do we return control to the caller after completing the callee procedure?
The stack
What if more registers are required than the four parameter registers ($a0 - $a3) and two value registers ($v0 - $v1)?
The register scratchpad for a procedure seems volatile: it seems to disappear every time we switch procedures. A procedure's values are therefore backed up in memory on a stack.
Stack layout (the stack grows from high addresses toward low addresses):
High address
  Proc A's values
  Proc B's values
  Proc C's values
Low address
Call sequence: Proc A calls Proc B, Proc B calls Proc C, then Proc C returns, Proc B returns, and Proc A returns.
Storage management on call/return
A new procedure must create space for all its variables on the stack
Before executing the jal, the caller must save relevant values in $s0-$s7, $a0-$a3, $ra, and temporaries into its own stack space
Arguments are copied into $a0-$a3; the jal is executed
After the callee creates stack space, it updates the value of $sp
Once the callee finishes, it copies the return value into $v0, frees up stack space, and $sp is incremented
On return, the caller may bring its stack values, $ra, and temporaries back into registers
The responsibility for copies between stack and registers may fall upon either the caller or the callee
Example 1
int leaf_example (int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}
If g, h, i, j correspond to parameter registers $a0 - $a3 and f corresponds to $s0, what will the MIPS assembly code be?
Example 1
leaf_example:
    addi $sp, $sp, -12     # make room for 3 registers: $t0, $t1, $s0
    sw   $t1, 8($sp)       # save $t1 for use afterwards
    sw   $t0, 4($sp)       # save $t0 for use afterwards
    sw   $s0, 0($sp)       # save $s0 for use afterwards
    add  $t0, $a0, $a1     # $t0 = g + h
    add  $t1, $a2, $a3     # $t1 = i + j
    sub  $s0, $t0, $t1     # f = $t0 - $t1
    add  $v0, $s0, $zero   # return f
    lw   $s0, 0($sp)       # restore register $s0 for caller
    lw   $t0, 4($sp)       # restore register $t0 for caller
    lw   $t1, 8($sp)       # restore register $t1 for caller
    addi $sp, $sp, 12      # adjust stack to delete three items
    jr   $ra               # jump back to calling routine
Example 2
int fact (int n)
{
if (n < 1) return (1);
else return (n * fact(n-1));
}
What is the MIPS assembly code?
Example 2
fact:
    addi $sp, $sp, -8      # adjust stack for two items
    sw   $ra, 4($sp)       # save the return address
    sw   $a0, 0($sp)       # save the argument n
    slti $t0, $a0, 1       # test for n < 1
    beq  $t0, $zero, L1    # if n >= 1, go to L1
    addi $v0, $zero, 1     # return 1
    addi $sp, $sp, 8       # pop two items off stack
    jr   $ra               # return to after jal
L1: addi $a0, $a0, -1      # n >= 1: argument gets (n-1)
    jal  fact              # call fact with n-1
    lw   $a0, 0($sp)       # return from jal: restore argument n
    lw   $ra, 4($sp)       # restore the return address
    addi $sp, $sp, 8       # adjust stack pointer to pop 2 items
    mul  $v0, $a0, $v0     # return n * fact(n-1)
    jr   $ra               # return to the caller
Example
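[Figure: stack contents during the recursive call fact(4). Each call pushes its return address (ra1 through ra5) and its argument (a0 = 4, 3, 2, 1, 0), moving SP down. As the calls return, SP moves back up and $v0 takes the values 1, 1, 2, 6, 24.]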
Memory organization
The space allocated on the stack by a procedure is termed the activation record (it includes saved values and data local to the procedure)
The frame pointer ($fp) points to the start of the record and the stack pointer ($sp) points to the end
Variable addresses are specified relative to $fp, because $sp may change during the execution of the procedure
Memory organization
In addition to variables local to a procedure, space is needed for static variables and for dynamic data structures
The stack starts at the top of memory and grows toward the bottom
$gp points to the area in memory that holds global variables
Dynamically allocated storage (obtained with malloc()) is placed on the heap
MIPS registers
Dealing with characters
Instructions are also provided to deal with byte-sized and half-word quantities: lb (load byte), sb (store byte), lh (load half-word), sh (store half-word)
lb $t0, 0($sp)    # read a byte from the source and store it in $t0
sb $t0, 0($gp)    # write a byte to the destination
The rightmost 8 bits of the register are used in byte operations
These data types are most useful when dealing with characters, pixel values, etc.
C uses the ASCII format to represent characters: each character is represented with 8 bits, and a string ends in the null character (corresponding to the 8-bit number 0)
For example, the string "Cal" in C is represented as 67, 97, 108, 0
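A small C sketch makes the storage cost of the terminating null byte explicit. It is an illustration only: the function name string_storage_bytes is hypothetical, and an ASCII execution character set is assumed for the byte values 67, 97, 108.

```c
#include <stddef.h>

/* Number of bytes a C string occupies in memory,
   counting the terminating null character. */
size_t string_storage_bytes(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')   /* walk forward until the null byte */
        n += 1;
    return n + 1;          /* +1 for the null byte itself */
}
```

For "Cal" this returns 4: three character bytes plus the null byte.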
ASCII code of characters
Example
Consider an example where one string of characters is copied into another string of characters
Convert to assembly:
void strcpy (char x[], char y[])
{
    int i;
    i = 0;
    while ((x[i] = y[i]) != '\0')
        i += 1;
}
Assume that the base addresses of arrays x and y are in $a0 and $a1, and that i is in $s0
Example
strcpy:
    addi $sp, $sp, -4       # adjust stack for 1 more item
    sw   $s0, 0($sp)        # save $s0
    add  $s0, $zero, $zero  # i = 0
L1: add  $t1, $s0, $a1      # address of y[i] in $t1
    lb   $t2, 0($t1)        # $t2 = y[i]
    add  $t3, $s0, $a0      # address of x[i] in $t3
    sb   $t2, 0($t3)        # x[i] = y[i]
    beq  $t2, $zero, L2     # if y[i] == 0, go to L2
    addi $s0, $s0, 1        # i = i + 1
    j    L1                 # go to L1
L2: lw   $s0, 0($sp)        # y[i] == 0, end of string: restore $s0
    addi $sp, $sp, 4        # pop 1 word off stack
    jr   $ra                # return
Large constants
Immediate instructions can only specify 16-bit constants
The lui instruction stores a 16-bit constant into the upper 16 bits of a register; thus, two immediate instructions are used to specify a 32-bit constant
For example, what if you want to add a 32-bit immediate number to the value stored in $s1?
The 32-bit number is 0000 0000 0011 1101 0000 1001 0000 0000
lui $s0, 61
$s0 = 0000 0000 0011 1101 0000 0000 0000 0000
ori $s0, $s0, 2304
$s0 = 0000 0000 0011 1101 0000 1001 0000 0000
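The lui/ori pair can be mimicked in C to confirm the resulting constant. This is a minimal sketch with hypothetical helper names (load_upper_immediate, or_immediate, build_constant); it models only the register arithmetic, not instruction encoding.

```c
#include <stdint.h>

/* lui: place a 16-bit immediate in the upper half, zero the lower half. */
uint32_t load_upper_immediate(uint16_t imm)
{
    return (uint32_t)imm << 16;
}

/* ori: OR a zero-extended 16-bit immediate into a register value. */
uint32_t or_immediate(uint32_t reg, uint16_t imm)
{
    return reg | (uint32_t)imm;
}

/* Build a 32-bit constant from its two 16-bit halves, as lui + ori do. */
uint32_t build_constant(uint16_t upper, uint16_t lower)
{
    return or_immediate(load_upper_immediate(upper), lower);
}
```

build_constant(61, 2304) yields 0x003D0900, which is 4,000,000 in decimal and matches the bit pattern on the slide.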
Large constants
The destination PC-address in a conditional branch is specified as a 16-bit
constant, relative to the current PC
How to branch far away?
A jump (j) instruction can specify a 26-bit constant; if more bits are
required, the jump-register (jr) instruction is used
Overview of different ISAs
Class of ISA
Register-memory ISAs, like the 80x86, can access memory through many instructions
Load-store ISAs, like ARM and MIPS (RISC designs), access memory through load or store instructions only
Memory addressing
Byte addressing or word addressing? 80x86, ARM, and MIPS use byte addressing
Must objects be aligned? ARM and MIPS: yes. 80x86: no
An access to an object of size S bytes at byte address A is aligned if A mod S = 0
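The alignment rule translates directly into a one-line check. A minimal sketch, assuming 32-bit byte addresses and a non-zero object size; the function name is_aligned is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if an object of `size` bytes at byte address `address`
   is aligned (address mod size == 0), otherwise 0.
   Assumes size is non-zero. */
int is_aligned(uint32_t address, uint32_t size)
{
    return address % size == 0;
}
```

For example, a 4-byte word at address 8 is aligned, while the same word at address 6 is not; a 2-byte half-word at address 6 is aligned.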
ISA: memory alignment
Overview of different ISAs
Addressing modes
Addressing modes specify the address of a memory object. Register, immediate, and displacement addressing modes are used
Types and sizes of operands
80x86, ARM, and MIPS support operand sizes of 8 bits (ASCII character), 16 bits (Unicode character), 32 bits (integer or word), and 64 bits (double word or long integer), plus 32-bit (single-precision) and 64-bit (double-precision) floating point
80-bit floating point (extended double precision) is supported by the 80x86 only
Operations
For example: data transfer, arithmetic/logical, control, and floating point
Control flow instructions
For example: conditional branches, unconditional jumps, procedure calls, and returns
Implementation: Organization and Hardware Design
Organization: the high-level aspects of computer design, such as the memory system, the memory interconnect, and the design of the CPU.
The term microarchitecture is also used for organization
Computers can have the same ISA but different organizations. For example, the AMD Opteron and the Intel Core i7 both implement the x86 ISA but have different pipeline and cache organizations
Hardware: the detailed logic design and the process/packaging technology
Computers can have the same ISA and organization but different hardware implementations.
For example, the Intel Core i7 and the Intel Xeon 7560 have the same ISA and organization but different clock rates and memory systems
Trends in technology
A successful ISA must be designed to survive technology evolution
Five implementation technologies have changed at a dramatic pace
Integrated circuit logic technology
  Transistor density increases by about 35% per year, quadrupling in somewhat over four years
  Transistor count on a chip grows by about 40% to 55% per year, doubling every 18 to 24 months (Moore's law)
Semiconductor DRAM
  Capacity increases by 25% to 40% per year, doubling roughly every two to three years
  The growth rate is slowing and may hit a wall sooner rather than later
Trends in technology
Semiconductor flash: the standard storage device in personal mobile devices (PMDs)
  Capacity increases by 50% to 60% per year, doubling roughly every two years
  15 to 20 times cheaper per bit than DRAM
Magnetic disk technology
  Prior to the 1990s, capacity increased by 30% per year
  In the mid 1990s, the growth rate rose to 100% per year
  It is now almost stable at 40% per year
  300 to 500 times cheaper per bit than DRAM
  Central to server and warehouse-scale computers
Network technology
  Depends on the performance of both network switches and transmission systems
  From the early 1980s until today, the bandwidth of network switches has increased 10,000 times
  Latency has improved by 30 times
Bandwidth over latency
Bandwidth or throughput: the total work done in a given time.
  For example, for disk data transfer it is measured in megabytes/second
  10,000 to 25,000 times improvement for processors
  300 to 1,200 times improvement for memory and disks
Latency or response time: the time between the start and completion of an event
  For example, disk access latency can be measured in milliseconds.
  30 to 80 times improvement for processors
  6 to 8 times improvement for memory and disks
General rule of thumb: bandwidth grows by at least the square of the improvement in latency
Bandwidth and latency
Scaling of transistor performance and wires
Feature size: the minimum size of a transistor or wire in either the x or y dimension
Feature size has decreased from 10 microns (in the 1970s) to 22 nm (Core i7)
Transistor performance improves linearly with decreasing feature size, e.g. the improvement from 4-bit microprocessors to today's 64-bit processors
Wires do not improve with reduced feature size
  Wires get shorter with reduced feature size
  Wire delay is directly proportional to the product of the wire's resistance and capacitance
  With reduced feature size, resistance and capacitance per unit length get worse
Wire delay is a major limiting factor today in the design of large ICs
Energy in microprocessors
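For a complete transition, i.e. 0-1-0:
\[ \text{Energy}_{\text{dynamic}} \propto \text{Capacitive load} \times \text{Voltage}^2 \]
For a single transition:
\[ \text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \]
For power:
\[ \text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched} \]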
As process technology improves, the number of transistors switching and the frequency with which they switch both increase, resulting in increased power and energy consumption
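Dynamic energy is proportional to capacitive load times voltage squared, and dynamic power additionally scales with switching frequency. The sketch below shows how lowering voltage and frequency changes both, relative to the original design. The function names and the 15% figure used in the note are illustrative assumptions, and the capacitive load is assumed to stay unchanged.

```c
/* Ratio of new to old dynamic energy when the supply voltage is scaled
   by voltage_ratio (for example, 0.85 for a 15% reduction).
   Assumes the capacitive load is unchanged. */
double relative_dynamic_energy(double voltage_ratio)
{
    return voltage_ratio * voltage_ratio;
}

/* Ratio of new to old dynamic power when voltage and switching
   frequency are scaled by the given ratios. */
double relative_dynamic_power(double voltage_ratio, double frequency_ratio)
{
    return voltage_ratio * voltage_ratio * frequency_ratio;
}
```

With a 15% reduction in both voltage and frequency, energy falls to about 72% of the original (0.85 squared) and power to about 61% (0.85 cubed), which is why scaling voltage together with frequency is attractive.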
Growth in clock rates for microprocessors
The first 32-bit microprocessors required only about 2 watts
An Intel Core i7 requires 130 watts
A higher clock rate means higher power and more heat to dissipate
Power and energy efficiency improvement techniques
How can we distribute power, remove heat, and prevent hot spots?
Do nothing well
  Turn off what is not being used
Dynamic Voltage-Frequency Scaling (DVFS)
  During periods of low activity, operate the device at a lower voltage and frequency
Power and energy efficiency improvement techniques
Design for the typical case
  PMDs are often idle, so memory offers low-power modes to save energy
  Laptop disks spin at a lower rate during inactive periods
Overclocking
  Run the system in turbo mode for short bursts of time and then go to sleep
  For example, a 3.3 GHz Core i7 can run at 3.6 GHz for short periods
The issue of static power
  Static power increases with newer process technology and can be as high as 50% of total power
  It can be decreased with power gating (turning off the power supply)
Trends in cost
The impact of time, volume, and commoditization
Time
  The learning curve improves yield over time
  Twice the yield means half the cost
Volume
  An increase in volume decreases the cost per chip
  Development cost per chip decreases
  Rule of thumb: cost decreases by about 10% for each doubling of volume
Commoditization
  More vendors, more competition, and lower prices
Chip manufacturing process
Wafer containing Intel Core i7 processors
280 Core i7 dies at 100% yield
Die size: 20.7 mm x 10.5 mm
The die is manufactured in a 32 nm process technology
Cost of integrated circuits
Die yield is estimated with the Bose-Einstein formula
Defects per unit area = 0.1 to 0.3 per square inch, or 0.016 to 0.057 per square cm
N is the process-complexity factor. N = 11.5 to 15.5 for a 40 nm process
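The formula itself did not survive on this slide. The commonly quoted textbook form of the Bose-Einstein die-yield model, using the two parameters named above, is reproduced here as a reference sketch; the worked numbers are illustrative assumptions rather than values from the lecture.

```latex
\[
\text{Die yield} = \text{Wafer yield} \times
  \frac{1}{\left(1 + \text{Defects per unit area} \times \text{Die area}\right)^{N}}
\]
% Illustrative values (assumed): wafer yield = 100%, die area = 1.0 cm^2,
% defects per unit area = 0.031 per cm^2, N = 13.5
\[
\text{Die yield} = \frac{1}{(1 + 0.031 \times 1.0)^{13.5}} \approx 0.66
\]
```

Because the exponent N is large, even a modest increase in die area or defect density lowers the yield sharply.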
Dependability
Historically, integrated circuits were among the most reliable components of a computer
With advances in process technology, transient and permanent faults have become more common
Module/component reliability is a measure of continuous service accomplishment from a reference initial time
  It is measured as Mean Time To Failure (MTTF)
  The reciprocal of MTTF is the failure rate, usually reported in FIT (failures in time)
  FIT is normally measured as the number of failures per billion hours
Service interruption is measured as Mean Time To Repair (MTTR)
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability = MTTF / (MTTF + MTTR)
The primary way to cope with failure is redundancy
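The dependability measures above are simple ratios, sketched here in C. The function names are hypothetical, and all times are assumed to be in hours.

```c
/* Mean time between failures: MTTF + MTTR (all times in hours). */
double mtbf_hours(double mttf_hours, double mttr_hours)
{
    return mttf_hours + mttr_hours;
}

/* Fraction of time the module delivers service: MTTF / (MTTF + MTTR). */
double availability(double mttf_hours, double mttr_hours)
{
    return mttf_hours / (mttf_hours + mttr_hours);
}

/* Failure rate expressed in FIT: failures per billion hours of operation. */
double fit_rate(double mttf_hours)
{
    return 1e9 / mttf_hours;
}
```

For instance, a module with an MTTF of 1,000,000 hours has a failure rate of 1,000 FIT, and a module with MTTF = 999 hours and MTTR = 1 hour is available 99.9% of the time.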
Quantitative principles of computer design
Taking advantage of parallelism
  At the server level, parallelism can be achieved through multiple processors and multiple disks
  At the single-processor level, the best example is pipelining
  At the memory level, a set-associative cache is an example: multiple banks are searched in parallel
Principle of locality
  90-10 rule of thumb: programs spend 90% of their time in 10% of the code
  Temporal locality: recently accessed data is likely to be accessed again in the near future
  Spatial locality: items whose addresses are near one another tend to be referenced close together in time
Quantitative principles of computer design
Focus on the common case
  In making a design trade-off, favor the frequent case over the infrequent case (a useful guide when unsure how to spend resources)
Amdahl's law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the faster mode can be used
Speedup: the improvement in performance (reduction in execution time) that can be gained by using a particular feature
Amdahl's law
Speedup is governed by two factors
The fraction of time in the original computer that can be converted to take advantage of the enhancement
  Termed fraction(enhanced); always less than or equal to 1
The improvement gained by the enhanced execution mode
  Its value is the time of the original mode divided by the time of the enhanced mode
  Termed speedup(enhanced); always greater than 1
Amdahl's law
Amdahl's law expresses the law of diminishing returns: the overall speedup cannot exceed 1 / (1 - fraction(enhanced)), no matter how large speedup(enhanced) is
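The standard statement of Amdahl's law is: overall speedup = 1 / ((1 - fraction(enhanced)) + fraction(enhanced) / speedup(enhanced)). A minimal C sketch of that formula and of its upper bound follows; the function names are hypothetical, and fraction_enhanced is assumed to be strictly less than 1 when computing the limit.

```c
/* Overall speedup from Amdahl's law.
   fraction_enhanced: share of original execution time that benefits (0..1).
   speedup_enhanced:  speedup of that share (greater than 1). */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced)
{
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced);
}

/* Upper bound on overall speedup as speedup_enhanced grows without limit.
   Assumes fraction_enhanced < 1. */
double amdahl_limit(double fraction_enhanced)
{
    return 1.0 / (1.0 - fraction_enhanced);
}
```

As an illustration, if 40% of the time can be enhanced and that part runs 10 times faster, the overall speedup is 1 / (0.6 + 0.04) = 1.5625, and no enhancement of that 40% can push the overall speedup past about 1.67.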
Processor performance equation
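The equation for this slide is not present in the text. Its standard form is CPU time = instruction count x cycles per instruction (CPI) x clock cycle time, or equivalently instruction count x CPI / clock rate. The sketch below is an illustration under that assumption; the function name and parameter names are hypothetical.

```c
/* CPU time in seconds from the processor performance equation:
   CPU time = instruction count * CPI / clock rate.
   clock_rate_hz is the clock frequency in cycles per second. */
double cpu_time_seconds(double instruction_count,
                        double cycles_per_instruction,
                        double clock_rate_hz)
{
    return instruction_count * cycles_per_instruction / clock_rate_hz;
}
```

For example, a program of one billion instructions with an average CPI of 2 on a 1 GHz processor takes 2 seconds. Halving the CPI or doubling the clock rate each halve the time.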