CS2100 Finals Cheatsheet
Computer Organisation (National University of Singapore)
Hazard and resolution

Structural Hazards
Simultaneous use of a hardware resource (e.g. memory unit used by both a load and an instruction fetch). No issue for MIPS, as the data and instruction memories are separate.

Data Hazards
RAW (Read After Write): the register file writes first, then reads (within the same cycle).
Without data forwarding: dependent instruction directly after the writer: 2-cycle delay; two instructions after: 1-cycle delay.
With data forwarding: instruction dependent on lw: 1-cycle delay; otherwise: no delay.
Detect a load-use hazard when ID/EX.instruction == Load &&
(ID/EX.rt == IF/ID.rs || ID/EX.rt == IF/ID.rt)

Data Forwarding
Resolves all RAW hazards except after lw (needs one stall); sw after lw might not need to stall at all.

Direct Mapped Cache
Per-block overhead: valid flag (1 bit) + tag length (initially, all valid flags are unset).
Blocks in cache: 2^M; bytes per block: 2^N
For each memory address val:
Set Index = (val mod 2^(N+M)) // 2^N
Word Index = (val mod 2^N) // Bytes_per_word
Tag = val // 2^(N+M)

Set-Associative Cache
A block maps to a unique set of N possible cache locations; an N-way SAC has N cache blocks per set.
Bytes per block: 2^M
Cache blocks = Size_cache / Size_block
Sets = Cache blocks / N = 2^N

Performance
Single Cycle
One instruction = 1 clock cycle.
Clock cycle time: longest latency amongst all instructions (usually lw).
Total Execution Time = Number of Instructions × Clock Cycle Time

Multi Cycle
One stage = 1 clock cycle. Cycle time decreases, clock frequency increases.
Different instructions take a variable number of clock cycles (since not all stages are needed).
Clock cycle time: longest latency amongst all stages.
Total Execution Time = I × Average CPI × Clock Cycle Time

Pipeline
One stage = 1 clock cycle.
Clock cycle time: longest latency amongst all stages + Td (time needed to store into the pipeline register).
Cycles needed for I instructions with N stages: I + N − 1
Total Execution Time = (I + N − 1) × Clock Cycle Time
If N(instructions) >> N(stages): Speedup(pipeline) = Time(single cycle) / Time(pipeline) ≈ N
Forwarding paths: from EX/MEM to the ALU for a dependent instruction 1 behind; from MEM/WB to the ALU for one 2 behind.

Control Hazards (Branching/Jumping)
Without ANY control measures: 3-cycle delay.
Early branch resolution: move the branch decision calculation from the EX/MEM stage to the ID stage – stall 1 cycle instead of 3 (may cause further stalls if a register is written by a previous instruction):
o Involved in RAW with the previous instruction (not lw): stall 2 cycles
o Involved in RAW with the previous instruction (lw): stall 3 cycles
o Not involved in any RAW: stall 1 cycle
Branch prediction (not taken): guess the outcome and speculatively execute instructions; if the guess is wrong, flush:
o With early branching: 1 cycle occurs before instructions get flushed/not flushed
o Without early branching: 3 cycles occur before instructions get flushed/not flushed
Delayed branch: X instructions following a branch are always executed regardless of the outcome (requires compiler re-ordering of instructions into the branch-delay slot(s), or adding nop instructions). Try to find independent instructions from before the branch:
o With early branching: shift 1 instruction
o Without early branching: shift 3 instructions

Fully-Associative Cache
A block can be placed anywhere, but all blocks must be searched.
No more conflict misses. Capacity misses = total misses − cold misses

Cache Performance
Larger block trade-off:
- Spatial locality advantage (hit rate increases)
- Miss penalty increases due to loading more data
- Temporal locality disadvantage past a certain limit (miss rate increases)
Rule of thumb: a direct-mapped cache of size N has almost the same miss rate as a 2-way set-associative cache of size N/2.
- Cold/compulsory misses do not depend on size/associativity
- For the same cache size, conflict misses decrease with increasing associativity
- Conflict misses are 0 for a FA cache
- For the same cache size, capacity misses do not depend on associativity
- Capacity misses decrease with increasing size
Units: 1 GiB = 2^30 bytes, 1 KiB = 2^10 bytes
Temporal locality: the same item tends to be re-referenced soon.
Spatial locality: nearby items tend to be referenced soon.

Block replacement policy
Least recently used (LRU): the usual policy, hard to track
First in first out (FIFO) – with a second-chance variant
Random replacement (RR)
Least frequently used (LFU)

Hit rate: fraction of memory accesses that are in the cache.
Average access time = (hit rate) × (hit time) + (1 − hit rate) × (miss penalty)
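The direct-mapped address breakdown on this sheet (2^M blocks of 2^N bytes per block) can be sketched in Python; the function name and example parameters below are illustrative, not part of the original:

```python
def split_address(val, M, N, bytes_per_word=4):
    """Direct-mapped cache with 2**M blocks of 2**N bytes each.
    Returns (tag, set_index, word_index) for byte address val."""
    set_index = (val % 2 ** (N + M)) // 2 ** N      # which cache block
    word_index = (val % 2 ** N) // bytes_per_word   # which word within the block
    tag = val // 2 ** (N + M)                       # identifies the memory block
    return tag, set_index, word_index

# Example: 2**2 = 4 blocks of 2**4 = 16 bytes, byte address 74
print(split_address(74, M=2, N=4))  # (1, 0, 2)
```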
Cache block/line: smallest unit of transfer between memory and cache.
Types of misses:
- Cold/Compulsory: the block has never been accessed before
- Conflict: the same index gets overwritten (direct-mapped & set-associative)
- Capacity: the cache cannot contain all the blocks (fully associative)

Write Policy
Write-through: write data to both the cache and main memory, using a write buffer to queue memory writes.
Write-back: write data to the cache only; write to main memory when the block is evicted, using a "dirty bit" on each cache block.
Write miss policy
Write allocate: load the block into the cache, then follow the write policy.
Write around: write directly to main memory.

Performance
Performance = 1 / Response Time
Speedup n of X over Y: n = Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X
CPU Time = Instructions/Program × Cycles/Instruction × Seconds/Cycle
Factors affecting performance: different compiler (affects Instructions per Program), different ISA (affects CPI).
Cannot use CPI alone to determine performance/time – use total time!
Amdahl's Law (performance is limited by the non-sped-up program portion): with P the fraction of program time that can be improved and n the speedup of that portion, Overall Speedup = 1 / ((1 − P) + P / n)

Pipelining (pipeline register contents)
IF/ID: instruction from memory & PC + 4
ID/EX: data read from the register file, 32-bit sign-extended Imm, & PC + 4
EX/MEM: branch target (PC + 4) + (Imm × 4), ALU result, isZero signal, & RD2 from the register file
MEM/WB: ALU result, memory read data & write register number (passed through all the pipeline stages)

Boolean Algebra
Precedence: NOT > AND > OR
Identity: A + 0 = A and A · 1 = A
Complement: A + A' = 1 and A · A' = 0
Commutative: A + B = B + A and A · B = B · A
Associative: A + (B + C) = (A + B) + C and A · (B · C) = (A · B) · C
Distributive: A + (B · C) = (A + B) · (A + C) and A · (B + C) = (A · B) + (A · C)
Duality (not a real law): if we flip the AND/OR operators and flip the identity elements (0 and 1), the Boolean equation still holds.
Idempotency: X + X = X and X · X = X
One/Zero Element: X + 1 = 1 and X · 0 = 0
Involution: (X')' = X
Absorption: X + (X · Y) = X and X · (X + Y) = X
Absorption (variant): X + (X' · Y) = X + Y and X · (X' + Y) = X · Y
De Morgan's (can be used on >2 variables): (X · Y)' = X' + Y' and (X + Y)' = X' · Y'
Consensus: (X · Y) + (X' · Z) + (Y · Z) = (X · Y) + (X' · Z) and (X + Y) · (X' + Z) · (Y + Z) = (X + Y) · (X' + Z)

Logic Gates
Complete set of logic: any set of gates sufficient for building any boolean function, e.g. {AND, OR, NOT}.
{NAND} is self-sufficient (a universal gate) ≡ Negative-OR.
{NOR} is self-sufficient (a universal gate) – the output is 1 only when both inputs are 0.

Logic Circuits
Combinational circuit: each output depends entirely on the present inputs.
Sequential circuit: each output depends on both the present inputs and the state.

K-map
Prime implicant: an implicant that is not a subset of any other implicant.
Essential prime implicant: a prime implicant with at least one '1' that is not in any other prime implicant (must appear in the final equation).
Simplified SOP expression – group the '1's on the K-map.
Simplified POS expression – find the SOP expression using the '0's on the K-map, then negate the resulting expression.
Grouping 2^N cells (only power-of-2 sizes are allowed) eliminates N variables.
EPIs are counted only by checking 1s, not Xs.
K-maps help to obtain a canonical SOP, but might not give the simplest possible expression (use boolean algebra for that).

• Priority Encoder: deals with multiple switched-on inputs by assigning priorities to the inputs; add a valid bit to flag the case where nothing is switched on.

• Demultiplexer:
- One input data line, N selection lines.
- Directs data from the input to a selected output line among the 2^N possibilities.
- Demultiplexer ≡ Decoder with enable.

• Multiplexer:
- Selects one of 2^n inputs to a single output line, using n selection lines.
- To implement a function of n variables, pass the variables to the n-bit selector and set the 2^n inputs to the appropriate constants from the truth table.
- To implement a function of n + 1 variables, pass the first n variables to the n-bit selector and set each input appropriately to '0', '1', Z, or Z' (Z is the last variable).
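The multiplexer-based function implementation described on this sheet can be checked in Python; the example function f = X ⊕ Y ⊕ Z and the helper `mux` are illustrative, not part of the original:

```python
def mux(inputs, sel_bits):
    """2**n-to-1 multiplexer: route inputs[sel] to the output (MSB first)."""
    sel = int("".join(map(str, sel_bits)), 2)
    return inputs[sel]

# n variables: pass X, Y, Z to an 8-to-1 mux selector and wire each input
# to the truth-table constant. Here f(X, Y, Z) = X xor Y xor Z (minterms 1, 2, 4, 7).
TABLE = [0, 1, 1, 0, 1, 0, 0, 1]  # f for XYZ = 000 .. 111

for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            assert mux(TABLE, [x, y, z]) == x ^ y ^ z

# n + 1 variables: pass X, Y to a 4-to-1 mux selector and feed each input
# 0, 1, Z, or Z'. For XOR every input happens to be Z or Z':
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            inputs = [z, 1 - z, 1 - z, z]  # f = Z, Z', Z', Z for XY = 00, 01, 10, 11
            assert mux(inputs, [x, y]) == x ^ y ^ z
```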
• Half-Adder: C = X · Y, S = X ⊕ Y
• Full-Adder: Cout = X · Y + (X ⊕ Y) · Cin, S = X ⊕ (Y ⊕ Cin) = (X ⊕ Y) ⊕ Cin
With negated outputs, use NAND to simulate OR and NOR to simulate AND.
• 4-bit parallel adder: built by cascading 4 full-adders via their carries.
• Adder-cum-subtractor: XOR the Y inputs with S (0/1 depending on add/subtract) and pass S in as C-in (X − Y = X + (1s-complement of Y) + 1).
• Magnitude Comparator: input: 2 unsigned values A and B; output: "A > B", "A = B", "A < B".

Circuit Delays
• For each component, time = max(∀ t_input) + t_current component
• Propagation delay of ripple-carry parallel adders ∝ no. of bits

Larger Components
- Remove a decoder that gives duplicate outputs (w.r.t. another decoder) by using an OR gate on the outputs from the first decoder, feeding the enable input of the second.
ALU Build (diagram)

MSI Components
SOP expression – implement using a 2-level AND-OR circuit or a 2-level NAND circuit.
POS expression – implement using a 2-level OR-AND circuit or a 2-level NOR circuit.

• Decoder (n-to-m-line decoder): converts binary information from n input lines to one of the m ≤ 2^n output lines (e.g. 2 x 4). Note: 0 is the least significant input!
- Each output line represents a minterm.
- Active high: generate the minterms and use OR on minterms to form a function. Alternatively, use NOR on the maxterms.
- Active low: AND the maxterms or NAND the minterms.
- Can add an Enable signal.
- Larger decoders can be constructed from smaller ones with an inverter (e.g. a 3 x 8 decoder built from two 2 x 4 decoders).

Minterms & Maxterms
A minterm/maxterm of n variables is a product/sum term that contains n literals, one from each variable → n variables → 2^n minterms, 2^n maxterms.
Minterm: m0 = X' · Y' · Z'
Maxterm: M0 = X + Y + Z
m0' = M0
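Since each decoder output is a minterm, a function can be built by ORing the decoder outputs for the wanted minterms. A minimal Python sketch; `decoder` and `from_minterms` are hypothetical helper names, not from the original:

```python
from itertools import product

def decoder(sel_bits, enable=1):
    """n-to-2**n line decoder (active high): output i is minterm m_i.
    All outputs are 0 when enable is 0 (a demux puts data on enable)."""
    idx = int("".join(map(str, sel_bits)), 2)   # sel_bits[0] is the MSB
    return [1 if enable and i == idx else 0 for i in range(2 ** len(sel_bits))]

def from_minterms(wanted, sel_bits):
    """OR together the decoder outputs listed in `wanted` (sum of minterms)."""
    outs = decoder(sel_bits)
    return max(outs[i] for i in wanted)

# f(X, Y, Z) = m0 + m7 = X'.Y'.Z' + X.Y.Z
for bits in product((0, 1), repeat=3):
    x, y, z = bits
    expected = ((1 - x) & (1 - y) & (1 - z)) | (x & y & z)
    assert from_minterms([0, 7], list(bits)) == expected
```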
Functions can be expressed as a sum of minterms or a product of maxterms.
Sum of 2 distinct maxterms is 1; product of 2 distinct minterms is 0.

K-map
Implicant: a product term whose cells are all '1' or 'X', with at least one '1'.

• Encoder: opposite of a decoder.
- Exactly ONE input should be '1'.
- If more than one input is switched on, the output is X (don't-care values).
- The position of the single active input line among the 2^n possibilities is coded as an n-bit code.

Sequential Circuits
Self-correcting: any unused state transits to a used state after a finite number of cycles.
Synchronous: outputs change at specific times (with the clock).
Asynchronous: outputs change at any time.
Multivibrator: sequential circuits that operate/swing between
HIGH and LOW state
Bistable: 2 stable states (e.g. latch, flip-flop)
Monostable / one-shot: 1 stable state
Astable: no stable state (e.g. clock)
Memory element: device that can remember value indefinitely, or change
value on command from its inputs. Same input does not always give same
output!
Pulse-triggered: activated by +ve/−ve pulses (e.g. latch)
Edge-triggered: activated by rising/falling edge (e.g. flip-flop)
S-R latch ("Set-Reset"): active-high version built from 2 cross-coupled NOR gates; active-low version from NAND gates.
Gated S-R latch: outputs change only when EN is HIGH (via AND gates); the value is memorised while EN is LOW.
Gated D latch ("Data"): can be built from a gated S-R latch (no invalid inputs).
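The latch behaviour above can be sketched as next-state functions in Python; the function names are mine, and the sketch models behaviour, not gate timing:

```python
def sr_latch(q, s, r):
    """Next state of an S-R latch: Q+ = S + R'.Q (S = R = 1 is invalid)."""
    assert not (s and r), "invalid input combination"
    return s | ((1 - r) & q)

def gated_d_latch(q, d, en):
    """Gated D latch from a gated S-R latch: S = D, R = D' while EN is HIGH;
    the value is memorised while EN is LOW."""
    if not en:
        return q
    return sr_latch(q, d, 1 - d)

for q in (0, 1):
    assert gated_d_latch(q, 1, en=1) == 1   # D = 1 sets
    assert gated_d_latch(q, 0, en=1) == 0   # D = 0 resets
    assert gated_d_latch(q, 1, en=0) == q   # EN low: hold
```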
• S-R flip-flop: Similar to gated S-R latch
• D (data) flip-flop: Similar to gated D latch (No invalid Inputs)
• J-K flip-flop: J:“Set”, K:“Reset”, Toggle if both HIGH
• T flip-flop (“Toggle”): J-K flip-flop with tied inputs
J-K Flip Flop: Q and Q’ fed back to NAND gates
T Flip Flop: Tie both inputs of J-K together
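The flip-flop behaviour above follows the standard characteristic equations; a small Python sketch (function names are mine):

```python
def jk_next(q, j, k):
    """J-K flip-flop next state: Q+ = J.Q' + K'.Q (hold/reset/set/toggle)."""
    return (j & (1 - q)) | ((1 - k) & q)

def d_next(q, d):
    """D flip-flop: Q+ = D."""
    return d

def t_next(q, t):
    """T flip-flop: J-K with tied inputs, so Q+ = T xor Q."""
    return jk_next(q, t, t)

for q in (0, 1):
    assert jk_next(q, 0, 0) == q        # hold
    assert jk_next(q, 0, 1) == 0        # reset
    assert jk_next(q, 1, 0) == 1        # set
    assert jk_next(q, 1, 1) == 1 - q    # toggle
    for t in (0, 1):
        assert t_next(q, t) == q ^ t
```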