EEDG/CE/CS 6304 Computer Architecture
Lecture 13 – Modern Processors
Types
Benjamin Carrion Schaefer
Associate Professor
Department of Electrical and Computer Engineering
Course Overview
• Fundamentals of Design and Analysis of
Computers (2 lectures)
– History, technological breakthroughs, etc.
– Trends and metrics: performance,
power/energy, cost
• CPU (7 Lectures)
– Instruction Set Architecture
– Arithmetic for Computers (new)
– Instruction Level Parallelism (ILP)
– Dynamic instruction scheduling
– Branch prediction
– Thread-level parallelism
– Modern processors
• Memories (4 Lectures)
– Memory hierarchy
– Caches
– Secondary storage
– Virtual memory
• Buses (1 lecture)
• New computer structures: Heterogeneous
computing (1 lecture)
Objectives
Upon completion of this chapter, you will be able to:
– Differentiate between the different microprocessor
architectures
• Scalar
• Superscalar
• CISC
• RISC
• Vector Processor
– Understand how control units in CPUs work
• Hardwired vs. microprogrammed
– Understand current processors available and their architecture
• Intel, ARM, MIPS
– Application Specific Processors
– Soft Processors
– MicroBlaze
– Nios II
Ref: Miscellaneous Sources
Scalar Processor
• Can only execute one
instruction at a time
• Flynn’s Taxonomy
– Single instruction stream
single data stream (SISD)
Superscalar Processor
• Processor being able to issue multiple
instructions in a single clock cycle
• Implements parallelism through instruction-level
parallelism → increases throughput
• Execute more than one instruction during a
clock cycle by simultaneously dispatching
multiple instructions to different execution
units on the processor
Vector Processing
• Vector processors have high-level operations that
work on linear arrays of numbers (vectors)
Source: DAP Spr.'98 ©UCB
Vector Processors Properties
• Operate on an entire vector in one instruction
• Each result is independent of the previous result
– Deep, wide pipeline → compiler ensures no dependencies
– High clock rate
• Vector instructions access memory with known patterns
– Highly interleaved memory (spreading memory addresses evenly across
memory banks)
– Amortize memory latency
– No (data) cache required (only instruction cache)
• Vector operations are SIMD
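To make the "entire vector in one instruction" property concrete, here is a minimal Python sketch that compares how many instructions a scalar loop and a vector-register machine might issue for an element-wise add C = A + B. The maximum vector length of 64 and the per-iteration instruction counts are assumptions chosen for illustration, not figures from a specific machine.

```python
import math

MVL = 64  # assumed maximum vector length (elements per vector register)

def scalar_instruction_count(n):
    # Assumed scalar loop body per element:
    # 2 loads, 1 add, 1 store, 1 index update, 1 branch
    return 6 * n

def vector_instruction_count(n, mvl=MVL):
    # Strip-mining: the arrays are processed in chunks of at most mvl elements.
    # Assumed per chunk: set vector length, 2 vector loads, 1 vector add,
    # 1 vector store, 1 index update, 1 branch
    chunks = math.ceil(n / mvl)
    return 7 * chunks

def vector_add(a, b, mvl=MVL):
    # Functional model: each slice stands for the work of one vector add
    c = []
    for start in range(0, len(a), mvl):
        chunk_a = a[start:start + mvl]
        chunk_b = b[start:start + mvl]
        c.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return c

n = 256
print(scalar_instruction_count(n), vector_instruction_count(n))
```

Under these assumptions the number of operations is the same in both cases; only the number of fetched and decoded instructions shrinks.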
Types of Vector Architectures
• Based on how the operands are
fetched :
1. Memory-memory vector
processors
– Operands directly streamed to
functional units from memory and
stored back to memory
2. Vector-register processors
– All operands are read into vector
registers which feed to the
functional units and results stored
back into vector registers
• Vector equivalent of load-store
architecture
• Includes all vector machines since the late
1980s (Cray, Convex, Fujitsu, Hitachi, NEC)
• Assume vector-register machine from
now on
Operations and Instruction Count RISC vs. Vector
Processor*
Spec92fp    Operations (millions)     Instructions (millions)
Program     RISC    Vector   R/V      RISC    Vector   R/V
Swim256     115     95       1.1x     115     0.8      142x
Hydro2d     58      40       1.4x     58      0.8      71x
Nasa7       69      41       1.7x     69      2.2      31x
Su2cor      51      35       1.4x     51      1.8      29x
Tomcatv     15      10       1.4x     15      1.3      11x
wave5       27      25       1.1x     27      7.2      4x
mdljdp2     32      52       0.6x     32      15.8     2x
• Vector reduces
– ops by 1.2x
– Instructions by 20x
* Ref. F. Quintana (Universidad de Barcelona)
Components of Vector Processor
• Vector Register
– Fixed length bank holding a single vector
• Has at least 2 read and 1 write ports
• Typically, 8-32 vector registers, each holding
64-128 64-bit elements
• Vector Functional Units (FUs)
– Fully pipelined → start a new operation
every clock
• Typically, 4 to 8 FUs (e.g., FP add, FP mult,
integer add)
• Vector Load-Store Units (LSU)
– Fully pipelined unit to load or store a
vector
– May have multiple LSU
• Scalar Registers
– Single elements for interconnecting FUs,
LSUs and registers
Example of Vector Machines
Machine      Year   Clock [MHz]   Regs     Elements/Reg   FUs   Load/Store Units
Cray-1       1976   80            8        64             6     1
Cray XMP     1983   120           8        64             8     2L, 1S
Cray YMP     1988   166           8        64             8     2L, 1S
Cray C-90    1991   240           8        128            8     4
Cray T-90    1996   455           8        128            8     4
Fuj VP200    1982   133           16       128            3     1
Fuj VP300    1996   133           8-256    32-1024        3     2
NEC SX/2     1984   160           8+8K     256+var        16    8
NEC SX/3     1995   400           8+8K     256+var        16    8
Vector Processors Declining Use
• More expensive than superscalar processors
– Very few units are sold → high design cost per unit
• Need high-speed on-chip memory →
expensive
• Few architectural innovations to improve
performance
Scalar vs. Vector Processors
• Vector processor
– Smaller program size (requires fewer instructions)
– Memory access is more “efficient” since every data item requested is
actually used
– Once data is being processed, other units (fetch, decode, etc.) can be
powered off → reduces power
– Reduces fetch and decode bandwidth as fewer instructions are
fetched
– Exploit parallelism in large scientific and multimedia applications
– Mainly used in supercomputers
• Scalar processor
– Each instruction operates on a single data item
• Most current CPUs implement vector-like instructions (SIMD)
– Intel’s x86 MMX/SSE (Streaming SIMD Extensions)
– Cell processor (IBM, Toshiba, Sony) for PlayStation 3: 1 scalar + 8 SIMD
processors
Conclusion: Vector processors are not viable for economic reasons, BUT vector
instruction set architectures are.
CISC vs. RISC Processors
• CISC: Complex Instruction Set Processors
– Complete a task in as few lines of assembly as possible.
– Performance improvement by simplifying the compiler →
burden falls on the processor
– E.g., Intel x86 processor
• RISC: Reduced Instruction Set Processors
– Only use simple instructions that can be executed within
one clock cycle
– Requires fewer transistors to produce processors
– processor can execute the instructions more quickly
– However greater burden is placed upon the compiler
– E.g., ARM Processor
Example
• Find the product of two numbers - one stored in
location 23h and another stored in location 52h -
and then store the product back in the location
23h
• CISC
MUL (23h), (52h)
• RISC
LOAD A, (23h)
LOAD B, (52h)
MUL A, B
STORE (23h), A
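The two instruction sequences above can be mimicked with a small Python sketch. Memory is modeled as a dictionary keyed by address, and the register names A and B follow the RISC listing; the function names are hypothetical and only illustrate that both approaches produce the same memory contents.

```python
def run_cisc_multiply(memory):
    # MUL (23h), (52h): a single memory-to-memory instruction that
    # loads both operands, multiplies them and stores the result
    memory[0x23] = memory[0x23] * memory[0x52]
    return memory

def run_risc_multiply(memory):
    regs = {"A": 0, "B": 0}
    regs["A"] = memory[0x23]            # LOAD A, (23h)
    regs["B"] = memory[0x52]            # LOAD B, (52h)
    regs["A"] = regs["A"] * regs["B"]   # MUL A, B
    memory[0x23] = regs["A"]            # STORE (23h), A
    return memory

print(run_cisc_multiply({0x23: 6, 0x52: 7}))
print(run_risc_multiply({0x23: 6, 0x52: 7}))
```

The RISC version needs four instructions instead of one, but each of them is simple enough to fit a single-cycle, register-to-register pipeline.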
CISC vs. RISC
CISC                                       RISC
Emphasis on hardware                       Emphasis on software
Includes multi-clock                       Single-clock,
complex instructions                       reduced instruction only
Memory-to-memory:                          Register to register:
"LOAD" and "STORE" incorporated            "LOAD" and "STORE" are
in instructions                            independent instructions
Small code sizes,                          Low cycles per second,
high cycles per second                     large code sizes
The Performance Equation
• CISC approach attempts to minimize the number of
instructions per program, sacrificing the number of
cycles per instruction
• RISC does the opposite, reducing the cycles per
instruction at the cost of the number of instructions per
program.
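$$\text{Processor Performance} = \frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{code size}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time} \,=\, 1/\text{freq}}$$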
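The performance equation can be evaluated numerically. The sketch below is only an illustration: the instruction counts, CPI values and the 1 GHz clock are made-up assumptions, not measurements of any real CISC or RISC processor.

```python
def execution_time(instructions, cpi, clock_hz):
    # Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle,
    # where Time/Cycle = 1 / clock frequency
    return instructions * cpi / clock_hz

# Hypothetical CISC build: fewer instructions, more cycles per instruction
cisc_time = execution_time(instructions=1_000_000, cpi=4.0, clock_hz=1e9)

# Hypothetical RISC build: more instructions, fewer cycles per instruction
risc_time = execution_time(instructions=3_000_000, cpi=1.2, clock_hz=1e9)

print(f"CISC: {cisc_time * 1e3:.2f} ms, RISC: {risc_time * 1e3:.2f} ms")
```

With these assumed numbers the RISC program runs three times as many instructions yet finishes slightly earlier, which is the trade-off the two bullets describe.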
Instruction Decoder Differences
• CISC
– Microprogrammed
• RISC
– Hardwired decoder
Control Signals when Executing 1 Instr.
• Performing an Arithmetic or
Logic Operation
ADD R1, R2    ; [R1] + [R2] → R1
Step   Action
1.     R1out, Yin
2.     R2out, SelectY, Add, Zin
3.     Zout, R1in
It takes three clock cycles to complete.
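The three control steps can be traced with a small Python model of a single-bus datapath. This is a sketch under assumptions: Y is taken to be the ALU input latch, Z the ALU output latch, and the function name is hypothetical.

```python
def execute_add_r1_r2(regs):
    # regs holds R1 and R2; y and z model the ALU input and output latches
    y = 0
    z = 0
    steps = [
        {"R1out", "Yin"},                    # step 1
        {"R2out", "SelectY", "Add", "Zin"},  # step 2
        {"Zout", "R1in"},                    # step 3
    ]
    for signals in steps:  # one clock cycle per step
        # "out" signals decide which register drives the internal bus
        bus = None
        if "R1out" in signals:
            bus = regs["R1"]
        if "R2out" in signals:
            bus = regs["R2"]
        if "Zout" in signals:
            bus = z
        # ALU adds Y and the bus value when SelectY and Add are asserted
        alu = (y + bus) if {"SelectY", "Add"} <= signals else None
        # "in" signals latch values at the end of the cycle
        if "Yin" in signals:
            y = bus
        if "Zin" in signals:
            z = alu
        if "R1in" in signals:
            regs["R1"] = bus
    return regs, len(steps)

print(execute_add_r1_r2({"R1": 5, "R2": 7}))
```

Only one register can drive the bus per cycle, which is why the addition needs the Y and Z latches and three separate steps.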
Hardwired Control
• Two ways to generate control signals:
– Hardwired - faster
– Micro-programmed – slower but more flexible
• Instructions are executed in steps, each taking 1 clock
cycle
• Different actions are performed in each step depending on the
instruction being executed → setting of control signals
depends on:
– Contents of step counter
– Contents of instruction register
– Result of a computation or a comparison operation
– External input signals e.g., interrupts
Hardwired Control - Circuit
• Instruction decoder interprets the opcode in the IR and sets INS1, INS2, …, INSm
signals
• Step counter indicates which execution step the processor is in (T1-T5)
• External inputs (e.g., interrupts) connected directly to control signal generator
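In a hardwired design each control signal is a fixed logic function of the step counter outputs and the decoded instruction lines. The sketch below expresses that idea in Python for the execution steps of ADD R1, R2. It is a simplification under assumptions: instruction fetch, condition codes and external inputs are left out, the MOV instruction is hypothetical and added only to show one signal shared by two instructions, and End is assumed to be the signal that resets the step counter.

```python
def control_signals(step, ins):
    """Return the set of control signals active in this step."""
    # Step decoder outputs T1..T3 (only the execution steps are modeled)
    T = {i: (step == i) for i in (1, 2, 3)}
    # Instruction decoder outputs (INS lines)
    ADD = ins == "ADD"   # ADD R1, R2
    MOV = ins == "MOV"   # hypothetical MOV R1, R2, for illustration only
    signals = {
        "R1out":   T[1] and ADD,
        "Yin":     T[1] and ADD,
        "R2out":   (T[2] and ADD) or (T[1] and MOV),
        "SelectY": T[2] and ADD,
        "Add":     T[2] and ADD,
        "Zin":     T[2] and ADD,
        "Zout":    T[3] and ADD,
        "R1in":    (T[3] and ADD) or (T[1] and MOV),
        "End":     (T[3] and ADD) or (T[1] and MOV),
    }
    return {name for name, active in signals.items() if active}

for t in (1, 2, 3):
    print(t, sorted(control_signals(t, "ADD")))
```

Each dictionary entry corresponds to an AND/OR gate network in hardware, which is why hardwired control is fast but has to be redesigned when the instruction set changes.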
Microprogrammed Control
• Control signals are generated by a program similar
to machine language programs (microprogram)
• Microprogram is stored on the processor in a
small and very fast memory called microprogram
memory (control store)
• Sequence of microinstructions corresponding to a
given machine instruction constitute the
microroutine for that instruction
Microprogrammed Control - Circuit
• Microinstruction address generator
– Generates the address to be used for reading from the
control store
• μPC (microprogram counter)
– Keeps track of control store addresses
• Control store
– Contains the microinstructions and their control signals
[Figure: IR → microinstruction address generator → μPC → control store]
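The interplay of address generator, μPC and control store can be sketched in a few lines of Python. The control store contents reuse the three steps of ADD R1, R2 from the earlier slide; the start address table, the end flag and the function name are assumptions made for illustration.

```python
# Control store: address -> (control signals, end-of-microroutine flag)
CONTROL_STORE = {
    0: ({"R1out", "Yin"}, False),
    1: ({"R2out", "SelectY", "Add", "Zin"}, False),
    2: ({"Zout", "R1in"}, True),
}

# Hypothetical mapping used by the microinstruction address generator
START_ADDRESS = {"ADD": 0}

def run_microroutine(opcode):
    # The address generator loads the microprogram counter from the IR opcode
    upc = START_ADDRESS[opcode]
    trace = []
    while True:
        signals, end = CONTROL_STORE[upc]   # read one microinstruction
        trace.append((upc, signals))
        if end:
            break
        upc += 1                            # step to the next microinstruction
    return trace

for address, signals in run_microroutine("ADD"):
    print(address, sorted(signals))
```

Changing an instruction's behavior only requires editing the control store contents, which is the flexibility that microprogramming trades for speed.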
Microprogrammed Control
An example of microinstructions for ADD R1, [R3] (Slide 24).
Microinstruction Coding Scheme
Microinstruction (19 bits): F1 F2 F3 F4 F5 F6 F7 F8
F1 (4 bits): 0000: No transfer, 0001: PCout, 0010: MDRout, 0011: Zout,
             0100: R0out, 0101: R1out, 0110: R2out, 0111: R3out,
             1010: TEMPout, 1011: Offsetout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin,
             100: R0in, 101: R1in, 110: R2in, 111: R3in
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, ..., 1111: XOR (16 ALU functions)
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit):  0: SelectY, 1: Select4
F7 (1 bit):  0: No action, 1: WMFC
F8 (1 bit):  0: Continue, 1: End
An example of a partial format for field-encoded microinstructions.
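As a worked example of this field encoding, the following Python sketch packs the three steps of ADD R1, R2 into 19-bit words. The field codes are taken from the slide; the dictionary keys, the "none" placeholder and the encode function are hypothetical names, and only the three ALU codes listed on the slide are included.

```python
# "none" stands for "No transfer" / "No action"
F1 = {"none": 0b0000, "PCout": 0b0001, "MDRout": 0b0010, "Zout": 0b0011,
      "R0out": 0b0100, "R1out": 0b0101, "R2out": 0b0110, "R3out": 0b0111,
      "TEMPout": 0b1010, "Offsetout": 0b1011}
F2 = {"none": 0b000, "PCin": 0b001, "IRin": 0b010, "Zin": 0b011,
      "R0in": 0b100, "R1in": 0b101, "R2in": 0b110, "R3in": 0b111}
F3 = {"none": 0b000, "MARin": 0b001, "MDRin": 0b010, "TEMPin": 0b011,
      "Yin": 0b100}
F4 = {"Add": 0b0000, "Sub": 0b0001, "XOR": 0b1111}
F5 = {"none": 0b00, "Read": 0b01, "Write": 0b10}
WIDTHS = (4, 3, 3, 4, 2, 1, 1, 1)  # F1..F8, 19 bits in total

def encode(f1="none", f2="none", f3="none", f4="Add", f5="none",
           select4=False, wmfc=False, end=False):
    values = (F1[f1], F2[f2], F3[f3], F4[f4], F5[f5],
              int(select4), int(wmfc), int(end))
    word = 0
    for value, width in zip(values, WIDTHS):
        word = (word << width) | value   # append each field, F1 first
    return format(word, "019b")

step1 = encode(f1="R1out", f3="Yin")             # R1out, Yin
step2 = encode(f1="R2out", f2="Zin", f4="Add")   # R2out, SelectY, Add, Zin
step3 = encode(f1="Zout", f2="R1in", end=True)   # Zout, R1in, End
print(step1, step2, step3, sep="\n")
```

Because each field selects one of several mutually exclusive signals, 19 bits are enough here, whereas a fully unencoded format would need one bit per control signal.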
Embedded Microprocessors
• Systems on Chip (SoCs) contain at least 1 embedded
processor
– High performance
– Low power: $P_{dynamic} = I_{avg} V_{dd} = \alpha C f V_{dd}^2$
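The quadratic dependence on the supply voltage is the key point of the dynamic power formula. A minimal Python sketch, with a made-up activity factor, capacitance and frequency chosen only for illustration:

```python
def dynamic_power(alpha, capacitance, frequency, vdd):
    # P_dynamic = alpha * C * f * Vdd^2
    return alpha * capacitance * frequency * vdd ** 2

# Hypothetical values: alpha = 0.2, C = 1 nF, f = 1 GHz
p_nominal = dynamic_power(alpha=0.2, capacitance=1e-9, frequency=1e9, vdd=1.0)
p_scaled = dynamic_power(alpha=0.2, capacitance=1e-9, frequency=1e9, vdd=0.9)

print(f"{p_nominal:.3f} W at 1.0 V, {p_scaled:.3f} W at 0.9 V")
print(f"ratio: {p_scaled / p_nominal:.2f}")
```

Under these assumptions, lowering Vdd by 10% at the same frequency cuts dynamic power by roughly 19%, which is why voltage scaling is the first lever for low-power embedded processors.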
ARM Business model
• ARM Holdings licenses to third parties:
– the chip designs
– the ARM instruction set architectures
• Third parties:
– design their own products that implement one of
those architectures
ARM Acquisitions
ARM in 2023
ARM Cores
Apple A6/A6X – iPhone 5
Altera Cyclone V SoC, Xilinx Zynq
Apple A7/A8/A8X/A9/A9X
ARM Cores Naming convention
• In the past ARM numbered their cores:
– ARM7, ARM8, ARM9, ARM10, ARM11
• Cortex is the result of a new naming convention
from ARM → all future ARM cores are called Cortex
plus a suffix.
– Cortex-A for application processors, the high end,
running at > 1 GHz
– Cortex-R for real-time processors, the mid-range,
400-600 MHz
– Cortex-M for microcontrollers, the low end, running at
< 200 MHz
ARM Cores Line Up
• Cortex-A Application Processor:
– Smartphones
– Netbooks
– eReaders
– Digital TV
– Home Gateways
– Servers and Networking
• Cortex-R Real-time embedded processors:
– Automotive braking systems
– Powertrain solutions
– Mass storage controller
– Networking & Printing
• Cortex-M Microcontroller:
– Microcontrollers
– Mixed signal devices
– Smart sensors
– Automotive body electronics and airbags
• SecurCore Security Applications:
– Security markets for mobile SIMs
– identification applications
www.arm.com/products/processors/index.php
ARM Cortex-X
• Design is based on the ARM Cortex-A78, but
redesigned purely for performance instead of a
balance of performance, power, and area (PPA)
• X1:
– 5-wide decode out-of-order superscalar
– can fetch 5 instructions per cycle
– out-of-order window size has been increased to 224
entries
– Has 15 execution ports with a pipeline depth of 13
stages; the execution latencies consist of 10
stages
• Latest Cortex-X4
ARM Cores
• Reduced Instruction Set Computer (RISC)
– uniform register file load/store architecture,
where data processing operates only on
register contents, not directly on memory
contents.
– Simple addressing modes, with all load/store
addresses determined from register contents
and instruction fields only
• Differentiating features:
– Java acceleration (Jazelle)
– VFP : Vector Floating point (co-processor)
– security (TrustZone)
– SIMD
– Advanced SIMD (NEON) technologies.
– The ARMv8-architecture adds a Cryptographic
extension as an optional feature.
– Thumb instruction set provides a subset of
the most commonly used 32-bit ARM
instructions which have been compressed into
16-bit wide opcodes. On execution, these
16-bit instructions are decompressed
transparently to full 32-bit ARM instructions in
real time without performance loss.
Thumb Instructions
• Re-encoded subset of the ARM instruction set
• Thumb instructions execute in their own processor
state
• Thumb instructions are half the size of ARM
instructions (16 bits compared with 32 bits)
Pros:
• Greater code density compared to the ARM
instruction set
Cons:
• Uses more instructions (ARM and Thumb)
ARM Cortex-A Architecture
• Cortex-A57 processor
– highest-performance and
most advanced processor
– Based on the ARMv8-A
Architecture
– launched in early 2015
– big.LITTLE technology
– ACP: Accelerator Coherency
Port (DMA)
– SCU: Snoop Control Unit
(maintains cache
coherence between ARM
processors)
Cortex-A57 Features
• Superscalar, variable-length, out-of-order
pipeline.
• Dynamic branch prediction with Branch Target
Buffer (BTB) and Global History Buffer
(GHB) RAMs
• Fixed 48KB L1 instruction cache and 32KB L1 data
cache.
• Shared L2 cache of 512KB, 1MB, or 2MB
configurable size.
Cortex-A57 Implementation Options
• When implementing the Cortex-A57 processor
in an SoC
ARM big.LITTLE
• Pairs a high-performance core with a lower-performance core (e.g., ARM Cortex-A15
and Cortex-A7)
• Both processors support the same ARMv7-A ISA
• Differences in the internal microarchitecture give them
different power and performance characteristics
• KEY: dynamically allocate tasks to the right processor according to their
instantaneous performance requirement → cache-coherent interconnect →
no need to transfer data through main memory
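The dynamic allocation idea can be sketched as a simple threshold policy with hysteresis. This is an illustrative toy, not ARM's actual scheduler: the function name, the demand metric (a 0..1 load estimate) and both thresholds are assumptions.

```python
def assign_cluster(demand, current="LITTLE", up=0.8, down=0.3):
    # demand: assumed recent load of the task, between 0.0 and 1.0
    # Move to the big core only when the LITTLE core is nearly saturated
    if current == "LITTLE" and demand > up:
        return "big"
    # Move back only when demand has clearly dropped (hysteresis avoids
    # bouncing between clusters around a single threshold)
    if current == "big" and demand < down:
        return "LITTLE"
    return current

cluster = "LITTLE"
for demand in (0.2, 0.9, 0.5, 0.1):
    cluster = assign_cluster(demand, cluster)
    print(demand, cluster)
```

Because both clusters share the same ISA and a cache-coherent interconnect, such a migration does not require recompiling the task or copying its data through main memory.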
ARM big.LITTLE Architecture
• Little: Cortex-A7
– Simple, in-order execution, 8 pipeline stages
• Big: Cortex-A15
– Complex, out-of-order execution, multi-issue pipelines
Apple A10
• Quad core
– 2 high-performance cores (64-bit
ARMv8-A – Hurricane – 4.16
mm²)
– 2 energy-efficient cores (64-bit
ARM cores – Zephyr – 0.78 mm²)
• ARM big.LITTLE technology, BUT
only one core of each pair can be active at a
time → works as a dual core
• Small ones embedded into large
cores
– Share L2 cache (3 MBytes)
– L3 cache (4 MBytes) serves all
CPUs
• Die area 125 mm², 3.3 billion
transistors (TSMC 16nm FinFET)
• Dedicated image processing unit
Apple’s A11 Application Processors
• Shrunk by 30% compared to the A10, to
87.66 mm², 4.3 billion transistors
→ 25% faster (TSMC 10nm FinFET
process)
• Custom ARM architecture
• 6 CPUs
– 2 high performance 64-bit ARMv8-A
cores – Monsoon – have own L2 $
– 4 energy-efficient cores – Mistral –
share L2 $
– Can be used simultaneously
• Dedicated hardware processing
blocks
– Neural Engine (600 billion
operations per second) → used for,
e.g., Face ID
– Dedicated image processing unit
• Package-on-package device: CPU +
3GB of SDRAM
Apple A14 vs. A15 vs. A16
• A14 and A15: TSMC 5nm process technology
• A16: TSMC 4nm
• Equipped with the same number of cores
– 2 high performance
– 4 energy efficient cores
• A15 has 5 GPU cores compared to the 4 GPU
cores of the A14
• A16 has 5 GPU cores
• A15 has 15 billion transistors vs. 11.8 billion
for the A14; the A16 has 16 billion
– New image processing accelerator
A17 Pro
• 3nm process → 19 billion transistors
• 6 CPUs
– 2 High-performance cores (10% faster)
• Improved branch predictor
• Wider decode & execute engines
– 4 Efficiency cores (3x performance/watt)
• Neural engine
– 16 cores (up to 2x faster – 35 trillion
operations/second)
• Dedicated engines = HW accelerators
– ProRes codec
– Display engine
– AV1 decoder (video codec)
• New GPU architecture
– Hardware accelerated Ray tracing
Apple Silicon
A14: 5nm, 11.8 billion transistors
M1 Ultra: 114 billion transistors
Apple M1
• 5nm process
• 16 billion transistors
• SoC and System in a
Package (DRAM in same
package as CPU)
• big.LITTLE architecture (4
large, 4 small cores)
• 64-bit architecture
• Dedicated HW accelerators
(neural engine)
Running Intel Programs on M1
• Need to emulate Intel x86 instructions on the
ARM-based M1 processor
• Apple released Rosetta 2 for this purpose
• Invisible to the user BUT causes a performance slowdown
• Apple used the same approach (the original Rosetta) during its
transition from IBM PowerPC to Intel in 2006
MIPS Technologies
• Founded in 1984 at Stanford University
– Founded by Prof. Hennessy and his student Chris Rowen
• Fabless semiconductor company developing RISC CPU chips
• 1988 Silicon Graphics (SGI) adopted MIPS architecture for
its computers
• 1989 IPO
• 1992 fully acquired by SGI
• 1998 SGI spun the business off
• 2013 acquired by Imagination Technologies (UK) –
embedded graphics chips.
• 2017 sold to a venture capital firm
David Patterson and John Hennessy
https://www.cnet.com/news/risc-chip-inventors-hennessy-patterson-win-computing-turing-prize/
MIPS Processors
• 2013 acquired by Imagination Technologies
• 2017 sold to venture capitalist firm
http://imgtec.com/mips/
MIPS I6400 Internal Structure
• Data coherence across cores important
RISC-V
• Open-Source RISC ISA
• Free and extensible → software and hardware
freedom on the architecture
RISC V
CISC
• Intel x86 (and compatible AMD) only CISC processors
• CISC chips are becoming increasingly unwieldy and difficult
to develop
• Intel has the resources to plow through development and
produce powerful processors
BUT
• Cost of RAM has decreased significantly.
– In 1977, 1MB of DRAM cost about $5,000.
– By 1994, the same amount of memory cost only $6 (when adjusted
for inflation).
• Compiler technology has also become more sophisticated
• RISC processors consume less power → ideal for embedded
systems
Intel Processors : 13th Gen Intel Core
Intel Processors : 14th Gen
Intel's E-core and P-core chip
• E : Efficiency cores
• P: Performance cores
Intel Turbo Boost
• The CPU determines how close the processor is to its
maximum thermal design power (TDP)
• If Intel Turbo Boost Technology sees that the
CPU is operating well within its limits, Turbo
Boost can kick in and raise the clock frequency
ASIPs – Application Specific Instruction Set Processors
• The instruction set of an ASIP is tailored to benefit a
specific application.
• This specialization of the core provides a tradeoff
between the flexibility of a general-purpose CPU
and the performance of an ASIC
ASIPs flexibility vs. performance (source: Synopsys)
ASIPs Design Flow – Target Compiler Technologies
Target Compiler Technologies
Target acquired by Synopsys
Tensilica Xtensa
• Provided as synthesizable RTL core ASIP
– Gate count range: 25,000 – 150,000+
– Increase in gates as customer adds instructions or
optional features
• Software development tools
• Basic architecture
– 78 instructions
– five-stage pipeline that supports single-cycle execution
– 1 - load/store model
– 32-entry orthogonal register file
– 32 optional extra registers
• Founded by Chris Rowen
Xtensa Family
• Xtensa I to V
Tensilica (acquired by Cadence)
• Customizable Processor IP (ASIP)
– http://ip.cadence.com/ipportfolio/tensilica-ip
– Xtensa Processor Generator
Soft Processors
• FPGAs also provide their own configurable ‘soft
processors’ (implemented on LUTs)
– Altera – Nios
– Xilinx – MicroBlaze
• Benefits of a Soft processor:
Intel (Altera ) Nios II
• Core Block diagram
• 32-bit RISC processor
Xilinx MicroBlaze
• Core Block diagram
• 32-bit RISC processor
Summary
• Vector processors
• CISC vs. RISC processors
• Embedded SoC microprocessors
• ARM
– Cortex-A|R|M
– Internal structure
– Power saving features
– big.LITTLE
• MIPS
• Application Specific Instruction Set Processors (ASIPs)
– Target
– Tensilica
• Soft processors
– Nios II
– MicroBlaze