The Cell Processor
From conception to deployment
Presented by Nathan Lemieux
November 16, 2005
Created for CS625a @ UWO
Overview
Brief History of the Cell Conception
Cell’s Architecture
Comparisons to other Architectures
Design Decisions
Conclusions
Extra tidbits
History
Idea generated by SCEI in 1999 after release of PS2
STI group formed in 2000
In 2001 the first design center opened in the US
Fall 2002 US patent released
Since then prototypes have been developed and clocked
at over 4.5 GHz
February 2005 final architecture revealed to public
In 2005 it was announced that the first commercial Cell
product will be released in 2006
Sony Toshiba IBM Group (STI)
Sony
Leading manufacturer of consumer and professional
audio and video products. Includes SCEI, which
produces the PS consoles
Toshiba
A leader in the development of consumer electronics
such as HDTV and other devices
IBM
Proven track record as a leader in manufacturing
state-of-the-art microprocessors
STI
Each brings different knowledge
Each has different requirements and
expectations
Power consumption
Size
Performance
Scalability
Cost
Cell Architecture Overview
Cell Architecture Overview Continued
Intended to be configurable
Basic Configuration consists of:
1 Power Processing Element (PPE)
8 Synergistic Processing Elements (SPE)
Element Interconnect Bus (EIB)
Rambus Memory Interface Controller (MIC)
Rambus FlexIO interface
512 KB system Level 2 cache
Power Processing Element (PPE)
Acts as the host processor and performs scheduling for
the SPEs
64-bit processor based on IBM POWER architecture
(Performance Optimization With Enhanced RISC)
Dual threaded, in-order execution
32 KB Level 1 instruction and data caches, connected to the
512 KB system level 2 cache
Contains VMX (AltiVec) unit and IBM hypervisor
technology to allow two operating systems to run
concurrently (Such as Linux and a real-time OS for
gaming)
Synergistic Processing Unit (SPU)
SIMD vector processor that acts independently
Handles most of the computational workload
In-order execution like the PPE, but dual issue
Contains 256 KB of local store memory
Contains 128 registers, each 128 bits wide
Synergistic Processing Unit (SPU) Continued
SPUs operate on registers, which are loaded from and
stored to the local store.
SPEs cannot act directly on main memory; they
have to move data to and from their local stores.
A DMA engine in each SPE moves data
between main memory and the local store.
Local Store addresses are aliased in the PPE
address map, and transfers between Local
Store and memory (including other Local Stores)
are coherent in the system
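The local store rule can be sketched as a toy model in Python. This is not the Cell SDK: ToySPE, dma_get, dma_put, add_one and process are hypothetical names, and the 64 KB chunk size is an arbitrary assumption. The only figure taken from the slides is the 256 KB local store. The point it shows is that compute touches the local store only, and data reaches main memory through explicit copies.

```python
# Toy model: SPU compute sees only the local store; explicit DMA-style
# copies move data between main memory and the local store.
# All names are hypothetical; this is not the Cell SDK API.

LOCAL_STORE_SIZE = 256 * 1024  # 256 KB local store per SPE (from the slides)


class ToySPE:
    def __init__(self):
        self.local_store = bytearray(LOCAL_STORE_SIZE)

    def dma_get(self, main_memory, src, ls_addr, size):
        """Copy size bytes from main memory into the local store."""
        if ls_addr + size > LOCAL_STORE_SIZE:
            raise ValueError("transfer does not fit in the local store")
        if src + size > len(main_memory):
            raise ValueError("transfer reads past the end of main memory")
        self.local_store[ls_addr:ls_addr + size] = main_memory[src:src + size]

    def dma_put(self, main_memory, dst, ls_addr, size):
        """Copy size bytes from the local store back to main memory."""
        if dst + size > len(main_memory):
            raise ValueError("transfer writes past the end of main memory")
        main_memory[dst:dst + size] = self.local_store[ls_addr:ls_addr + size]

    def add_one(self, ls_addr, size):
        """Stand-in for SPU compute: reads and writes the local store only."""
        for i in range(ls_addr, ls_addr + size):
            self.local_store[i] = (self.local_store[i] + 1) % 256


def process(main_memory, chunk_size=64 * 1024):
    """Stream a buffer through one SPE in chunks that fit the local store."""
    spe = ToySPE()
    for offset in range(0, len(main_memory), chunk_size):
        size = min(chunk_size, len(main_memory) - offset)
        spe.dma_get(main_memory, offset, 0, size)   # main memory -> local store
        spe.add_one(0, size)                        # compute on local store
        spe.dma_put(main_memory, offset, 0, size)   # local store -> main memory
    return main_memory
```

A buffer larger than the local store has to be processed in pieces like this, which is why the chunk size becomes a tuning decision for the programmer.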
Element Interconnect Bus (EIB)
Contains 4 channels.
Each channel can transfer 24 bytes per cycle (16 bytes data
+ 8 bytes tag), for a total of 96 bytes/cycle.
Enables communication between the SPEs and the PPE, and is
also connected to the level 2 cache, memory controller and FlexIO
The design allows for different configurations
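The EIB per-cycle figure is simple arithmetic on the numbers above. A minimal sketch, with variable names chosen only for illustration:

```python
# Figures quoted on the slide.
CHANNELS = 4
DATA_BYTES_PER_CYCLE = 16   # payload per channel per cycle
TAG_BYTES_PER_CYCLE = 8     # tag per channel per cycle

bytes_per_channel = DATA_BYTES_PER_CYCLE + TAG_BYTES_PER_CYCLE  # 24
eib_bytes_per_cycle = CHANNELS * bytes_per_channel              # 96
eib_data_bytes_per_cycle = CHANNELS * DATA_BYTES_PER_CYCLE      # 64

print(eib_bytes_per_cycle, eib_data_bytes_per_cycle)
```

Of the 96 bytes/cycle, 64 bytes are data and the rest is tag.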
Rambus Contributions
Memory Controller
Dual-channel Rambus XDR controller;
peak memory bandwidth is 25.6 GB per second (2
channels x 2 devices per channel x 2 bytes per device
x 3.2 GHz)
I/O Controller
Rambus FlexIO is capable of running from 400 MHz
to 8 GHz.
Contains 12 lanes (5 inbound, 7 outbound)
for a theoretical peak I/O bandwidth of 76.8 GB/s @ 8
GHz (44.8 GB/s out, 32 GB/s in)
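The Rambus bandwidth figures can be checked with a few lines of Python. This is a sketch that only re-derives the quoted numbers; the variable names are illustrative. It assumes the FlexIO total is spread evenly across the 12 lanes, which works out to 6.4 GB/s per lane and reproduces the quoted in/out split.

```python
# XDR memory controller: figures quoted on the slide.
XDR_CHANNELS = 2
XDR_DEVICES_PER_CHANNEL = 2
XDR_BYTES_PER_DEVICE = 2
XDR_RATE_GHZ = 3.2

xdr_peak_gb_s = (XDR_CHANNELS * XDR_DEVICES_PER_CHANNEL
                 * XDR_BYTES_PER_DEVICE * XDR_RATE_GHZ)      # about 25.6

# FlexIO: total and lane counts quoted on the slide.
FLEXIO_TOTAL_GB_S = 76.8
FLEXIO_LANES_IN = 5
FLEXIO_LANES_OUT = 7

# Assumption: every lane carries an equal share of the total.
flexio_per_lane_gb_s = FLEXIO_TOTAL_GB_S / (FLEXIO_LANES_IN + FLEXIO_LANES_OUT)
flexio_out_gb_s = FLEXIO_LANES_OUT * flexio_per_lane_gb_s    # about 44.8
flexio_in_gb_s = FLEXIO_LANES_IN * flexio_per_lane_gb_s      # about 32.0

print(round(xdr_peak_gb_s, 1), round(flexio_per_lane_gb_s, 1),
      round(flexio_out_gb_s, 1), round(flexio_in_gb_s, 1))
```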
Processing Power
8 (SPEs) x 4 GHz x 4 (32-bit words in a vector) x 2
(multiply-adds are counted as 2 operations) =
256 SP GFLOPS
Each SPE is capable of 32 SP GFLOPS
Each SPE can produce 2 DP FMADD operations
every 7 cycles: ~2.3 DP GFLOPS per SPE, ~18.4 in total
These calculations do not include the processing
power of the PPE
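The peak figures above follow from a short calculation, sketched here in Python. The function names are illustrative, and each DP FMADD is assumed to count as 2 operations, the same convention the slide uses for single precision. Without intermediate rounding the DP total comes to about 18.3 GFLOPS; the slide's ~18.4 is 8 x the rounded 2.3.

```python
def sp_gflops(spes, clock_ghz, words_per_vector=4, ops_per_fma=2):
    """Peak single-precision GFLOPS: one vector multiply-add per cycle."""
    return spes * clock_ghz * words_per_vector * ops_per_fma


def dp_gflops(spes, clock_ghz, fmadds=2, ops_per_fma=2, cycles=7):
    """Peak double-precision GFLOPS: `fmadds` multiply-adds every `cycles`."""
    return spes * clock_ghz * fmadds * ops_per_fma / cycles


print(sp_gflops(8, 4))              # all 8 SPEs at 4 GHz
print(sp_gflops(1, 4))              # a single SPE
print(round(dp_gflops(1, 4), 2))    # a single SPE, double precision
print(round(dp_gflops(8, 4), 2))    # all 8 SPEs, double precision
```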
Architecture Wrap Up
Cell needs to be configured for different uses
Allows for variable number of PPEs and SPEs with
different memory configurations
Newer-generation Cells will be compatible with older
generations
Cells are designed to work together; even distributed
over a network
Architecture Wrap Up Continued
Tasks are divided into SPE and PPE “modules”
or jobs.
Different resource allocation schemes available
PPE Scheduling – The PPE maintains a job queue
SPE self scheduling – Scheduling is distributed
across the SPEs. The PPE still maintains the job queue
Stream processing – Each SPE runs a distinct
program, and the programs are chained together.
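The three allocation schemes can be contrasted with a small, single-threaded Python model. This is a sketch under stated assumptions, not how the Cell runtime works: the function names are hypothetical, round-robin dispatch is an assumed policy for PPE scheduling, and "the earliest free SPE pulls the next job" is an assumed policy for self scheduling.

```python
from collections import deque


def ppe_scheduling(jobs, num_spes):
    """PPE scheduling: the PPE owns the queue and hands each job to an SPE."""
    queue = deque(jobs)
    assignments = {spe: [] for spe in range(num_spes)}
    spe = 0
    while queue:
        assignments[spe].append(queue.popleft())
        spe = (spe + 1) % num_spes  # round-robin dispatch (an assumption)
    return assignments


def spe_self_scheduling(jobs, num_spes, cost):
    """Self scheduling: whichever SPE becomes free first pulls the next job."""
    queue = deque(jobs)  # the queue itself is still kept by the PPE
    busy_until = [0] * num_spes
    assignments = {spe: [] for spe in range(num_spes)}
    while queue:
        spe = busy_until.index(min(busy_until))  # earliest free SPE
        job = queue.popleft()
        assignments[spe].append(job)
        busy_until[spe] += cost(job)
    return assignments


def stream_processing(items, stages):
    """Stream processing: each stage stands for one SPE's program."""
    for stage in stages:
        items = [stage(x) for x in items]  # output of one stage feeds the next
    return items


print(ppe_scheduling(["a", "b", "c", "d", "e"], 2))
print(spe_self_scheduling([5, 1, 1, 1], 2, cost=lambda job: job))
print(stream_processing([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]))
```

In this model, self scheduling keeps the short jobs away from the SPE that is stuck on the long one, while round-robin dispatch ignores job length.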
Processing Power Continued
Supercomputer rankings are based on double-precision
calculations
BlueGene/L, a supercomputer developed by IBM, has
a theoretical peak performance of 183,500
GFLOPS but has only achieved 136,800
GFLOPS. BlueGene/L has 65,536
processors, giving each processor a theoretical
peak performance of approximately 2.8 DP
GFLOPS
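The per-processor figure is a simple division of the quoted numbers, sketched here in Python with illustrative variable names. The same numbers also give the fraction of theoretical peak actually achieved, roughly three quarters.

```python
# BlueGene/L figures quoted on the slide.
PEAK_GFLOPS = 183500
ACHIEVED_GFLOPS = 136800
PROCESSORS = 65536

peak_per_processor = PEAK_GFLOPS / PROCESSORS      # about 2.8 DP GFLOPS
achieved_fraction = ACHIEVED_GFLOPS / PEAK_GFLOPS  # about 0.75 of peak

print(round(peak_per_processor, 2), round(achieved_fraction, 3))
```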
Comparison To Other Architectures
x86
CISC
Contains multiple levels of cache and OOO hardware
Current trend is a dual-core approach
GPU
Specific purpose
Contains vertex/pixel units, which are similar to the SPE
Connected to its own high-speed memory
Design Decisions
STI members each have different expectations,
but power consumption and performance are
shared prerequisites among them
Techniques such as OOO execution, branch
prediction units and large caches have been
developed to increase performance, but the
trade-off is increased complexity, power
consumption, size and heat.
Because of the heat issue, other manufacturers are moving
toward dual-core processors.
Design Decisions Continued
STI removed and/or modified all the techniques other
manufacturers have used to increase performance, which
reduced complexity, power consumption and space
To combat the reduced performance they looked at the
memory latency issue and introduced local store
memory that is closer to the execution units, used
the extra space to insert more execution units, and
introduced a large register file
Uses a multi-core approach that is easily scalable to
multiple Cells
Since there is reduced power consumption and heat
generation, the Cell's clock frequency can be cranked
up
Conclusions
9-core processor with revolutionary design
Very scalable in design and flexible in its uses
Programming will most likely be difficult at first,
but future compilers will hopefully make things
simpler
Current POWER apps will port easily to the Cell
Will perform exceptionally well in its niche
markets but may never be seen in a desktop PC
What’s Apple Doing?
Recently announced that they are no
longer using IBM's PowerPC
Cell design changed from previous design
to include larger PPE with more advanced
VMX (AltiVec) unit
Giving up the chance to be the distributor
of Cell based desktops, for power hungry
Intel chips
Reasons?
PPC970FX failing to reach 3 GHz?
Shortages of PPC?
Higher cost of PPC processor?
Strategic Alliance?
Sony’s PS3
PS3 Specs
Cell processor @ 3.2 GHz
7 functional SPEs out of 8 (redundancy?)
Total 218 SP GFLOPS
nVidia RSX GPU (1.8 TFLOPS)
256 MB XDR RAM
256 MB GDDR3 VRAM
Up to 7 Bluetooth controllers
Backwards compatible, WiFi capabilities with
PSP
?