Lecture Outline
• Design Principles for Modern Computers
• Parallelism
• Instruction-Level Parallelism
– Pipelining
– Dual Pipelines
– Superscalar Architectures
• Processor-Level Parallelism
– Array Computers
– Multiprocessors
– Multicomputers
Design Principles for Modern Computers

There is a set of design principles, sometimes called the RISC design principles, that architects of general-purpose CPUs do their best to follow:

• All Instructions Are Directly Executed by Hardware
– eliminates a level of interpretation

• Maximise the Rate at Which Instructions are Issued
– MIPS = millions of instructions per second
– MIPS speed is related to the number of instructions issued per second
– parallelism can play a role

• Instructions Should Be Easy to Decode
– a critical limit on the rate of issue of instructions
– make instructions regular, fixed length, with a small number of fields (see the decoding sketch after this list)
– the fewer different formats for instructions, the better

• Only Loads and Stores Should Reference Memory
– operands for most instructions should come from, and return to, registers
– access to memory can take a long time
– thus, only LOAD and STORE instructions should reference memory

• Provide Plenty of Registers
– since accessing memory is relatively slow, many registers (at least 32) need to be provided, so that once a word is fetched, it can be kept in a register until it is no longer needed
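To make the decoding principle concrete, the C sketch below decodes a hypothetical fixed-length 32-bit format. The field layout (a 6-bit opcode and three 5-bit register fields) is an illustrative assumption, not any real ISA's encoding; the point is that fixed positions reduce decoding to a few shifts and masks.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical fixed 32-bit instruction format: opcode in bits 31..26,
       three 5-bit register fields below it. Because every field sits at a
       known position, decoding needs no variable-length parsing. */
    typedef struct {
        uint8_t opcode, rs, rt, rd;
    } Instr;

    static Instr decode(uint32_t word) {
        Instr i;
        i.opcode = (word >> 26) & 0x3F;   /* 6-bit opcode        */
        i.rs     = (word >> 21) & 0x1F;   /* first source reg    */
        i.rt     = (word >> 16) & 0x1F;   /* second source reg   */
        i.rd     = (word >> 11) & 0x1F;   /* destination reg     */
        return i;
    }

    int main(void) {
        Instr i = decode(0x00221800u);    /* made-up word: op=0, rs=1, rt=2, rd=3 */
        printf("op=%u rs=%u rt=%u rd=%u\n", i.opcode, i.rs, i.rt, i.rd);
        return 0;
    }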
Parallelism

• Computer architects are constantly striving to improve the performance of the machines they design.
• Making the chips run faster by increasing their clock speed is one way.
• However, most computer architects look to parallelism (doing two or more things at once) as a way to get even more performance for a given clock speed.
• Parallelism comes in two general forms:
– instruction-level parallelism, and
– processor-level parallelism.

Instruction-Level Parallelism

• Parallelism is exploited within individual instructions to get more instructions/sec out of the machine.
• We will consider two approaches:
– Pipelining
– Superscalar Architectures
Pipelining

• Fetching of instructions from memory is a major bottleneck in instruction execution speed. However, computers have the ability to fetch instructions from memory in advance.
• These instructions are stored in a set of registers called the prefetch buffer.
• Thus, instruction execution is divided into two parts: fetching and actual execution.
• The concept of a pipeline carries this strategy much further. Instead of dividing instruction execution into only two parts, it is often divided into many parts, each one handled by a dedicated piece of hardware, all of which can run in parallel.

An Example of Pipelining

[Figure 2-4. (a) A five-stage pipeline (S1 instruction fetch unit, S2 instruction decode unit, S3 operand fetch unit, S4 instruction execution unit, S5 write back unit). (b) The state of each stage as a function of time; nine clock cycles are illustrated.]
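The timing in Figure 2-4(b) can be reproduced with a short C sketch. Only the stage and cycle counts come from the figure; the table layout is illustrative. Instruction n occupies stage s during clock cycle n + s - 1, so after a four-cycle fill the pipeline completes one instruction per cycle.

    #include <stdio.h>

    /* Prints the stage-occupancy table of Figure 2-4(b): instruction n is
       in stage s during cycle n + s - 1, so solving for n tells us which
       instruction sits in each stage at each cycle. */
    #define STAGES 5
    #define CYCLES 9

    int main(void) {
        for (int s = 1; s <= STAGES; s++) {
            printf("S%d:", s);
            for (int c = 1; c <= CYCLES; c++) {
                int n = c - s + 1;          /* instruction now in stage s */
                if (n >= 1) printf(" %d", n);
                else        printf("  ");   /* stage not yet filled */
            }
            printf("\n");
        }
        return 0;
    }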
Dual Pipelines

• If one pipeline is good, then surely two pipelines are better.
• Here a single instruction fetch unit fetches pairs of instructions together and puts each one into its own pipeline, complete with its own ALU for parallel operation.
• To be able to run in parallel, the two instructions must not conflict over resource usage (e.g., registers), and neither must depend on the result of the other (a pairing check is sketched below).

Example: Dual Pipelines

[Figure 2-5. Dual five-stage pipelines with a common instruction fetch unit: one S1 instruction fetch unit feeds two parallel pipelines, each with its own instruction decode, operand fetch, instruction execution, and write back units.]
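The pairing rule can be made concrete with a small check. The three-register instruction representation below is a hypothetical simplification (real pairing logic also considers functional units, memory operands, and so on); it tests only the register conflicts and the result dependency named above.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical instruction: one destination register, two sources. */
    typedef struct {
        int dest;          /* register written */
        int src1, src2;    /* registers read   */
    } Instr;

    /* Two instructions may issue to the dual pipelines together only if
       neither reads or writes a register the other writes. */
    static bool can_pair(Instr a, Instr b) {
        bool raw = (b.src1 == a.dest) || (b.src2 == a.dest); /* b needs a's result */
        bool war = (a.src1 == b.dest) || (a.src2 == b.dest); /* b clobbers a's input */
        bool waw = (a.dest == b.dest);                       /* both write same reg */
        return !raw && !war && !waw;
    }

    int main(void) {
        Instr i1 = { .dest = 1, .src1 = 2, .src2 = 3 };  /* r1 = r2 + r3 */
        Instr i2 = { .dest = 4, .src1 = 1, .src2 = 5 };  /* r4 = r1 + r5 */
        printf("pair? %s\n", can_pair(i1, i2) ? "yes" : "no"); /* "no": i2 needs r1 */
        return 0;
    }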
Superscalar Architectures

• Going to four pipelines is conceivable, but doing so duplicates too much hardware.
• Instead, a different approach is used on high-end CPUs.
• The basic idea is to have just a single pipeline but to give it multiple functional units.
• This is a superscalar architecture: using more than one ALU, so that more than one instruction can be executed in parallel.
• Implicit in the idea of a superscalar processor is that the S3 stage can issue instructions considerably faster than the S4 stage is able to execute them.

[Figure 2-6. A superscalar processor with five functional units: the S1 instruction fetch, S2 instruction decode, and S3 operand fetch units feed an S4 stage containing two ALUs, a LOAD unit, a STORE unit, and a floating point unit, followed by the S5 write back unit.]
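A minimal sketch of the issue/execute mismatch follows. The unit names match Figure 2-6, but the latencies and the instruction stream are made-up illustrations: S3 issues one instruction per cycle, stalling only when the needed functional unit is still busy, so slow units (e.g. floating point) need not hold everything else up.

    #include <stdio.h>

    enum { ALU1, ALU2, LOAD, STORE, FPU, NUNITS };

    int main(void) {
        const char *name[NUNITS]  = { "ALU1", "ALU2", "LOAD", "STORE", "FPU" };
        const int latency[NUNITS] = { 1, 1, 2, 2, 4 };   /* invented cycle counts */
        int busy_until[NUNITS]    = { 0 };
        int stream[] = { ALU1, FPU, ALU2, FPU, ALU1 };   /* unit each instr needs */
        int n = sizeof stream / sizeof stream[0];
        int cycle = 1;
        for (int i = 0; i < n; i++, cycle++) {
            int u = stream[i];
            if (busy_until[u] > cycle)        /* needed unit busy: issue stalls */
                cycle = busy_until[u];
            busy_until[u] = cycle + latency[u];
            printf("cycle %d: issue instruction %d to %s\n", cycle, i + 1, name[u]);
        }
        return 0;
    }

Running it shows the second floating point instruction stalling until the FPU frees up, while instructions bound for other units keep issuing every cycle in between.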
Processor-Level Parallelism

• Instruction-level parallelism (pipelining and superscalar operation) rarely wins more than a factor of five or ten in processor speed.
• To get gains of 50, 100, or more, the only way is to design computers with multiple CPUs.
• We will consider three alternative architectures:
– Array Computers
– Multiprocessors
– Multicomputers

Array Computers

• An array processor consists of a large number of identical processors that perform the same sequence of instructions on different sets of data.
• A vector processor is efficient at executing a sequence of operations on pairs of data elements; all of the addition operations are performed in a single, heavily pipelined adder.
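The kind of loop both designs accelerate is elementwise arithmetic. The C sketch below is an illustrative stand-in: on an array processor the N additions would run on N processing elements simultaneously, while on a vector processor they would stream through one pipelined adder.

    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        float b[N] = { 8, 7, 6, 5, 4, 3, 2, 1 };
        float c[N];
        /* Same operation on every element pair: pure data parallelism. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
        for (int i = 0; i < N; i++)
            printf("%.0f ", c[i]);          /* prints eight 9s */
        printf("\n");
        return 0;
    }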
Example: Array Computers

[Figure 2-7. An array processor of the ILLIAC IV type: a control unit broadcasts instructions to an 8 × 8 processor/memory grid, each element pairing a processor with its own memory.]

Multiprocessors

• The processing elements in an array processor are not independent CPUs, since there is only one control unit.
• The first parallel system with multiple full-blown CPUs is the multiprocessor.
• This is a system with more than one CPU sharing a common memory, co-ordinated in software.
• The simplest one is to have a single bus with multiple CPUs and one memory all plugged into it.
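A minimal sketch of the shared-memory model, using POSIX threads as stand-ins for CPUs (the counter and the thread count are made up): every worker touches the same variable in the common memory, and the co-ordination is done in software, here with a mutex. Compile with -lpthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long shared_counter = 0;     /* lives in the one common memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);  /* software co-ordination */
            shared_counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", shared_counter);   /* 400000 */
        return 0;
    }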
Example: Multiprocessors

[Figure 2-8. (a) A single-bus multiprocessor. (b) A multicomputer with local memories. Panel (a): CPUs and a shared memory plugged into one bus; panel (b): each CPU additionally has a local memory of its own.]

Multicomputers

• Although multiprocessors with a small number of processors (< 64) are relatively easy to build, large ones are surprisingly difficult to construct.
• The difficulty is in connecting all the processors to the memory.
• To get around these problems, many designers have simply abandoned the idea of having a shared memory and just build systems consisting of large numbers of interconnected computers, each having its own private memory but no common memory.
• These systems are called multicomputers.
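A minimal sketch of the message-passing style a multicomputer forces, written against MPI (a common message-passing interface for such systems; the value and ranks are made up): with no common memory, node 0 must explicitly send its data to node 1. Build and launch with an MPI implementation, e.g. mpicc and mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;   /* exists only in node 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }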