

imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.

precise interrupt Also called precise exception. An interrupt or exception that is always associated with the correct instruction in pipelined computers.

Elaboration: The difficulty of always associating the proper exception with the correct instruction in pipelined computers has led some computer designers to relax this requirement in noncritical cases. Such processors are said to have imprecise interrupts or imprecise exceptions. In the example above, PC would normally have 58hex at the start of the clock cycle after the exception is detected, even though the offending instruction is at address 4Chex. A processor with imprecise exceptions might put 58hex into ELR and leave it up to the operating system to determine which instruction caused the problem. LEGv8 and the vast majority of computers today support precise interrupts or precise exceptions. One reason is that the designers of a deeper pipeline might be tempted to record a different value in the ELR, which would create headaches for the OS. To prevent that, the deeper pipeline would likely be required to record the same PC that would have been recorded in the five-stage pipeline. It is simpler for everyone to just record the PC of the faulting instruction instead. (Another reason is to support virtual memory, which we shall see in Chapter 5.)
computers. Elaboration: We show that LEGv8 uses the exception entry address
0000 0000 1C09 0000hex, which is based on the ARMv8 Model Architecture. ARMv8
can have different exception entry addresses, depending on the platform.

Elaboration: The LEGv8 architecture has three levels of exception, each with its own ELR and ESR registers, as we'll see in Chapter 5.

Check Yourself: Which exception should be recognized first in this sequence?

1. XXX X1, X2, X1 // undefined instruction
2. SUB X1, X2, X1 // hardware error

4.10 Parallelism via Instructions

Be forewarned: this section is a brief overview of fascinating but complex topics. If you want to learn more details, you should consult our more advanced book, Computer Architecture: A Quantitative Approach, fifth edition, where the material covered in these 13 pages is expanded to almost 200 pages (including appendices)!
Pipelining exploits the potential parallelism among instructions. This parallelism is called, naturally enough, instruction-level parallelism (ILP). There are two primary methods for increasing the potential amount of instruction-level parallelism. The first is increasing the depth of the pipeline to overlap more instructions. Using our laundry analogy and assuming that the washer cycle was longer than the others were, we could divide our washer into three machines that perform the wash, rinse, and spin steps of a traditional washer. We would then move from a four-stage to a six-stage pipeline. To get the full speed-up, we need to rebalance the remaining steps so they are the same length, in processors or in laundry. The amount of parallelism being exploited is higher, since there are more operations being overlapped. Performance is potentially greater since the clock cycle can be shorter.

instruction-level parallelism The parallelism among instructions.
Another approach is to replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage. The general name for this technique is multiple issue. A multiple-issue laundry would replace our household washer and dryer with, say, three washers and three dryers. You would also have to recruit more assistants to fold and put away three times as much laundry in the same amount of time. The downside is the extra work to keep all the machines busy and transferring the loads to the next pipeline stage.

multiple issue A scheme whereby multiple instructions are launched in one clock cycle.
Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, stated alternatively, the CPI to be less than 1. As mentioned in Chapter 1, it is sometimes useful to flip the metric and use IPC, or instructions per clock cycle. Hence, a 3-GHz four-way multiple-issue microprocessor can execute at a peak rate of 12 billion instructions per second and have a best-case CPI of 0.25, or an IPC of 4. Assuming a five-stage pipeline, such a processor would have 20 instructions in execution at any given time. Today's high-end microprocessors attempt to issue from three to six instructions in every clock cycle. Even moderate designs will aim at a peak IPC of 2. There are typically, however, many constraints on what types of instructions may be executed simultaneously, and what happens when dependences arise.

There are two main ways to implement a multiple-issue processor, with the major difference being the division of work between the compiler and the hardware. Because the division of work dictates whether decisions are being made statically (that is, at compile time) or dynamically (that is, during execution), the approaches are sometimes called static multiple issue and dynamic multiple issue. As we will see, both approaches have other, more commonly used names, which may be less precise or more restrictive.

static multiple issue An approach to implementing a multiple-issue processor where many decisions are made by the compiler before execution.

dynamic multiple issue An approach to implementing a multiple-issue processor where many decisions are made during execution by the processor.
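In equation form, as a quick check of the numbers above:

    peak instruction rate  = clock rate × issue width = 3 GHz × 4 = 12 × 10^9 instructions/second
    best-case CPI          = 1 / issue width = 1/4 = 0.25  (equivalently, best-case IPC = 4)
    instructions in flight = issue width × pipeline depth = 4 × 5 = 20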
Two primary and distinct responsibilities must be dealt with in a multiple-issue pipeline:

1. Packaging instructions into issue slots: how does the processor determine how many instructions and which instructions can be issued in a given clock cycle? In most static issue processors, this process is at least partially handled by the compiler; in dynamic issue designs, it is normally dealt with at runtime by the processor, although the compiler will often have already tried to help improve the issue rate by placing the instructions in a beneficial order.

2. Dealing with data and control hazards: in static issue processors, the compiler handles some or all of the consequences of data and control hazards statically. In contrast, most dynamic issue processors attempt to alleviate at least some classes of hazards using hardware techniques operating at execution time.

issue slots The positions from which instructions could issue in a given clock cycle; by analogy, these correspond to positions at the starting blocks for a sprint.

Although we describe these as distinct approaches, in reality, one approach often borrows techniques from the other, and neither approach can claim to be perfectly pure.

The Concept of Speculation


One of the most important methods for finding and exploiting more ILP is
speculation. Based on the great idea of prediction, speculation is an approach
that allows the compiler or the processor to “guess” about the properties of an
instruction, to enable execution to begin for other instructions that may depend
on the speculated instruction. For example, we might speculate on the outcome of
a branch, so that instructions after the branch could be executed earlier. Another
example is that we might speculate that a store that precedes a load does not refer to
the same address, which would allow the load to be executed before the store. The
difficulty with speculation is that it may be wrong. So, any speculation mechanism must include both a method to check if the guess was right and a method to unroll or back out the effects of the instructions that were executed speculatively. The implementation of this back-out capability adds complexity.

speculation An approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions.

Speculation may be done in the compiler or by the hardware. For example, the compiler can use speculation to reorder instructions, moving an instruction across a branch or a load across a store. The processor hardware can perform the same transformation at runtime using techniques we discuss later in this section.
The recovery mechanisms used for incorrect speculation are rather different.
In the case of speculation in software, the compiler usually inserts additional
instructions that check the accuracy of the speculation and provide a fix-up
routine to use when the speculation is wrong. In hardware speculation, the
processor usually buffers the speculative results until it knows they are no longer
speculative. If the speculation is correct, the instructions are completed by
allowing the contents of the buffers to be written to the registers or memory. If
the speculation is incorrect, the hardware flushes the buffers and re-executes the
correct instruction sequence. Misspeculation typically requires the pipeline to be
flushed, or at least stalled, and thus further reduces performance.
Speculation introduces one other possible problem: speculating on certain
instructions may introduce exceptions that were formerly not present. For example,
suppose a load instruction is moved in a speculative manner, but the address it uses
is not within bounds when the speculation is incorrect. The result would be that an
exception that should not have occurred would occur. The problem is complicated
by the fact that if the load instruction were not speculative, then the exception
must occur! In compiler-based speculation, such problems are avoided by adding
special speculation support that allows such exceptions to be ignored until it is
clear that they really should occur. In hardware-based speculation, exceptions
are simply buffered until it is clear that the instruction causing them is no longer
speculative and is ready to complete; at that point, the exception is raised, and
normal exception handling proceeds.
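To make the problem concrete, consider hoisting a load above the branch that guards it (a sketch with registers of our own choosing, not an example from the text):

      // Original code: the load executes only when X9 holds a valid (non-zero) address.
      CBZ  X9, Skip      // skip the load if X9 is zero
      LDUR X10, [X9,#0]  // safe: reached only when X9 is non-zero
Skip: ...

      // After speculatively moving the load above the branch:
      LDUR X10, [X9,#0]  // may raise a spurious exception when X9 is zero
      CBZ  X9, Skip      // if taken, the loaded value is simply never used
Skip: ...

If the speculation is wrong (X9 is zero), the hoisted load faults even though the original program would never have executed it, so the exception must be suppressed or deferred as described above.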

Since speculation can improve performance when done properly and decrease
performance when done carelessly, significant effort goes into deciding when it
is appropriate to speculate. Later in this section, we will examine both static and
dynamic techniques for speculation.

Static Multiple Issue


Static multiple-issue processors all use the compiler to assist with packaging instructions and handling hazards. In a static issue processor, you can think of the set of instructions issued in a given clock cycle, which is called an issue packet, as one large instruction with multiple operations. This view is more than an analogy. Since a static multiple-issue processor usually restricts what mix of instructions can be initiated in a given clock cycle, it is useful to think of the issue packet as a single instruction allowing several operations in certain predefined fields. This view led to the original name for this approach: Very Long Instruction Word (VLIW).

issue packet The set of instructions that issues together in one clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.

Very Long Instruction Word (VLIW) A style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields.

Most static issue processors also rely on the compiler to take on some responsibility for handling data and control hazards. The compiler's responsibilities may include static branch prediction and code scheduling to reduce or prevent all hazards. Let's look at a simple static issue version of an LEGv8 processor, before we describe the use of these techniques in more aggressive processors.

An Example: Static Multiple Issue with the LEGv8 ISA

To give a flavor of static multiple issue, we consider a simple two-issue LEGv8 processor, where one of the instructions can be an integer ALU operation or branch and the other can be a load or store. Such a design is like that used in some embedded processors. Issuing two instructions per cycle will require fetching and decoding 64 bits of instructions. In many static multiple-issue processors, and essentially all VLIW processors, the layout of simultaneously issuing instructions is restricted to simplify the decoding and instruction issue. Hence, we will require that the instructions be paired and aligned on a 64-bit boundary, with the ALU or branch portion appearing first. Furthermore, if one instruction of the pair cannot be used, we require that it be replaced with a nop. Thus, the instructions always issue in pairs, possibly with a nop in one slot. Figure 4.66 shows how the instructions look as they go into the pipeline in pairs.
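For instance, two consecutive issue packets might look like this (the particular instructions are our own illustration, not from the text):

      ADD  X5, X5, X6     LDUR X7, [X20,#16]  // both slots used: ALU op, then load
      SUB  X8, X5, X6     NOP                 // no memory operation available, so a nop fills the second slot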
Static multiple-issue processors vary in how they deal with potential data and
control hazards. In some designs, the compiler takes full responsibility for removing
all hazards, scheduling the code, and inserting no-ops so that the code executes
without any need for hazard detection or hardware-generated stalls. In others, the
hardware detects data hazards and generates stalls between two issue packets, while
requiring that the compiler avoid all dependences within an instruction packet.
Even so, a hazard generally forces the entire issue packet containing the dependent
instruction to stall. Whether the software must handle all hazards or only try to
reduce the fraction of hazards between separate issue packets, the appearance of
having a large single instruction with multiple operations is reinforced. We will
assume the second approach for this example.

Instruction type             Pipe stages
ALU or branch instruction    IF  ID  EX  MEM WB
Load or store instruction    IF  ID  EX  MEM WB
ALU or branch instruction        IF  ID  EX  MEM WB
Load or store instruction        IF  ID  EX  MEM WB
ALU or branch instruction            IF  ID  EX  MEM WB
Load or store instruction            IF  ID  EX  MEM WB
ALU or branch instruction                IF  ID  EX  MEM WB
Load or store instruction                IF  ID  EX  MEM WB

FIGURE 4.66 Static two-issue pipeline in operation. The ALU and data transfer instructions
are issued at the same time. Here we have assumed the same five-stage structure as used for the single-issue
pipeline. Although this is not strictly necessary, it does have some advantages. In particular, keeping the
register writes at the end of the pipeline simplifies the handling of exceptions and the maintenance of a
precise exception model, which become more difficult in multiple-issue processors.

To issue an ALU and a data transfer operation in parallel, the first need for
additional hardware—beyond the usual hazard detection and stall logic—is extra
ports in the register file (see Figure 4.67). In one clock cycle, we may need to read
two registers for the ALU operation and two more for a store, and also one write
port for an ALU operation and one write port for a load. Since the ALU is tied
up for the ALU operation, we also need a separate adder to calculate the effective
address for data transfers. Without these extra resources, our two-issue pipeline
would be hindered by structural hazards.
Clearly, this two-issue processor can improve performance by up to a factor of
two! Doing so, however, requires that twice as many instructions be overlapped
in execution, and this additional overlap increases the relative performance loss
from data and control hazards. For example, in our simple five-stage pipeline,
loads have a use latency of one clock cycle, which prevents one instruction from using the result without stalling. In the two-issue, five-stage pipeline the result of a load instruction cannot be used on the next clock cycle. This means that the next two instructions cannot use the load result without stalling. Furthermore, ALU instructions that had no use latency in the simple five-stage pipeline now have a one-instruction use latency, since the results cannot be used in the paired load or store. To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compiler or hardware scheduling techniques are needed, and static multiple issue requires that the compiler take on this role.

use latency Number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.

FIGURE 4.67 A static two-issue datapath. The additions needed for double issue are highlighted: another 32 bits from instruction
memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles address
calculations for data transfers and the top ALU handles everything else.

Simple Multiple-Issue Code Scheduling

EXAMPLE: How would this loop be scheduled on a static two-issue pipeline for LEGv8?

Loop: LDUR X0, [X20,#0]   // X0 = array element
      ADD  X0, X0, X21    // add scalar in X21
      STUR X0, [X20,#0]   // store result
      SUBI X20, X20, #8   // decrement pointer
      CMP  X20, X22       // compare to loop limit
      BGT  Loop           // branch if X20 > X22
Reorder the instructions to avoid as many pipeline stalls as possible. Assume
branches are predicted, so that control hazards are handled by the hardware.
ANSWER: The first three instructions have data dependences, as do the next two. Figure 4.68 shows the best schedule for these instructions. Notice that just one pair
of instructions has both issue slots used. It takes five clocks per loop iteration;
at five clocks to execute six instructions, we get the disappointing CPI of 0.83
versus the best case of 0.5, or an IPC of 1.2 versus 2.0. Notice that in computing
CPI or IPC, we do not count any nops executed as useful instructions. Doing
so would improve CPI, but not performance!

      ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:                             LDUR X0, [X20,#0]           1
      SUBI X20, X20, #8                                       2
      ADD  X0, X0, X21                                        3
      CMP  X20, X22                                           4
      BGT  Loop                   STUR X0, [X20,#8]           5

FIGURE 4.68 The scheduled code as it would look on a two-issue LEGv8 pipeline. The
empty slots are no-ops.

An important compiler technique to get more performance from loops is loop unrolling, where multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping instructions from different iterations.

loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.

Loop Unrolling for Multiple-Issue Pipelines


EXAMPLE: See how well loop unrolling and scheduling work in the example above. For simplicity, assume that the loop index is a multiple of four.

ANSWER: To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of LDUR, ADD, and STUR, plus one SUBI, one CMP, and one BGT. Figure 4.69 shows the
unrolled and scheduled code.
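Before scheduling, the unrolled loop body would look something like the following sketch (our reconstruction; we keep X20 fixed until the end and use negative offsets, whereas Figure 4.69 decrements X20 first and uses positive offsets):

Loop: LDUR X0, [X20,#0]
      ADD  X0, X0, X21
      STUR X0, [X20,#0]
      LDUR X1, [X20,#-8]
      ADD  X1, X1, X21
      STUR X1, [X20,#-8]
      LDUR X2, [X20,#-16]
      ADD  X2, X2, X21
      STUR X2, [X20,#-16]
      LDUR X3, [X20,#-24]
      ADD  X3, X3, X21
      STUR X3, [X20,#-24]
      SUBI X20, X20, #32
      CMP  X20, X22
      BGT  Loop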
During the unrolling process, the compiler introduced additional registers (X1, X2, X3). The goal of this process, called register renaming, is to eliminate dependences that are not true data dependences, but could either lead to potential hazards or prevent the compiler from flexibly scheduling the code. Consider how the unrolled code would look using only X0. There would be repeated instances of LDUR X0, [X20,#0], ADD X0, X0, X21 followed by STUR X0, [X20,#8], but these sequences, despite using X0, are actually completely independent—no data values flow between one set of these instructions and the next set. This case is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence that carries a value (also called a true dependence).

register renaming The renaming of registers by the compiler or hardware to remove antidependences.

antidependence Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.

Renaming the registers during the unrolling process allows the compiler to subsequently move these independent instructions so as to better schedule the code. The renaming process eliminates the name dependences, while preserving the true dependences.
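Concretely, two unrolled iterations written with only X0 would look like this (a sketch):

      LDUR X0, [X20,#0]
      ADD  X0, X0, X21
      STUR X0, [X20,#0]
      LDUR X0, [X20,#-8]   // antidependent: must wait for the STUR above to read X0
      ADD  X0, X0, X21
      STUR X0, [X20,#-8]

The second LDUR cannot be moved above the first STUR solely because both name X0; renaming the second group to use X1 removes that ordering constraint.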

      ALU or branch instruction   Data transfer instruction   Clock cycle
Loop: SUBI X20, X20, #32          LDUR X0, [X20,#0]           1
                                  LDUR X1, [X20,#24]          2
      ADD  X0, X0, X21            LDUR X2, [X20,#16]          3
      ADD  X1, X1, X21            LDUR X3, [X20,#8]           4
      ADD  X2, X2, X21            STUR X0, [X20,#32]          5
      ADD  X3, X3, X21            STUR X1, [X20,#24]          6
      CMP  X20, X22               STUR X2, [X20,#16]          7
      BGT  Loop                   STUR X3, [X20,#8]           8

FIGURE 4.69 The unrolled and scheduled code of Figure 4.68 as it would look on a static
two-issue LEGv8 pipeline. The empty slots are no-ops. Since the first instruction in the loop decrements
X20 by 32, the addresses loaded are the original value of X20, then that address minus 8, minus 16, and
minus 24.

Notice now that 14 of the 15 instructions in the loop execute as pairs. It takes
eight clocks for four loop iterations, which yields an IPC of 15/8 = 1.88. Loop
unrolling and scheduling more than doubled performance—8 versus 20 clock
cycles for 4 iterations—partly from reducing the loop control instructions and
partly from dual issue execution. The cost of this performance improvement is
using four temporary registers rather than one, as well as more than doubling
the code size.
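As a quick check of the arithmetic in this answer:

    original schedule: 5 clocks/iteration × 4 iterations = 20 clocks
    unrolled schedule: 8 clocks for the same 4 iterations
    speedup = 20/8 = 2.5; IPC = 15 instructions / 8 clocks ≈ 1.88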

Dynamic Multiple-Issue Processors


Dynamic multiple-issue processors are also known as superscalar processors, superscalar An
or simply superscalars. In the simplest superscalar processors, instructions issue in advanced pipelining
order, and the processor decides whether zero, one, or more instructions can issue technique that enables the
in a given clock cycle. Obviously, achieving good performance on such a processor processor to execute more
than one instruction per
still requires the compiler to try to schedule instructions to move dependences clock cycle by selecting
apart and thereby improve the instruction issue rate. Even with such compiler them during execution.
scheduling, there is an important difference between this simple superscalar
and a VLIW processor: the code, whether scheduled or not, is guaranteed by
the hardware to execute correctly. Furthermore, compiled code will always run
correctly independent of the issue rate or pipeline structure of the processor. In
some VLIW designs, this has not been the case, and recompilation was required
when moving across different processor models; in other static issue processors,
code would run correctly across different implementations, but often so poorly as dynamic pipeline
to make compilation effectively required. scheduling Hardware
support for reordering
Many superscalars extend the basic framework of dynamic issue decisions to the order of instruction
include dynamic pipeline scheduling. Dynamic pipeline scheduling chooses execution to avoid stalls.
which instructions to execute in a given clock cycle while trying to avoid hazards
and stalls. Let’s start with a simple example of avoiding a data hazard. Consider
the following code sequence:
LDUR X0, [X21,#20]
ADD  X1, X0, X2
SUB  X23, X23, X3
ANDI X5, X23, #20

Even though the SUB instruction is ready to execute, it must wait for the LDUR
and ADD to complete first, which might take many clock cycles if memory is slow.
(Chapter 5 explains cache misses, the reason that memory accesses are sometimes
very slow.) Dynamic pipeline scheduling allows such hazards to be avoided either
fully or partially.
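A dynamically scheduled processor might carry out this sequence in the following order (a sketch of one legal possibility):

LDUR X0, [X21,#20]   // begins first; suppose it misses in the cache
SUB  X23, X23, X3    // independent of the load, so it executes while the load waits
ANDI X5, X23, #20    // depends only on the SUB, so it can also proceed
ADD  X1, X0, X2      // waits in a buffer until the load delivers X0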

Dynamic Pipeline Scheduling


Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering them to avoid stalls. In such processors, the pipeline is divided into three major units: an instruction fetch and issue unit, multiple functional units (a dozen or more in high-end designs in 2015), and a commit unit. Figure 4.70 shows the model. The first unit fetches instructions, decodes them, and sends each instruction to a corresponding functional unit for execution. Each functional unit has buffers, called reservation stations, which hold the operands and the operation. (In the next section, we will discuss an alternative to reservation stations used by many recent processors.) As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated. When the result is completed, it is sent to any reservation stations waiting for this particular result as well as to the commit unit, which buffers the result until it is safe to put the result into the register file or, for a store, into memory. The buffer in the commit unit, often called the reorder buffer, is also used to supply operands, in much the same way as forwarding logic does in a statically scheduled pipeline. Once a result is committed to the register file, it can be fetched directly from there, just as in a normal pipeline.

commit unit The unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.

reservation station A buffer within a functional unit that holds the operands and the operation.

reorder buffer The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.

The combination of buffering operands in the reservation stations and results in the reorder buffer provides a form of register renaming, just like that used by the compiler in our earlier loop-unrolling example on page 348. To see how this conceptually works, consider the following steps:
1. When an instruction issues, it is copied to a reservation station for the
appropriate functional unit. Any operands that are available in the register
file or reorder buffer are also immediately copied into the reservation station.
The instruction is buffered in the reservation station until all the operands
and the functional unit are available. For the issuing instruction, the register
copy of the operand is no longer required, and if a write to that register
occurred, the value could be overwritten.

[Figure: an in-order instruction fetch and decode unit issues instructions to reservation stations feeding multiple functional units (integer units, floating point, load-store), which execute out of order; results flow to a commit unit that commits in order.]
FIGURE 4.70 The three primary units of a dynamically scheduled pipeline. The final step of
updating the state is also called retirement or graduation.

2. If an operand is not in the register file or reorder buffer, it must be waiting to be produced by a functional unit. The name of the functional unit that will produce the result is tracked. When that unit eventually produces the result, it is copied directly into the waiting reservation station from the functional unit, bypassing the registers.

These steps effectively use the reorder buffer and the reservation stations to implement register renaming.

Conceptually, you can think of a dynamically scheduled pipeline as analyzing the data flow structure of a program. The processor then executes the instructions in some order that preserves the data flow order of the program. This style of execution is called out-of-order execution, since the instructions can be executed in a different order than they were fetched.

out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.

To make programs behave as if they were running on a simple in-order pipeline, the instruction fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked, and the commit unit is required to write results to registers and memory in program fetch order. This conservative mode is called in-order commit. Hence, if an exception occurs, the computer can point to the last instruction executed, and the only registers updated will be those written
by instructions before the instruction causing the exception. Although the front end (fetch and issue) and the back end (commit) of the pipeline run in order, the functional units are free to initiate execution whenever the data they need are available. Today, all dynamically scheduled pipelines use in-order commit.

in-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.
Dynamic scheduling is often extended by including hardware-based
speculation, especially for branch outcomes. By predicting the direction of a
branch, a dynamically scheduled processor can continue to fetch and execute
instructions along the predicted path. Because the instructions are committed
in order, we know whether the branch was correctly predicted before any
instructions from the predicted path are committed. A speculative, dynamically
scheduled pipeline can also support speculation on load addresses, allowing load-
store reordering, and using the commit unit to avoid incorrect speculation. In the
next section, we will look at the use of dynamic scheduling with speculation in
the Intel Core i7 design.

Understanding Program Performance: Given that compilers can also schedule code around data dependences, you might ask why a superscalar processor would use dynamic scheduling. There are three major reasons. First, not all stalls are predictable. In particular, cache misses (see Chapter 5) in the memory hierarchy cause unpredictable stalls. Dynamic scheduling allows the processor to hide some of those stalls by continuing to execute instructions while waiting for the stall to end.
Second, if the processor speculates on branch outcomes using dynamic branch
prediction, it cannot know the exact order of instructions at compile time, since
it depends on the predicted and actual behavior of branches. Incorporating
dynamic speculation to exploit more instruction-level parallelism (ILP) without
incorporating dynamic scheduling would significantly restrict the benefits of
speculation.
Third, as the pipeline latency and issue width change from one implementation
to another, the best way to compile a code sequence also changes. For example, how
to schedule a sequence of dependent instructions is affected by both issue width
and latency. The pipeline structure affects both the number of times a loop must be
unrolled to avoid stalls as well as the process of compiler-based register renaming.
Dynamic scheduling allows the hardware to hide most of these details. Thus, users
and software distributors do not need to worry about having multiple versions of
a program for different implementations of the same instruction set. Similarly, old
legacy code will get much of the benefit of a new implementation without the need
for recompilation.

The BIG Picture: Both pipelining and multiple-issue execution increase peak instruction throughput and attempt to exploit instruction-level parallelism (ILP). Data and control dependences in programs, however, offer an upper limit on sustained performance because the processor must sometimes wait for a dependence to be resolved. Software-centric approaches to exploiting ILP rely on the ability of the compiler to find and reduce the effects of such dependences, while hardware-centric approaches rely on extensions to the pipeline and issue mechanisms. Speculation, performed by the compiler or the hardware, can increase the amount of ILP that can be exploited via prediction, although care must be taken since speculating incorrectly is likely to reduce performance.

Hardware/Software Interface: Modern, high-performance microprocessors are capable of issuing several instructions per clock; unfortunately, sustaining that issue rate is very difficult. For example, despite the existence of processors with four to six issues per clock, very few applications can sustain more than two instructions per clock. There are two primary reasons for this.
First, within the pipeline, the major performance bottlenecks arise from
dependences that cannot be alleviated, thus reducing the parallelism among
instructions and the sustained issue rate. Although little can be done about true
data dependences, often the compiler or hardware does not know precisely whether
a dependence exists or not, and so must conservatively assume the dependence
exists. For example, code that makes use of pointers, particularly in ways that
may lead to aliasing, will lead to more implied potential dependences. In contrast,
the greater regularity of array accesses often allows a compiler to deduce that no
dependences exist. Similarly, branches that cannot be accurately predicted whether
at runtime or compile time will limit the ability to exploit ILP. Often, additional
ILP is available, but the ability of the compiler or the hardware to find ILP that may
be widely separated (sometimes by the execution of thousands of instructions) is
limited.
Second, losses in the memory hierarchy (the topic of Chapter 5) also limit the
ability to keep the pipeline full. Some memory system stalls can be hidden, but
limited amounts of ILP also limit the extent to which such stalls can be hidden.

Energy Efficiency and Advanced Pipelining


The downside to the increasing exploitation of instruction-level parallelism via
dynamic multiple issue and speculation is potential energy inefficiency. Each
innovation was able to turn more transistors into performance, but they often did
so very inefficiently. Now that we have collided with the power wall, we are seeing
designs with multiple processors per chip where the processors are not as deeply pipelined or as aggressively speculative as their predecessors.
The belief is that while the simpler processors are not as fast as their sophisticated
brethren, they deliver better performance per Joule, so that they can deliver more
performance per chip when designs are constrained more by energy than they are
by the number of transistors.
Figure 4.71 shows the number of pipeline stages, the issue width, speculation
level, clock rate, cores per chip, and power of several past and recent Intel
microprocessors. Note the drop in pipeline stages and power as companies switch
to multicore designs.

Microprocessor              Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-Order/Speculation  Cores/Chip  Power
Intel 486                   1989  25 MHz      5                1            No                        1           5 W
Intel Pentium               1993  66 MHz      5                2            No                        1           10 W
Intel Pentium Pro           1997  200 MHz     10               3            Yes                       1           29 W
Intel Pentium 4 Willamette  2001  2000 MHz    22               3            Yes                       1           75 W
Intel Pentium 4 Prescott    2004  3600 MHz    31               3            Yes                       1           103 W
Intel Core                  2006  2930 MHz    14               4            Yes                       2           75 W
Intel Core i5 Nehalem       2010  3300 MHz    14               4            Yes                       2–4         87 W
Intel Core i5 Ivy Bridge    2012  3400 MHz    14               4            Yes                       8           77 W

FIGURE 4.71 Record of Intel Microprocessors in terms of pipeline complexity, number of cores, and power. The Pentium
4 pipeline stages do not include the commit stages. If we included them, the Pentium 4 pipelines would be even deeper.
