M3.5 Instruction-Level Parallelism
Elaboration: The difficulty of always associating the proper exception with the correct instruction in pipelined computers has led some computer designers to relax this requirement in noncritical cases. Such processors are said to have imprecise interrupts or imprecise exceptions. In the example above, the PC would normally have 58hex at the start of the clock cycle after the exception is detected, even though the offending instruction is at address 4Chex. A processor with imprecise exceptions might put 58hex into the ELR and leave it up to the operating system to determine which instruction caused the problem. LEGv8 and the vast majority of computers today support precise interrupts or precise exceptions. One reason is that designers of a deeper pipeline might be tempted to record a different value in the ELR, which would create headaches for the OS. To prevent such problems, the deeper pipeline would likely be required to record the same PC that would have been recorded in the five-stage pipeline; it is simpler for everyone to just record the PC of the faulting instruction instead. (Another reason is to support virtual memory, which we shall see in Chapter 5.)

imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.

precise interrupt Also called precise exception. An interrupt or exception that is always associated with the correct instruction in pipelined computers.

Elaboration: We show that LEGv8 uses the exception entry address
0000 0000 1C09 0000hex, which is based on the ARMv8 Model Architecture. ARMv8
can have different exception entry addresses, depending on the platform.
Elaboration: The LEGv8 architecture has three levels of exception, each with its own ELR and ESR registers, as we'll see in Chapter 5.
instruction-level parallelism The parallelism among instructions.

move from a four-stage to a six-stage pipeline. To get the full speed-up, we need to rebalance the remaining steps so they are the same length, in processors or in laundry. The amount of parallelism being exploited is higher, since there are more operations being overlapped. Performance is potentially greater since the clock cycle can be shorter.
Another approach is to replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage. The general name for this technique is multiple issue. A multiple-issue laundry would replace our household washer and dryer with, say, three washers and three dryers. You would also have to recruit more assistants to fold and put away three times as much laundry in the same amount of time. The downside is the extra work to keep all the machines busy and to transfer the loads to the next pipeline stage.

multiple issue A scheme whereby multiple instructions are launched in one clock cycle.
Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, stated alternatively, the CPI to be less than 1. As mentioned in Chapter 1, it is sometimes useful to flip the metric and use IPC, or instructions per clock cycle. Hence, a 3-GHz four-way multiple-issue microprocessor can execute at a peak rate of 12 billion instructions per second and have a best-case CPI of 0.25, or an IPC of 4. Assuming a five-stage pipeline, such a processor would have 20 instructions in execution at any given time. Today's high-end microprocessors attempt to issue from three to six instructions in every clock cycle. Even moderate designs will aim at a peak IPC of 2. There are typically, however, many constraints on what types of instructions may be executed simultaneously, and what happens when dependences arise.

There are two main ways to implement a multiple-issue processor, with the major difference being the division of work between the compiler and the hardware. Because the division of work dictates whether decisions are being made statically (that is, at compile time) or dynamically (that is, during execution), the approaches are sometimes called static multiple issue and dynamic multiple issue. As we will see, both approaches have other, more commonly used names, which may be less precise or more restrictive.

static multiple issue An approach to implementing a multiple-issue processor where many decisions are made by the compiler before execution.

dynamic multiple issue An approach to implementing a multiple-issue processor where many decisions are made during execution by the processor.

Two primary and distinct responsibilities must be dealt with in a multiple-issue pipeline:
1. Packaging instructions into issue slots: how does the processor determine how many instructions and which instructions can be issued in a given clock cycle? In most static issue processors, this process is at least partially handled by the compiler; in dynamic issue designs, it is normally dealt with at runtime by the processor, although the compiler will often have already tried to help improve the issue rate by placing the instructions in a beneficial order.

2. Dealing with data and control hazards: in static issue processors, the compiler handles some or all of the consequences of data and control hazards statically. In contrast, most dynamic issue processors attempt to alleviate at least some classes of hazards using hardware techniques operating at execution time.

issue slots The positions from which instructions could issue in a given clock cycle; by analogy, these correspond to positions at the starting blocks for a sprint.
Since speculation can improve performance when done properly and decrease
performance when done carelessly, significant effort goes into deciding when it
is appropriate to speculate. Later in this section, we will examine both static and
dynamic techniques for speculation.
FIGURE 4.66 Static two-issue pipeline in operation. The ALU and data transfer instructions
are issued at the same time. Here we have assumed the same five-stage structure as used for the single-issue
pipeline. Although this is not strictly necessary, it does have some advantages. In particular, keeping the
register writes at the end of the pipeline simplifies the handling of exceptions and the maintenance of a
precise exception model, which become more difficult in multiple-issue processors.
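For illustration, one issue packet in such a pipeline pairs an ALU instruction with a data transfer instruction (the registers here are ours, chosen arbitrarily):

ADD  X5, X6, X7       // ALU slot
LDUR X8, [X20,#0]     // data transfer slot, issued in the same clock cycle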
To issue an ALU and a data transfer operation in parallel, the first need for
additional hardware—beyond the usual hazard detection and stall logic—is extra
ports in the register file (see Figure 4.67). In one clock cycle, we may need to read
two registers for the ALU operation and two more for a store, and we also need one write port for an ALU operation and one write port for a load. Since the ALU is tied
up for the ALU operation, we also need a separate adder to calculate the effective
address for data transfers. Without these extra resources, our two-issue pipeline
would be hindered by structural hazards.
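In the worst case, then, a single clock cycle needs four register read ports (two sources for the ALU operation, plus the base and data registers of a store) and two write ports (one for the ALU result and one for the load result).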
use latency Number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.

Clearly, this two-issue processor can improve performance by up to a factor of two! Doing so, however, requires that twice as many instructions be overlapped in execution, and this additional overlap increases the relative performance loss from data and control hazards. For example, in our simple five-stage pipeline, loads have a use latency of one clock cycle, which prevents one instruction from using the result without stalling. In the two-issue, five-stage pipeline the result of a load instruction cannot be used on the next clock cycle. This means that the next two instructions cannot use the load result without stalling. Furthermore, ALU instructions that had no use latency in the simple five-stage pipeline now have a one-instruction use latency, since the results cannot be used in the paired load or store. To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compiler or hardware scheduling techniques are needed, and static multiple issue requires that the compiler take on this role.
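A sketch of this constraint (the instructions and registers are ours, for illustration only):

ADD  X5, X6, X7       // cycle 1, ALU slot: independent of the load
LDUR X0, [X20,#0]     // cycle 1, load/store slot
SUB  X8, X9, X10      // cycle 2, ALU slot: may not use X0 yet
LDUR X11, [X20,#8]    // cycle 2, load/store slot: may not use X0 yet
ADD  X1, X0, X2       // cycle 3: earliest instruction that can use X0 without stalling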
FIGURE 4.67 A static two-issue datapath. The additions needed for double issue are highlighted: another 32 bits from instruction
memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles address
calculations for data transfers and the top ALU handles everything else.
FIGURE 4.68 The scheduled code as it would look on a two-issue LEGv8 pipeline. The
empty slots are no-ops.
An important compiler technique to get more performance from loops is loop unrolling, where multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping instructions from different iterations.

loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.
ANSWER

To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of LDUR, ADD, and STUR, plus one SUBI, one CMP, and one CBZ. Figure 4.69 shows the unrolled and scheduled code.
During the unrolling process, the compiler introduced additional registers (X1, X2, X3). The goal of this process, called register renaming, is to eliminate dependences that are not true data dependences, but could either lead to potential hazards or prevent the compiler from flexibly scheduling the code. Consider how the unrolled code would look using only X0. There would be repeated instances of LDUR X0, [X20,#0], ADD X0, X0, X21 followed by STUR X0, [X20,#8], but these sequences, despite using X0, are actually completely independent; no data values flow between one set of these instructions and the next set. This case is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence (also called a true dependence).

register renaming The renaming of registers by the compiler or hardware to remove antidependences.

antidependence Also called name dependence. An ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.

Renaming the registers during the unrolling process allows the compiler to subsequently move these independent instructions so as to better schedule the code. The renaming process eliminates the name dependences, while preserving the true dependences.
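A minimal sketch of two of the four copies makes the point (the offsets are ours, for illustration). Using only X0, each copy must wait for the previous copy to finish with the name X0:

LDUR X0, [X20,#16]
ADD  X0, X0, X21
STUR X0, [X20,#16]
LDUR X0, [X20,#8]     // antidependent: reuses the name X0, so it cannot move up
ADD  X0, X0, X21
STUR X0, [X20,#8]

After renaming the second copy to use X1, the two copies are visibly independent and the compiler is free to interleave them:

LDUR X0, [X20,#16]
LDUR X1, [X20,#8]     // renamed: free to move above the first copy's ADD and STUR
ADD  X0, X0, X21
ADD  X1, X1, X21
STUR X0, [X20,#16]
STUR X1, [X20,#8]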
FIGURE 4.69 The unrolled and scheduled code of Figure 4.68 as it would look on a static
two-issue LEGv8 pipeline. The empty slots are no-ops. Since the first instruction in the loop decrements
X20 by 32, the addresses loaded are the original value of X20, then that address minus 8, minus 16, and
minus 24.
Notice now that 14 of the 15 instructions in the loop execute as pairs. It takes
eight clocks for four loop iterations, which yields an IPC of 15/8 = 1.88. Loop
unrolling and scheduling more than doubled performance—8 versus 20 clock
cycles for 4 iterations—partly from reducing the loop control instructions and
partly from dual issue execution. The cost of this performance improvement is
using four temporary registers rather than one, as well as more than doubling
the code size.
Dynamic pipeline scheduling chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls. Let’s start with a simple example of avoiding a data hazard. Consider
the following code sequence:
LDUR X0, [X21,#20]
ADD X1, X0, X2
SUB X23, X23, X3
ANDI X5, X23, #20
Even though the SUB instruction is ready to execute, it must wait for the LDUR
and ADD to complete first, which might take many clock cycles if memory is slow.
(Chapter 5 explains cache misses, the reason that memory accesses are sometimes
very slow.) Dynamic pipeline scheduling allows such hazards to be avoided either
fully or partially.
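Annotating the sequence with its dependences (the annotations are ours) shows what dynamic scheduling can exploit:

LDUR X0, [X21,#20]    // may wait many cycles on a cache miss
ADD  X1, X0, X2       // true dependence on X0: must wait for the load
SUB  X23, X23, X3     // independent of the load: can execute while LDUR waits
ANDI X5, X23, #20     // depends only on the SUB: can also complete early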
FIGURE 4.70 The three primary units of a dynamically scheduled pipeline. The final step of
updating the state is also called retirement or graduation.
by instructions before the instruction causing the exception. Although the front
end (fetch and issue) and the back end (commit) of the pipeline run in order, the
functional units are free to initiate execution whenever the data they need are
available. Today, all dynamically scheduled pipelines use in-order commit.
Dynamic scheduling is often extended by including hardware-based
speculation, especially for branch outcomes. By predicting the direction of a
branch, a dynamically scheduled processor can continue to fetch and execute
instructions along the predicted path. Because the instructions are committed
in order, we know whether the branch was correctly predicted before any
instructions from the predicted path are committed. A speculative, dynamically
scheduled pipeline can also support speculation on load addresses, allowing load-
store reordering, and using the commit unit to avoid incorrect speculation. In the
next section, we will look at the use of dynamic scheduling with speculation in
the Intel Core i7 design.
Understanding Program Performance

Given that compilers can also schedule code around data dependences, you might ask why a superscalar processor would use dynamic scheduling. There are three major reasons. First, not all stalls are predictable. In particular, cache misses (see Chapter 5) in the memory hierarchy cause unpredictable stalls. Dynamic scheduling allows the processor to hide some of those stalls by continuing to execute instructions while waiting for the stall to end.
Second, if the processor speculates on branch outcomes using dynamic branch
prediction, it cannot know the exact order of instructions at compile time, since
it depends on the predicted and actual behavior of branches. Incorporating
dynamic speculation to exploit more instruction-level parallelism (ILP) without
incorporating dynamic scheduling would significantly restrict the benefits of
speculation.
Third, as the pipeline latency and issue width change from one implementation
to another, the best way to compile a code sequence also changes. For example, how
to schedule a sequence of dependent instructions is affected by both issue width
and latency. The pipeline structure affects both the number of times a loop must be unrolled to avoid stalls and the process of compiler-based register renaming.
Dynamic scheduling allows the hardware to hide most of these details. Thus, users
and software distributors do not need to worry about having multiple versions of
a program for different implementations of the same instruction set. Similarly, old
legacy code will get much of the benefit of a new implementation without the need
for recompilation.
FIGURE 4.71 Record of Intel Microprocessors in terms of pipeline complexity, number of cores, and power. The Pentium
4 pipeline stages do not include the commit stages. If we included them, the Pentium 4 pipelines would be even deeper.