Time Base Circuit

This document summarizes a paper titled "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation" that was presented at the 36th Annual International Symposium on Microarchitecture (MICRO-36) in December 2003. The paper proposes a new approach called Razor for dynamic voltage scaling based on dynamic detection and correction of circuit timing errors. Razor monitors error rates during operation to tune the supply voltage, eliminating the need for voltage margins. It introduces a Razor flip-flop that double samples values with a fast and delayed clock, and uses a comparator to detect timing errors and trigger error recovery to restore correct program state. Analyses show Razor can achieve substantial energy savings for circuits

Appears in the 36th Annual International Symposium on Microarchitecture (MICRO-36), December 2003.

Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation

Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham,
Conrad Ziesler, David Blaauw, Todd Austin, Krisztian Flautner¹, and Trevor Mudge

Advanced Computer Architecture Lab
The University of Michigan
1301 Beal Ave
Ann Arbor, MI 48109

¹ARM Ltd
110 Fulbourn Road
Cambridge, UK CB1 9NJ

Abstract

With increasing clock frequencies and silicon integration, power-aware computing has become a critical concern in the design of embedded processors and systems-on-chip. One of the more effective and widely used methods for power-aware computing is dynamic voltage scaling (DVS). In order to obtain the maximum power savings from DVS, it is essential to scale the supply voltage as low as possible while ensuring correct operation of the processor. The critical voltage is chosen such that under a worst-case scenario of process and environmental variations, the processor always operates correctly. However, this approach leads to a very conservative supply voltage since such a worst-case combination of different variabilities will be very rare. In this paper, we propose a new approach to DVS, called Razor, based on dynamic detection and correction of circuit timing errors. The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation, thereby eliminating the need for voltage margins and exploiting the data dependence of circuit delay. A Razor flip-flop is introduced that double-samples pipeline stage values, once with a fast clock and again with a time-borrowing delayed clock. A metastability-tolerant comparator then validates latch values sampled with the fast clock. In the event of a timing error, a modified pipeline mispeculation recovery mechanism restores correct program state. A prototype Razor pipeline was designed in 0.18 µm technology and was analyzed. Razor energy overheads during normal operation are limited to 3.1%. Analyses of a full-custom multiplier and a SPICE-level Kogge-Stone adder model reveal that substantial energy savings are possible for these devices (up to 64.2%) with little impact on performance due to error recovery (less than 3%).

1 Introduction

A critical concern for embedded systems is the need to deliver high levels of performance given ever-diminishing power budgets. This is evident in the evolution of the mobile phone: in the last 7 years mobile phones have shown a 50X improvement in talk-time per gram of battery¹, while at the same time taking on new computational tasks that only recently appeared on desktop computers, such as 3D graphics, audio/video, internet access, and gaming. As the breadth of applications for these devices widens, a single operating point is no longer sufficient to efficiently meet their processing and power consumption requirements. For example, MPEG video playback requires an order-of-magnitude higher performance than playing MP3s. However, running at the performance level necessary for video is energy-inefficient for audio. The gap between high performance and low power can be bridged through the use of dynamic voltage scaling (DVS) [16], where periods of low processor utilization are exploited by lowering the clock frequency to the minimum required level, allowing corresponding reduction in the supply voltage. Since dynamic energy scales quadratically with supply voltage, significant reduction in energy use can be obtained [14].

¹ Comparison of standard configurations of Nokia 232 and Ericsson T68 phones.

Enabling systems to run at multiple frequency and voltage levels is a challenging process and requires characterization of the processor to ensure that its operation remains correct at the required operating points. The minimum possible supply voltage that results in correct operation is referred to as the critical supply voltage. The critical supply voltage must be sufficient to ensure correct operation in the face of a number of environmental and process related variabilities that can impact circuit performance. These include unexpected voltage drops in the power supply network, temperature fluctuations, gate-length and doping concentration variations, cross-coupling noise, etc. These variabilities may be data dependent, meaning that they exhibit their worst-case impact on circuit performance only under certain instruction and data sequences, and are composed of both local and global components. For instance, local process variations will impact specific regions of the die in different and independent ways, while global process variation impacts the circuit performance of the entire die and creates variation from one die to the next. Similarly, temperature and supply drop have local and global components, while cross-coupling noise is a predominantly local effect.

To ensure correct operation under all possible variations, a conservative supply voltage is typically selected at design-time using corner analysis. Hence, margins are added to the critical voltage to account for uncertainty in the circuit models and to account for the worst-case combination of variabilities. However, such a worst-case combination of variabilities may be very rare or even impossible in a particular instance of a chip, making this approach overly conservative. And, with process scaling, the environmental and process variabilities are expected to increase, worsening the required voltage margins.

To allow for more aggressive power reduction, the supply voltage can be tuned to an individual processor chip using embedded inverter delay chains [5]. The delay of the inverter chain is used as a prediction of the critical path delay of the circuit and the supply voltage is tuned during processor operation to meet a predetermined delay through the inverter-chain. This approach to DVS has the advantage that it dynamically adjusts the operating voltage to account for global variations in supply voltage drop, temperature fluctuation, and process variations. However, it cannot account for local variations, such as local supply voltage drops, intra-die process variations, and cross-coupled noise, and therefore requires the addition of safety margins to the critical voltage. Also, the delay of an inverter chain does not scale with voltage and temperature in the same way as the delays of the critical paths of the actual design, which can contain complex gates and pass-transistor logic, which again necessitate extra voltage safety margins. In future technologies, the local component of environmental and process variation is expected to become more prominent and, as noted in [6], the sensitivity of circuit performance to these variations is higher at lower operating voltages, thereby increasing the necessary margins and reducing the scope for energy savings.

In this paper, we propose a new approach to DVS, referred to as Razor, which is based on dynamic detection and correction of speed path failures in digital designs. The key idea of Razor is to tune the supply voltage by monitoring the error rate during operation. Since this error detection provides in-situ monitoring of the actual circuit delay, it accounts for both global and local delay variations and does not suffer from voltage scaling disparities. It therefore eliminates the need for voltage margins that are necessary for “always-correct” circuit operation in traditional designs. In addition, a key feature of Razor is that operation at sub-critical supply voltages does not constitute a catastrophic failure, but instead represents a trade-off between the power penalty incurred from error correction against additional power savings obtained from operating at a lower supply voltage.

It was previously observed that circuit delay is strongly data dependent, and only exhibits its worst-case delay for very specific instruction and data sequences [24]. From this it can be conjectured that for moderately sub-critical supply voltages only a few critical instructions will fail, while a majority of instructions will continue to operate correctly. Our hardware measurements and circuit simulation studies support this conjecture and demonstrate that the circuit operation degrades gracefully for sub-critical supply voltages, showing a gradual increase in the error rate. The proposed Razor approach automatically exploits this data-dependence of circuit delay by tuning the supply voltage to obtain a small, but non-zero error rate. It was found that if the error rate is maintained sufficiently low, the power overhead from error correction is minimal, while substantial power savings are obtained due to operating the circuit at a lower supply voltage. Note that as the processor executes different sets of instructions, the supply voltage automatically adjusts to the delay characteristics of the executed instruction sequence, lowering the supply voltage for instruction sequences with many non-critical instructions, and raising the supply voltage for instruction sequences that are more delay intensive.

We propose a combination of circuit and architectural techniques for low-cost in-situ error detection and correction of delay failures. At the circuit level, each delay-critical flip-flop is augmented with a so-called shadow latch which is controlled using a delayed clock. The operating voltage is constrained such that the worst-case delay is guaranteed to meet the shadow latch setup time, even though the main flip-flop could fail. By comparing the values latched by the flip-flop and the shadow latch, a delay error in the main flip-flop is detected. The value in the shadow latch, which is guaranteed to be correct, is then utilized to correct the delay failure. We present several architectural solutions for error correction, ranging from simple clock gating to more sophisticated mechanisms that augment the existing mispeculation recovery infrastructure.

The proposed Razor technique was implemented in a prototype 64-bit Alpha processor design. This prototype implementation was used to obtain a realistic prediction of the power overhead for in-situ error correction and detection. We also studied the error-rate trends for datapath components using both circuit-level simulation as well as silicon measurements of a full-custom multiplier block. Architectural simulations were then performed to analyze the overall throughput and power characteristics of Razor-based DVS for different benchmark test programs. We demonstrate that on average, Razor reduced simulated power consumption by more than 40%, compared to traditional design-time DVS and delay-chain based approaches.

The remainder of this paper is organized as follows. In Section 2, we present the implementation of Razor, providing a detailed description of both the proposed circuit and architectural techniques. In Section 3, we discuss the simulation framework for Razor-based DVS and present error rate studies and our simulation results. In Section 4 we present a detailed survey of prior work in DVS. Finally, in Section 5, we draw our conclusions.

2 Razor Error Detection/Correction

Razor relies on a combination of architectural and circuit level techniques for efficient error detection and correction of delay path failures. The concept of Razor is illustrated in Figure 1(a) for a pipeline stage. Each flip-flop in the design is augmented with a so-called shadow latch which is controlled by a delayed clock. We illustrate the operation of a Razor flip-flop in Figure 1(b). In clock cycle 1, the combinational logic L1 meets the setup time by the rising edge of the clock and both the main flip-flop and the shadow latch will latch the correct data. In this case, the error signal at the output of the XOR gate remains low and the operation of the pipeline is unaltered.

[Figure 1: (a) a Razor flip-flop (main flip-flop, shadow latch clocked by clk_del, and comparator producing Error_L) between logic stages L1 and L2; (b) timing diagram over clock cycles 1 to 4 showing clock, clock_d, D, Error, and Q for instr 1 and instr 2.]
Figure 1. Pipeline augmented with Razor latches and control lines.

In cycle 2 in Figure 1(b), we show an example of the operation when the combinational logic exceeds the intended delay due to sub-critical voltage scaling. In this case, the data is not latched by the main flip-flop, but since the shadow latch operates using a delayed clock, it successfully latches the data some time in cycle 3. To guarantee that the shadow latch will always latch the input data correctly, the allowable operating voltage is constrained at design time such that under worst-case conditions, the logic delay does not exceed the setup time of the shadow latch. By comparing the valid data of the shadow latch with the data in the main flip-flop, an error signal is then generated in cycle 3 and in the subsequent cycle, cycle 4, the valid data in the shadow latch is restored into the main flip-flop and becomes available to the next pipeline stage L2. Note that the local error signals Error_L are OR’ed together to ensure that the data in all flip-flops is restored even when only one of the Razor flip-flops generates an error.
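
The double-sampling behaviour described above can be sketched as a small behavioural model. This is only an illustration under simplifying assumptions that are not taken from the paper: setup times and meta-stability are ignored, a late main flip-flop is assumed to simply keep its old value, and the names `razor_sample`, `stage_error`, and all timing numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main: int     # value captured by the main flip-flop at the clock edge
    shadow: int   # value captured by the shadow latch at the delayed edge
    error: bool   # comparator (XOR) output for this flip-flop


def razor_sample(old_value, new_value, arrival, period, clk_delay):
    """Sample one stage output with a (simplified) Razor flip-flop.

    arrival   -- time after launch at which the logic output settles
    period    -- clock period; the main flip-flop samples at t = period
    clk_delay -- shadow-latch clock delay; it samples at period + clk_delay
    """
    if arrival > period + clk_delay:
        # The design-time voltage constraint is meant to rule this out.
        raise ValueError("logic delay exceeds the shadow latch sampling point")
    # Assumption: a late main flip-flop still holds the previous value.
    main = new_value if arrival <= period else old_value
    shadow = new_value  # guaranteed correct by the constraint checked above
    return RazorSample(main, shadow, error=(main != shadow))


def stage_error(samples):
    """OR the local error signals of all Razor flip-flops in a stage."""
    return any(s.error for s in samples)


# Hypothetical timings in units of one clock period.
on_time = razor_sample(0, 1, arrival=0.8, period=1.0, clk_delay=0.5)
late = razor_sample(0, 1, arrival=1.2, period=1.0, clk_delay=0.5)

if stage_error([on_time, late]):
    # Recovery: restore shadow-latch data into every main flip-flop.
    restored = [s.shadow for s in (on_time, late)]
    print("stage error, restored values:", restored)
```

In this model one late flip-flop is enough to raise the stage-level error, and restoring from the shadow latches yields the values the logic eventually produced.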
If an error occurs in pipeline stage L1 in a particular clock cycle, the data in L2 in the following clock cycle is incorrect and must be flushed from the pipeline using one of the pipeline control methods described in Section 2.2. However, since the shadow latch contains the correct output data of pipeline stage L1, the instruction does not need to be re-executed through this failing stage. Thus, a key feature of Razor is that if an instruction fails in a particular pipeline stage it is re-executed through the following pipeline stage, while incurring a one cycle penalty. The proposed approach therefore guarantees forward progress of a failing instruction, which is essential to avoid the perpetual failure of an instruction at a particular stage in the pipeline.

In addition to invalidating the data in the following pipeline stage, an error must also stall the preceding pipeline stages while the shadow latch data is restored into the main flip-flops. A number of different methods, such as clock gating or flushing the instruction in the preceding stages, were examined to accomplish this and are discussed in Section 2.2. The proposed approach also raises a number of circuit related issues. The Razor flip-flop must be constructed such that the power and delay overhead is minimized. Also, the presence of the delayed clock introduces a new short-path constraint in the design. And finally, allowing the setup time of the main flip-flop to be exceeded raises the possibility of meta-stability. These issues are discussed in more detail in Section 2.1. In the proposed Razor-based DVS approach, the error signal is used to tune the supply voltage to its optimal value. In Section 2.3, we therefore discuss different algorithms to control the supply voltage based on the observed error rate.

In general, maximum power savings is obtained from Razor technology when it is applied to all parts of a microprocessor design. To accomplish this, we identify three distinct design challenges. The first design challenge, and the focus of this paper, is the detection and recovery of timing errors in combinational logic contained within pipeline datapaths, e.g., adders, shifters, and decode logic. The second design challenge is the application of Razor to on-chip SRAM structures. In SRAM structures, such as register files and caches, it is necessary to introduce Razor-compatible sense amplifiers and support for fast non-speculative stores. The third challenge is the use of Razor on pipeline control logic to restore correct program execution in the presence of incorrect control decisions.

For the sake of brevity and clarity, the focus of this paper is limited to the first design challenge, which is the use of Razor on combinational logic blocks contained within the pipeline datapaths. We therefore apply Razor to a simple embedded processor which utilizes an in-order pipeline with simple control and small caches. In such a processor, control logic and SRAM structures remain error-free, even at the worst-case frequency and voltage, and do not require Razor technology. However, to effectively apply Razor in large microprocessor designs with large caches and complex control logic, it will be necessary to apply Razor technology to all parts of the design. Therefore, in concert with the effort presented in this paper, we are developing Razor-compatible memory structures based on bit-line sampling and architectural modifications for reduced typical-case latency. For control logic, we are developing techniques to checkpoint control state to enable control logic recovery. These additional developments will be presented in future reports.

2.1 Circuit-level implementation issues

A key requirement for Razor-based DVS is that during error-free operation, the delay and power overhead due to the error detection and correction circuitry is minimal. Otherwise, the power gain from more aggressive voltage scaling is overcome by the power overhead due to the presence of the error detection and correction circuitry. In addition, the overhead of performing an error correction must also be minimized to enable efficient operation at moderate error rates. A number of methods were applied to reduce the power and delay overhead of the Razor flip-flop, shown in Figure 1. The multiplexer at the input of the Razor flip-flop results in a significant delay and power overhead, and was therefore moved to the feedback path of the master latch of the main flip-flop, as shown in Figure 2. Hence, it introduces only a slight increase in the capacitive loading of the critical path and has minimal impact on the performance and power of the design.

[Figure 2: schematic of the main flip-flop (D, Q, clk, clk_b), the shadow latch (clk_del, clk_del_b), and the meta-stability detector (Inv_n, Inv_p) producing Error_L.]
Figure 2. Reduced overhead Razor flip-flop and meta-stability detection circuits.

The power overhead of Razor is also reduced by the fact that in most cycles, the input of a flip-flop will not transition and only the power overhead from switching the delayed clock is incurred. To further minimize this additional clock power, the delayed clock is locally generated, reducing its routing capacitance. If the delayed clock is delayed by half the clock cycle, it can be derived by simply inverting the main clock. Also, many non-critical flip-flops in the design do not need Razor. If the maximum delay at the input of a flip-flop is guaranteed to meet the required cycle time under the worst-case sub-critical voltage, the flip-flop cannot fail and does not need to be replaced with a Razor flip-flop. It was found that in the prototype Alpha processor only 192 flip-flops out of a total of 2408 required Razor, thereby significantly reducing the power overhead of the Razor approach. For this prototype processor, the total power overhead in error-free operation (due to Razor flip-flops) was found to be less than 1%, while the delay overhead was negligible.

The use of a delayed clock at the shadow latch raises the possibility that a short path in the combinational logic will corrupt the data in the shadow latch. Figure 3 shows how a short path allows data launched at the start of a cycle to be latched into the shadow latch, instead of the data launched from the previous cycle. To prevent this corruption of the shadow latch data, a minimum-path length constraint is added at the input of each Razor flip-flop in the design. These minimum-path constraints result in the addition of buffers during logic synthesis to slow down fast paths and therefore introduce a certain power overhead. Figure 3 shows that the minimum-path constraint is equal to the clock delay t_delay plus the hold time t_hold of the shadow latch (which is typically a small negative value). A large clock delay increases the severity of the short-path constraint and therefore increases the power overhead due to the need for additional buffers. On the other hand, a small clock delay reduces the margin between the main flip-flop and the shadow latch, and hence reduces the amount by which the supply voltage can be dropped below the critical supply voltage. The clock delay therefore presents a trade-off between the power overhead incurred from short-path correction and the degree of possible power saving from sub-critical voltage operation. In the prototype 64-bit Alpha design, the clock delay was set at 1/2 the clock period. This simplified the generation of the delayed clock while the short-path constraints could still be easily met and resulted in a power overhead (due to buffers) of less than 3%.

In subcritical voltage operation, it is possible that the data at the input of the main latch transitions at the same time as the clock. This can give rise to meta-stability of the main flip-flop, where the output voltage does not resolve to a definite high or low voltage, but instead hovers near Vdd/2 [4]. The danger of meta-stability is that different fan-out gates may interpret this indeterminate voltage level as different logic states, or may even enter a meta-stable state themselves. It is important to note that, since the minimum sub-critical voltage is constrained such that the setup time of the shadow latch is always met, the shadow latch is stable and cannot exhibit meta-stability.

[Figure 3: an intended path and a short path into a Razor flip-flop, with clock and clock_del waveforms marking t_delay and t_hold. Constraint shown: Min. Path Delay > t_delay + t_hold.]
Figure 3. Short Paths Constraints.

[Figure 4: (a) pipeline organization IF, ID, EX, MEM, ST, WB with a PC, Razor FFs between stages, a stabilizer FF, and per-stage error and recover signals gated by the clock; (b) pipeline timing, annotated "correct EX value provided to MEM":

  Time (in cycles):  1    2    3    4    5     6      7    8    9
  Instruction 1:     IF   ID   EX   MEM  ST    stall  WB
  Instruction 2:          IF   ID   EX*  MEM*  MEM    ST   WB
  Instruction 3:               IF   ID   EX    stall  MEM  ST   WB
  Instruction 4:                    IF   ID    stall  EX   MEM  ST
]
Figure 4. Pipeline recovery using global clock gating. Figure a) shows the pipeline organization, Figure b) illustrates the pipeline timing for a failure in the EX stage of the pipeline. The “*” denotes a failed stage computation.
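
The short-path constraint of Figure 3 (minimum path delay greater than the clock delay plus the shadow latch hold time) lends itself to a quick numerical check. The sketch below is illustrative only: the function names, the path names, and every timing value (integers in picoseconds) are hypothetical and not taken from the prototype. The clock delay is set to half the clock period, as in the design choice described above.

```python
def meets_min_path(min_path_delay_ps, t_delay_ps, t_hold_ps):
    """True if the shortest path satisfies min delay > t_delay + t_hold."""
    return min_path_delay_ps > t_delay_ps + t_hold_ps


def padding_shortfall(min_path_delay_ps, t_delay_ps, t_hold_ps):
    """Delay (ps) by which a short path falls short of t_delay + t_hold.

    Returns 0 if the path is already at or above the bound. Buffers added
    during synthesis would have to cover more than this amount, since the
    constraint is a strict inequality.
    """
    return max(0, (t_delay_ps + t_hold_ps) - min_path_delay_ps)


# Hypothetical numbers: 10 ns clock period, delayed clock at half the period,
# and a small negative hold time for the shadow latch.
period_ps = 10_000
t_delay_ps = period_ps // 2
t_hold_ps = -100

# Hypothetical shortest-path delays into two Razor flip-flops.
paths_ps = {"bypass": 3_200, "adder_lsb": 6_000}

report = {
    name: (
        meets_min_path(delay, t_delay_ps, t_hold_ps),
        padding_shortfall(delay, t_delay_ps, t_hold_ps),
    )
    for name, delay in paths_ps.items()
}

for name, (ok, shortfall) in report.items():
    print(f"{name}: ok={ok}, shortfall={shortfall} ps")
```

Raising `t_delay_ps` in this sketch makes more paths fail the check and increases the padding they need, which mirrors the trade-off between buffer overhead and the usable sub-critical voltage range discussed in the text.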
However, if the main flip-flop is meta-stable, it is impossible to determine if its latched value is correct or not using the XOR gate in Figure 2. Hence, we include a meta-stability detector circuit in the Razor flip-flop which detects the presence of meta-stable voltage levels, as shown in Figure 2. A detected meta-stability event is corrected the same way as a regular delay failure, and results in the stable and correct data value from the shadow latch being restored in the main flip-flop. For simplicity, the meta-stability detector in Figure 2 is constructed using two inverter gates with different skewed P/N ratios, such that they switch at different voltage levels. If the two inverters interpret the result differently, the flip-flop voltage is not definitive and may be meta-stable. Note that any suitable comparator circuit could be utilized and that these meta-stability events do not result in a failure of the system but are corrected using the existing Razor error correction infrastructure.

However, it is well known that complete system failure due to meta-stability cannot be completely avoided and only its probability of occurrence can be reduced to negligible levels [4]. In the proposed Razor design, this manifests itself in the small but finite probability that the error signal itself becomes meta-stable. This could occur if the main flip-flop output voltage was near the edge of the meta-stable voltage range and, hence, the meta-stability detector was unable to determine if a meta-stability event occurred or not. In this case, the error signal will not resolve to a definite voltage level and ambiguity will exist in the logic value of the error signal, possibly causing a failure in the error correction mechanism. A standard approach to reduce the probability of such an event to negligible levels is to double latch the signal. However, this would delay the detection of an error in the main flip-flop by one cycle, complicating the error recovery mechanism. We therefore employ at the same time an additional mechanism to detect meta-stable error signals, where the error signal is double latched using two skewed flip-flops. The probability that the outputs of the second set of flip-flops are meta-stable is hence reduced to a negligible level and by comparing their output values, the presence of a meta-stable error signal one cycle earlier can be reliably detected. Under normal operation, the error signal will resolve to a definite voltage level and the output values of the two skewed flip-flops will match, indicating that the performed error correction was executed correctly. However, in the unlikely event that the error signal is meta-stable, the outputs of the skewed latches will differ in the subsequent clock cycle, indicating that the error correction was unsafe and could have failed. In this case, a so-called panic signal is generated, which requires that the entire pipeline is flushed and restarted. In this case, guaranteed forward progress is lost, and the supply voltage level must be raised to avoid possible perpetual failure of the same instruction. However, the possibility of a meta-stable error signal is extremely small and does not constitute a significant burden on the power and performance of the processor. Also, only one set of double latches is needed for each pipeline stage, meaning that the power overhead during error-free operation is negligible.

2.2 Pipeline error recovery mechanisms

The pipeline error recovery mechanism must guarantee that, in the presence of Razor errors, register and memory state is not corrupted with an incorrect value. In this section, we highlight two possible approaches to implementing pipeline error recovery. The first is a simple but slow method based on clock gating, while the second method is a much more scalable technique based on counterflow pipelining.

Recovery using clock gating. Figure 4(a) illustrates a simple approach to pipeline error recovery based on global clock gating. In the event that any stage detects a Razor error, the entire pipeline is stalled for one cycle by gating the next global clock edge. The additional clock period allows every stage to recompute its result using the Razor shadow latch as input. Consequently, any previously forwarded errant values will be replaced with the correct value from the Razor shadow latch. Since all stages re-evaluate their result with the Razor shadow latch input, any number of errors can be tolerated in a single cycle and forward progress is guaranteed. If all stages produce an error each cycle, the pipeline will continue to run, but at 1/2 the normal speed.

It is imperative that errant pipeline results not be written to architected state before they have been validated by Razor. Since validation of Razor values takes two additional cycles (i.e., one for error detection and one for panic detection), there must be two non-speculative stages between the last Razor latch and the writeback (WB) stage. In our design, memory accesses to the data cache are non-speculative, hence, only one additional stage labeled ST for stabilize is required before writeback (WB). The ST stage introduces an additional level of register bypass. Since store instructions must execute non-speculatively, they are performed in the WB stage of the pipeline.

Figure 4(b) gives a pipeline timing diagram of a pipeline recovery for an instruction that fails in the EX stage of the pipeline. The first failed stage computation occurs in the 4th cycle, when the second instruction computes an incorrect result in the EX stage of the pipeline. This error is detected in the 5th cycle, but only after the MEM stage has computed an incorrect result using the errant value
forward from the EX stage. After the error is detected, a global clock stall occurs in the 6th cycle, permitting the correct EX result in the Razor shadow latch to be evaluated by the MEM stage. In the 7th cycle, normal pipeline operation resumes.

Recovery using counterflow pipelining. In aggressively clocked designs, it may not be possible to implement global clock gating without significantly impacting processor cycle time. Consequently, we have designed and implemented a fully pipelined error recovery mechanism based on counterflow pipelining techniques [19]. The approach, illustrated in Figure 5(a), places negligible timing constraints on the baseline pipeline design at the expense of extending pipeline recovery over a few cycles. When a Razor error

[Figure 5: (a) pipeline organization IF, ID, EX, MEM, ST, WB with a PC, Razor FFs between stages, a stabilizer FF, per-stage error, bubble, and recover signals, and flush control carrying flushID; (b) pipeline timing, annotated "Razor detects fault, forwards bubble toward WB, initiates flush toward IF" and "Pipeline flush completes".]
Figure 5. Pipeline recovery using counterflow pipelining. Figure a) shows the pipeline organization, Figure b) illustrates the pipeline timing for a failure in the EX stage of the pipeline. The “*” denotes a failed stage computation.

instruction to writeback. Panic situations complicate the guarantee of forward progress, as the delay in detecting the situation may result in the correct result being overwritten in the Razor shadow latch. Consequently, after experiencing a panic, the supply voltage is reset to a known-safe operating level, and the pipeline is restarted. Once re-tuned, the errant instruction should complete without errors as long as re-tuning is prohibited until after this instruction completes.

A key requirement of the pipeline recovery control is that it not fail under even the worst operating conditions (e.g., low voltage, high temperature and high process variation). This requirement is met through a conservative design approach that validates the timing of the error recovery circuits at the worst-case subcritical voltage.

2.3 Supply Voltage Control

Many of the parameters that affect voltage margin vary over time. Temperature margins will track ambient temperatures and can vary on-die with processing demands. Consequently, to optimize energy conservation it is desirable to introduce a voltage control system into the design. The voltage control system adjusts the supply voltage based on monitored error rates. If the error rate is very low, it could indicate circuit computation is finishing too quickly and voltage should be lowered. Similarly, a low error rate could indicate changes in the ambient environment (e.g., decreasing temperature), giving additional opportunity to lower voltage. Increasing error rates, on the other hand, indicate circuits are not meeting clock period constraints and voltage should be increased. The optimal error rate depends on a number of factors including the energy cost of error recovery and overall performance requirements, but in general it is a small non-zero error rate.

Figure 6 illustrates the Razor voltage control system. The control system works to maintain a constant error rate of E_ref. At regular intervals the error rate of the system is measured by resetting an error counter which is sampled after a fixed period of time. The computed error rate of the sample E_sample is then subtracted from the reference error rate to produce the error rate differential E_diff. E_diff is the input to the voltage control function, which sets the target volt-
is detected, two specific actions must be taken. First, the errant stage age of the voltage regulator. If Ediff is negative the system is experi-
computation following the failing Razor latch must be nullified. This ence too many errors, and voltage should be increased. If Ediff is
action is accomplished using the bubble signal, which indicates to positive the error rate is too low and voltage should be lowered. The
the next and subsequent stages that the pipeline slot is empty. Sec- magnitude of Ediff indicates the degree to which the system is “out of
ond, the flush train is triggered by asserting the stage ID of failing tune”.
stage. In the following cycle, the correct value from the Razor While control of this system may seem simple on the surface, it
shadow latch data is injected back into the pipeline, allowing the is complicated by the slow response time of the voltage regulator.
errant instruction to continue with its correct inputs. Additionally, Typical commercial voltage regulators can take 10’s of microsec-
the flush train begins propagating the ID of the failing stage in the onds to adjust supply voltage by 100 mV. Consequently, if the con-
opposite direction of instructions. At each stage visited by the active troller reacts too fast or too abruptly, the system could become
flush train, the corresponding pipeline stage and the one immedi- unstable or go into oscillation. Moreover, an overly conservative
ately preceding are replaced with a bubble. (Two stages must be nul- control function that is slow to react to changing system environ-
lified to account for the twice relative speed of the main pipeline.) ments will reduce the overall efficiency of the design. As a starting
When the flush ID reaches the start of the pipeline, the flush control point, we have implemented a proportional control system [15]
logic restarts the pipeline at the instruction following the errant which adjusts supply voltage in proportion to the sampled Ediff. To
instruction. In the event that multiple stages experience errors in the prevent the control system from over-reacting and potentially plac-
same cycle, all will initiate recovery but only the Razor error closest ing the system in an unstable state, the error sample rate is roughly
to writeback (WB) will complete. Earlier recoveries will be flushed equivalent to the minimum voltage step period.
by later ones.
Figure 5(b) shows a pipeline timing diagram of a pipelined 3 Experimental Evaluation
recovery for an instruction that fails in the EX stage. As in the previ- 3.1 Razor Pipeline Implementation
ous example, the first failed stage computation occurs in the 4th The proposed Razor error detection and correction approach
cycle, when the second instruction computes an incorrect result in was implemented in a 64-bit Alpha processor. The processor was
the EX stage of the pipeline. This error is detected in the 5th cycle, implemented using a simple in-order pipeline consisting of instruc-
causing a bubble to be propagated out of the MEM stage and initia- tion fetch, instruction decode, execute, and memory/writeback with
tion of the flush train. The instruction in the EX, ID and IF stages are 8 Kbytes of I-cache and D-cache. The implementation details, as
flushed in the 6th, 7th and 8th cycles, respectively. Finally, the pipe- well as a die picture, are shown below in Figure 7. The processor
line is restarted after the errant instruction in cycle 9, after which was implemented using a 0.18 µm process and is expected to operate
normal pipeline operation resumes. at 200 MHz. After careful performance analysis, it was found that
In the event a panic signal is asserted, all pipeline state is only the instruction decode and execute stages were critical at the
flushed and the pipeline is restarted immediately after the last worst-case voltage and frequency settings and hence required Razor
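[Figure 6 block diagram: Eref and the negated Esample enter a summing node (Σ) that produces Ediff = Eref - Esample. Ediff feeds the Voltage Control Function, which sets the Voltage Regulator; the regulator supplies Vdd to the Pipeline. The pipeline's error signals are accumulated (with a reset input) to form Esample; the pipeline also outputs a panic signal.]
Figure 6. Supply Voltage Control System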
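
The loop in Figure 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the controller used in the paper: the function name next_voltage, the gain, and the per-update step limit are hypothetical. Only the relation Ediff = Eref - Esample, its sign convention, and the 1.2 V to 1.8 V range of the prototype are taken from the text.

```python
def next_voltage(v_now, errors, cycles, e_ref,
                 gain=2.0, max_step=0.025, v_min=1.2, v_max=1.8):
    """One proportional-control update of the supply voltage target.

    gain (volts per unit of error-rate differential) and max_step are
    illustrative assumptions, not values taken from the paper.
    """
    e_sample = errors / cycles      # measured error rate over the interval
    e_diff = e_ref - e_sample       # Ediff = Eref - Esample
    # Negative Ediff: too many errors, so raise the voltage.
    # Positive Ediff: error rate below target, so lower the voltage.
    step = -gain * e_diff
    step = max(-max_step, min(max_step, step))   # limit abrupt changes
    return max(v_min, min(v_max, v_now + step))


if __name__ == "__main__":
    # 150 errors in a 5000-cycle window against a 1.5% target error rate.
    v = next_voltage(1.5, errors=150, cycles=5000, e_ref=0.015)
    print(f"new target: {v:.3f} V")
```

Clamping the step reflects the concern raised above that a controller reacting too abruptly to a slow regulator could oscillate; a real design would choose the gain and step from the regulator's measured response.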

(a) Implementation details

Technology node                                  0.18 µm
Voltage range                                    1.8 V to 1.2 V
Total number of logic gates                      45,661
D-cache size                                     8 KBytes
I-cache size                                     8 KBytes
Die size                                         3 x 3.3 mm
Clock frequency                                  200 MHz
Clock delay                                      2.5 ns
Total number of flip-flops                       2408
Number of Razor flip-flops                       192
Total number of delay buffers                    2498

Error free operation
Total power                                      425 mW
Standard FF energy (switching/static)            49 fJ / 95 fJ
Razor FF energy (switching/static)               60 fJ / 160 fJ
Total delay buffer power overhead                12.2 mW
% total power overhead                           3.1%

Error correction and recovery overhead
Energy per Razor FF per error event              210 fJ
Total energy per error event                     189 pJ
Razor FF recovery overhead at 10% error rate     1%

(b) [Die photo, 3 mm x 3.3 mm, with labeled regions: I-Cache, Register File, IF, ID, EX, MEM, WB, D-Cache.]

Figure 7. Razor prototype implementation details and die photo.
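
As a quick consistency check on Figure 7(a), the recovery overhead row can be reproduced from the other entries. The sketch below assumes (our simplification, not a statement from the paper) that at most one error event occurs per clock cycle, so the recovery power is the energy per error event times the error rate times the clock frequency. The constant and function names are illustrative.

```python
# Values copied from Figure 7(a).
TOTAL_POWER_W = 425e-3            # total power, error-free operation
CLOCK_HZ = 200e6                  # clock frequency
ENERGY_PER_ERROR_J = 189e-12      # total energy per error event
DELAY_BUFFER_POWER_W = 12.2e-3    # delay buffer power overhead


def recovery_overhead(error_rate):
    """Fraction of total power spent on error detection and correction,
    assuming one potential error event per clock cycle."""
    recovery_power = ENERGY_PER_ERROR_J * error_rate * CLOCK_HZ
    return recovery_power / TOTAL_POWER_W


if __name__ == "__main__":
    print(f"recovery overhead at 10% error rate: {recovery_overhead(0.10):.2%}")
    print(f"delay buffer share of total power: "
          f"{DELAY_BUFFER_POWER_W / TOTAL_POWER_W:.2%}")
```

Under this assumption the recovery power comes to 3.78 mW, or roughly 0.9% of 425 mW, in line with the approximately 1% listed in the table. The delay buffers alone account for about 2.9% of total power, which is most of the 3.1% error-free overhead.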
flip-flops for their critical paths. Out of a total of 2408 flip-flops in the design, 192 Razor flip-flops were used. The clock for the Razor flip-flops was delayed by 1/2 the clock cycle from the system clock.

Power analysis was performed on the processor design, using both gate level power simulations and SPICE to evaluate the overhead of the error correction and detection circuits. The total power consumption during error free operation is expected to be 425 mW at 1.8 V at a clock frequency of 200 MHz. The energy consumption of the standard and Razor flip-flops over one clock cycle in error free operation is listed in Figure 7(a). Two values are shown for each flip-flop, reflecting the cases when the latched data is changing (switching) and is not changing (static). The total power overhead due to the insertion of delay buffers to meet short-path constraints in the design was simulated and is expected to be 12.2 mW. The total power overhead due to the presence of the Razor error detection and correction circuitry in error-free operation is expected to be 3.1% of the total power. The final three rows of the table show the power overhead due to error detection and recovery. The energy required to detect an error and restore the correct shadow latch data into the main flip-flop was 210 fJ per error event for each Razor flip-flop. The total energy to perform a single error detection and correction event in the Alpha pipeline was 189 pJ, resulting in a power overhead of approximately 1% of total power when operating at a 10% error rate. Note that this error detection and correction power overhead does not include the overhead due to re-execution of instructions that were flushed from the pipeline. This additional power overhead is accounted for in the architectural simulations discussed in Sections 3.4 and 3.5.

3.2 Error Rate Analysis
Razor permits a microprocessor to tolerate circuit timing errors, thereby permitting operation at a lower voltage at the expense of decreased instruction throughput. As an initial step in gauging the benefits of Razor technology, we empirically examined the error rate of an 18x18-bit multiplier block contained within a high-density FPGA. In addition, we used SPICE-level models to measure the error rates of an adder over a range of voltages and workloads.

FPGA-based analysis. The multiplier experiments were performed using a Xilinx XC2V250-F456-5 FPGA [25]. This part was selected because it contains full-custom 18x18-bit multiplier blocks, which permit the measurement of error rates for a multiplier with minimal impact due to the overhead of the FPGA routing fabric. Figure 8 illustrates the multiplier circuit under test (shaded in the schematic) and accompanying test harness. The multiplier circuit implements an 18-bit by 18-bit multiplier, producing a 36-bit result each clock cycle. During placement, synthesis was directed to foremost optimize the performance of the fast multiplier pipeline. The resulting placement is fairly efficient with the Xilinx static timing
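[Figure 8 schematic: two 48-bit LFSRs supply 18-bit operands to three 18x18 multiplier pipelines. Slow Pipeline A and Slow Pipeline B are clocked at clk/2; the Fast Pipeline (the circuit under test) is clocked at clk and is followed by a stabilize stage. The 36-bit fast result is compared (!=) with the selected slow result, and mismatches increment a 40-bit error counter.]
Figure 8. Multiplier Experiment Test Bench and Circuit Under Test.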
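
The bookkeeping performed by the test bench in Figure 8 can be modeled in software. The sketch below is a behavioral illustration only: Python's pseudo-random generator stands in for the two 48-bit LFSRs, an exact product stands in for the slow pipelines, and the function name run_harness and the fast_multiply callback are hypothetical. It does not model timing, voltage or metastability.

```python
import random


def run_harness(cycles, fast_multiply, seed=0):
    """Return the fraction of cycles in which the fast multiplier's result
    differs from the known-correct reference product."""
    rng = random.Random(seed)            # stand-in for the two LFSRs
    errors = 0
    for _ in range(cycles):
        a = rng.getrandbits(18)          # 18-bit operand
        b = rng.getrandbits(18)          # 18-bit operand
        reference = a * b                # slow pipeline: correct 36-bit product
        if fast_multiply(a, b) != reference:
            errors += 1                  # error counter increments on mismatch
    return errors / cycles


if __name__ == "__main__":
    # A correct multiplier never disagrees with the reference.
    print(run_harness(10_000, lambda a, b: a * b))
    # A hypothetical faulty multiplier that loses the top product bit.
    print(run_harness(10_000, lambda a, b: (a * b) & ((1 << 35) - 1)))
```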


analyzer (TRCE) indicating that 82% of the fast multiplier stage latency is in the custom multiplier block.

Each cycle, two 48-bit linear feedback shift registers (LFSR) generate 18-bit uncorrelated random values, which are sent to a fast multiplier pipeline, and in alternating cycles to slow multiplier pipelines. The slow multiplier pipelines take turns safely computing the fast pipeline’s results, using a clock period that is twice as long as the fast multiplier pipeline. The empty stage after the fast multiplier stage (labeled stabilize) allows potentially metastable results from the fast multiplier time to stabilize before they are compared with the known-correct slow multiplier results. A MUX on the output of the slow multiplier pipelines selects the correct result to compare against the stabilized output of the fast multiplier pipeline. If the result of the fast pipeline does not match the slow pipeline, an error counter is incremented. The performance of the design was first analyzed with the Xilinx static timing analyzer after back-propagation of FPGA interconnect capacitance. The timing analyzer indicated that the fast multiplier stage could be clocked up to 83.5 MHz at 1.5 V and 85 C. At room temperature 27 C and 1.5 V, the timing analyzer indicated that the design can run at 88.6 MHz. After the fast multiplier, the next longest critical path in the design is the 40-bit error counter, which works up to 140 MHz. As a result, we are confident that all errors experienced in these experiments are localized to the fast multiplier pipeline circuits.

Figure 9 illustrates the relationship between voltage and error rates for an 18x18-bit multiplier block running with random input vectors at 90 MHz and 27 C. The error rates are given as a percentage on a log scale. Also shown on the graph are three additional design points, gauged using the Xilinx static timing analyzer (TRCE). The zero-margin point is the lowest voltage where the circuit operates error-free at 27 C. The safety-margin point is the voltage at which the circuit runs without errors at 27 C in 90% of the baseline clock period (i.e., 10 ns at 100 MHz). We would expect this to be approximately the voltage margin required for delay-chain tuning approaches, where voltage margins are necessary to accommodate intra-die process and temperature variations. Finally, the environmental-margin point is the minimum voltage required to run without errors at 90% of the baseline clock period at the worst-case operating temperature of 85 C.

As shown in Figure 9, the multiplier circuit fails quite gracefully, taking nearly 200 mV to go from the point of the first error (1.54 V) to an error rate of 5% (1.34 V). Strikingly, at 1.52 V the error rate is approximately one error every 20 seconds, or put another way, one error per 1.8 billion multiply operations. The gradual rise in error rate is due to the dependence between circuit inputs and evaluation latency. Initially, only those changes in circuit inputs that require a complete re-evaluation of the critical path result in a timing error. As the voltage continues to drop, more and more internal multiplier circuit paths cannot complete within the clock cycle and the error rate increases. Eventually, voltage drops to the point where none of the circuit paths can complete in the clock period, and the error rate reaches 100%. Clearly, if the pipeline can tolerate a small rate of multiplier errors, it can operate with a much lower supply voltage. For instance, at 1.36 V the multiplier would complete 98.7% of all operations without error, for a total energy savings (excluding error recovery) of 22% over the zero-margin point, 30% over the safety-margin point, and 35% over the environmental-margin point.

SPICE-level analysis. To gain a deeper understanding of the nature of circuit timing errors, a circuit-level design of a 64-bit Kogge-Stone adder was implemented and analyzed. A Kogge-Stone adder is a high-performance carry-prefix adder used in a number of commercial microprocessor designs [17]. The Kogge-Stone adder is implemented with the TSMC 0.18 µm standard cell library. The capacitance and resistance for cell interconnect were estimated based on standard cell dimensions and adder topology. The delay of the standard cells was characterized for varied voltages, temperatures and fan-out. A similar delay characterization was performed for interconnect with varied wire lengths. Using these circuit-level characterizations, a high-performance C-level timing model of the Kogge-Stone adder was implemented and validated against SPICE simulations of the same baseline model. We rely on a C-level model to increase the number of sample vectors we can examine, and we integrated this model into an architectural simulator to examine the performance of the adder running with real programs. Comparing the C-model to SPICE simulations (using HSPICE version 2001.2), we found that the error for 50 random vectors never exceeded 10%.

Using the C-level models, we then generate error rate estimates using 32,000 sample vector sequences. At a given frequency and voltage, the error rate is computed as the fraction of sample vectors that do not complete within the clock period.

Figure 10 shows the error rate of the Kogge-Stone adder, as a function of voltage, for three 32,000-vector input sequence samples. For all experiments, error analysis was performed assuming an 870 MHz clock and an ambient temperature of 27 C. The sample labeled random is a random input sequence. The samples labeled ammp and bzip are adder operations sampled from the SPEC2000 benchmarks with the same name. The benchmark samples were generated by instrumenting the SimpleScalar v3.0 simulator [2] such that all
[Figure 9 plot: error rate (log scale, 0.0000001% to 100%) versus supply voltage (1.78 V down to 1.14 V) for the random input sequence. Marked design points: environmental-margin at 1.69 V, safety-margin at 1.63 V, zero-margin at 1.54 V. Annotations: one error every ~20 seconds; 22% energy saving; 30% energy saving; 35% energy savings with 1.3% error.]
Figure 9. Measured Error Rates for an 18x18-bit FPGA Multiplier Block at 90 MHz and 27 C.
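
The savings annotated in Figure 9 are consistent with energy per operation scaling as the square of the supply voltage. The check below assumes that first-order model (E proportional to V^2 at fixed frequency, which the text here does not state explicitly) and compares operation at 1.36 V with the three margin points. The function name energy_saving is illustrative.

```python
def energy_saving(v_low, v_ref):
    """Fractional energy saving of running at v_low instead of v_ref,
    assuming energy per operation scales as V^2 (first-order model)."""
    return 1.0 - (v_low / v_ref) ** 2


if __name__ == "__main__":
    margins = {
        "zero-margin": 1.54,
        "safety-margin": 1.63,
        "environmental-margin": 1.69,
    }
    for name, v_ref in margins.items():
        print(f"{name}: {energy_saving(1.36, v_ref):.0%}")

    # One error every ~20 s at 90 MHz, with one multiply per cycle.
    print(f"multiplies per error: {90e6 * 20:.1e}")
```

Under this assumption the three savings round to 22%, 30% and 35%, matching the figures quoted for the 1.36 V operating point, and 20 seconds at 90 MHz corresponds to 1.8 billion multiply operations.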
instructions using the adder (e.g., adds, subtracts, loads, stores) recorded their inputs. The benchmark samples are taken in program execution order starting at the SimPoint point of the execution, as specified by Sherwood et al. [18].

As shown in Figure 10, the random input, as with the multiplier, demonstrates a gradual rise in the error rate with decreasing voltage. We see a similar trend for the benchmark samples analyzed. The error rates for the real program samples increase even more slowly at first than the random sample sequence. For instance, the ammp benchmark experiences very few errors until 1.05 V, and bzip does not generate any substantial error rates until 1.2 V. With real program samples, the error rate tends to rise faster once errors do take hold, even performing slightly worse than the random sequence at lower voltages. However, at error rates that we would expect to be easily tolerated (e.g., below 5%), the real program samples demonstrate substantially lower operating voltages than the random sample sequence.

[Figure 10 plot: error rate (log scale, 0.01% to 100%) versus supply voltage (2 V down to 0.6 V) for the random, bzip and ammp input samples.]
Figure 10. Simulated Error Rates for a Kogge-Stone Adder at 870 MHz and 27 C.

3.3 Simulator Framework and Benchmarks
The architectural simulators used in this paper are derived from the SimpleScalar/Alpha version 3.0 tool set [2], a suite of functional and timing simulation tools for the Alpha AXP ISA. Simulation is execution-driven, including execution down any speculative path until the detection of a fault, TLB miss, or branch misprediction. The baseline processor modeled was a single-issue, in-order pipeline with the pipeline stages that are described in Section 3.1. The baseline model was modified to simulate Razor error recovery with its proper penalties. Furthermore, the detailed C-level Kogge-Stone adder model was integrated into the execute stage, where it was used to determine when voltage scaling introduced adder timing errors.

To perform our evaluation, we collected results from 11 of the SPEC2000 benchmarks. All SPEC programs were compiled for a Compaq Alpha AXP-21264 processor using the Compaq C and Fortran compilers under the OSF/1 V4.0 operating system using full compiler optimization (-O4). The simulations were run for 100 million instructions using the SPEC reference inputs. We used the SimPoint toolset’s Early SimPoints to pinpoint program locations that were highly representative of the entire program execution [18].

3.4 Energy Analysis for Fixed Voltage
[Figure 11 plot: energy and pipeline throughput (IPC) versus decreasing supply voltage. Curves: total adder energy, Eadder = Eadditions + Erecovery; energy of adder operations, Eadditions; energy of pipeline recovery, Erecovery; energy of adder w/o Razor support. The optimal Eadder point is marked.]
Figure 11. The Qualitative Relationship Between Supply Voltage, Energy and Pipeline Throughput (for a fixed frequency).

Figure 11 illustrates qualitatively the relationship between supply voltage, adder energy and pipeline throughput. The total energy consumed by the adder (Eadder) is the sum of the energy required to perform add operations (Eadditions) plus the energy required to recover the pipeline in the event of an adder timing error (Erecovery). Moreover, there is a fixed amount of energy overhead incurred to implement Razor checking for the adder. This energy is consumed by the shadow latches and comparison logic. A trade-off exists between the adder and recovery energy components. When supply voltage is decreased, the energy required to perform addition operations is decreased, but fewer of these operations are able to complete within the clock period. As a result, pipeline recovery is invoked more frequently with additional energy expense. Energy for the adder (Eadder) is optimized when any additional decrease in voltage
Table 1. Energy-Optimal Characteristics

Program    Optimal Vdd (V)    Error Rate    % Energy Reduced    % IPC Reduced
bzip       1.1                0.31%         57.6%               0.70%
crafty     1.175              0.41%         50.5%               0.60%
eon        1.3                1.21%         34.4%               1.24%
gap        1.275              1.15%         30.1%               2.49%
gcc        1.375              1.62%         23.7%               1.47%
gzip       1.3                1.03%         35.6%               0.41%
mcf        1.175              0.67%         48.7%               0.00%
parser     1.2                0.61%         47.9%               0.29%
twolf      1.275              2.67%         30.7%               0.31%
vortex     1.3                0.53%         42.8%               0.14%
vpr        1.075              0.01%         64.2%               0.00%
Average                                     42.4%

[Figure 12 plots: two panels, BZIP and GCC, each showing relative energy (Rel Energy) and relative performance (Rel Performance) on a 0.3 to 1.5 axis (Relative IPC and Energy) versus voltage. The energy-optimal points are annotated with a 0.31% error rate (BZIP) and a 1.62% error rate (GCC).]
Figure 12. Relative Adder Energy and Pipeline Throughput for Simulated Benchmarks.

results in an energy savings that is smaller than the extra energy cost incurred by more pipeline recoveries. The energy-optimal voltage varies from program to program (and even within the phases of a program) because pipeline error rate is heavily dependent on the data values sent to the adder. These trade-offs are further complicated under a pipeline performance constraint. Decreasing voltage will incur additional pipeline errors, which in turn decreases pipeline
throughput (i.e., instructions per cycle). Consequently, the program will take longer to execute. Under a performance constraint, the optimal voltage is limited to the minimal energy that meets the performance constraint.

Table 1 lists for each benchmark the energy-optimal supply voltage, average adder error rate, energy reduction, and IPC reduction at the fixed energy-optimal voltage. The simulations are performed by sweeping the voltage in 25 mV steps from 1.8 V down to 0.6 V. The voltage remains fixed for the entire simulation (i.e., each point on the graph is a different simulation). All experiments are performed at 27 C and 870 MHz, the maximum speed at which the adder runs error-free at room temperature (i.e., the zero-margin point). All Razor energy estimates were made using RTL-level power analysis of the Razor prototype physical design described in Section 3.1. The total energy of the Razor adder includes the energy of the adder, Razor latch and check circuitry, and the total pipeline recovery energy incurred when a Razor adder error is detected. The Razor latches and error detection circuitry increase adder energy by about 4.3%. Error recovery energy is conservatively estimated at 18 times the cost of a single add (at 1.8 V), based on a 6-cycle recovery sequence at typical activity rates. It should be noted that the energy savings reflect only that due to eliminating data-dependent delay margins. If comparisons were made to existing DVS techniques that require safety margins (e.g., delay line speed detector) or temperature margins (e.g., design-time DVS), the resulting energy saving would be substantially higher. Table 1 also shows the relative performance of the benchmark, given as the IPC of the program with Razor timing speculation divided by the IPC of a non-speculative pipeline. Since all the experiments are run at the same frequency, the change in IPC due to pipeline recovery reflects true performance impacts. Figure 12 illustrates the relative energy and performance across the entire supply voltage operating range, for the benchmarks bzip and gcc.

It is important to note that the energy analysis presented in this section only reflects the energy savings in the Razor pipeline adder. We only consider the energy of the entire Razor pipeline when an adder timing error occurs. In this event, all activities (and pipeline energy) are directly attributable to the Razor timing error, and thus must be counted against Razor adder energy savings. In essence, this is the adder energy savings one could expect if the adder were given its own independently tunable voltage source. Total energy reduction for the entire pipeline would only be the same if the remaining components could scale their voltage to the same degree without increasing the overall error rate.

Clearly, there is significant energy to be reclaimed by running the adder at a low error rate. All of the benchmarks experienced significant energy savings, ranging from 23.7% to 64.2%. One particularly encouraging result is that error rates and performance impacts are muted up to and slightly past the energy-optimal voltage, after which error rates rise very quickly. At the energy-optimal voltage point, the benchmarks suffered at most a 2.49% reduction in pipeline performance (due to recovery flushes). There appears to be little trade-off in performance when fully exploiting adder energy savings at subcritical voltages. While we have simulated voltages down to 0.6 V, our Razor prototype design is only capable of validating circuit timing down to 1.2 V. This constraint will limit the energy savings of four of the benchmarks. Since additional voltage scaling headroom exists, we are examining techniques to further reduce voltage on future prototype designs.

3.5 Energy Analysis for Dynamic Voltage Scaling
Reducing voltage to the energy-optimal fixed voltage point will certainly improve the energy characteristics of a system that employs Razor. In this section, we consider the potential value of dynamically adjusting supply voltage to workload characteristics. We perform these experiments by engaging the proportional control system described in Section 2.3. For the simulated experiments, we assume a voltage regulator response time of 20 cycles per 1 mV. The control system samples (and then resets) an error counter every 5000 cycles, and adjusts the voltage regulator plus or minus 25 mV, depend-
ing on the error rate differential. All simulations use a target error rate of 1.5%, which was set based on the energy-optimal error rates analyzed in the previous section.

Table 2 lists the adder energy reduction compared to a non-Razor adder at the zero-margin voltage (1.8 V). Dynamically adjusting voltage again results in substantial energy savings. Compared to the fixed voltage experiments of the previous section, about one half of the benchmarks see better energy savings, and the other half has slightly worse energy savings. With dynamic voltage scaling, most of the benchmarks ran slower, although overall performance impacts were still small, with the largest slowdown limited to just under 6%. Figure 13 illustrates the change in error rate of the adder over time and the voltage control system's response to the error rate for the gcc and gap benchmarks. Overall, the results for the proportional control system are mixed. Given that it represents a fairly unsophisticated class of control functions, further investigations into supply voltage control will likely yield additional energy savings.

Table 2. Simulated DVS Energy Savings

Program    % Energy Reduced    % IPC Reduced
bzip       54.5%               4.13%
crafty     54.8%               1.78%
eon        30.4%               0.78%
gap        12.9%               2.14%
gcc        31.3%               5.88%
gzip       44.6%               1.27%
mcf        36.9%               0.47%
parser     53.0%               1.94%
twolf      20.4%               0.06%
vortex     49.1%               1.07%
vpr        63.6%               1.66%
Average    41.0%

[Figure 13 plots: two panels, GCC and Gap, each showing supply voltage (left axis) and adder error rate (right axis) over time.]
Figure 13. Adder Error Rate and Voltage Controller Response.

4 Previous Work
Table 3 lists a number of prior proposals supporting adaptive voltage and frequency scaling. With Design-Time DVS, conservative design techniques are used to specify “legal” voltage and frequency pairs that allow reliable operation of the processor under worst-case voltage, temperature, and process conditions. Examples of systems that utilize this approach are Intel’s x86 SpeedStep technology [10] and Transmeta’s LongRun technology [20].

A Correlating VCO allows ambient margins to be eliminated; examples of this design have been proposed by Burd [3] and Gutnik [7]. The approach implements a voltage controlled oscillator using a timing loop constructed to slightly exceed the latency of the worst-case critical path in the machine, plus process and safety margins. When supply voltage changes, the oscillator speed will automatically adjust to match the fastest safe clock speed. It is important to note that this approach cannot compensate for intra-die process and temperature variations, IR drop, or noise. As a result, additional voltage margins (implemented with additional timing loop delay) are required for safe operation.

A Delay Line Speed Detector is a device that models the worst-case critical path of the system, plus a safety margin. Examples of these devices have been proposed by Dhar [5] and Uht [21]. Periodically, a signal transition is propagated down a delay chain and sampled at the end of the current clock cycle. If the signal transition does not propagate to the end of the delay chain within the clock period, the system is running too close to failure and frequency and/or voltage must be adjusted. Since the delay chain fails prior to the core circuitry, any failure detected in the delay chain will precede a core circuitry failure, assuming that the delay line is frequently monitored and the system is adjusted promptly upon detection of a delay line failure. To ensure that the delay line fails first, it is necessary to add latency margins to accommodate intra-die process and temperature variations, IR drop and noise. Unlike the Correlating VCO, it may be possible to put multiple delay line speed detectors across the die and combine their timing signals in an effort to mitigate intra-die process and temperature variation. However, some variation is inherently local (e.g., cross-coupling noise), thus some delay margin will always be required. We have not seen the use of multiple delay line detectors explored to date.

Kehl’s Triple-Latch Monitor is similar to the delay line speed detector, but like Razor, utilizes in-situ circuit monitoring [11]. Using this approach, all monitored system state is captured using three latches, clocked in succession with a small delay between each. The staggered latches provide three closely spaced samples of a logic block’s value each cycle. The value in the latest-clocked latch is assumed correct and always forwarded to later logic. The system is considered “tuned” when the first latch does not match the second and third latch values, meaning that the logic transition was very near the critical speed, but not dangerously close. If all latches see the same value, the system is running too slowly and should be sped up. If the first two latches see different values than the last, then the system is running dangerously fast and should be slowed down. Because of the in-situ nature of this approach, it could conceivably adjust to intra-die process and temperature variations. However, data-dependent delay variations complicate Kehl’s approach. To avoid too aggressively clocking the system, speedup evaluations must be limited to tests on worst-case latency vectors. Kehl suggests that the system should periodically stop and test worst-case vectors to determine if the system should be sped up.
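
The decision rule of the triple-latch monitor described above can be written as a small classifier. This is a sketch of the rule as summarized in this section, not Kehl's implementation: the function name and the returned labels are hypothetical, and the handling of samples where all three differ pairwise, or where only the middle sample differs, is our conservative assumption because the text does not cover it.

```python
def triple_latch_action(first, second, third):
    """Classify one cycle's three staggered latch samples.

    `third` is the latest-clocked latch, whose value is assumed correct.
    """
    if first == second == third:
        return "speed_up"     # logic settled early: running too slowly
    if second == third:
        return "tuned"        # only the earliest sample missed the transition
    # Remaining cases: the second sample disagrees with the last one.
    # The text describes first == second != third; treating every other
    # remaining pattern the same way is an assumption made here.
    return "slow_down"


if __name__ == "__main__":
    for samples in [(1, 1, 1), (0, 1, 1), (0, 0, 1)]:
        print(samples, triple_latch_action(*samples))
```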
Table 3. Adaptive Voltage/Frequency Scaling Landscape.

                                      Margin Eliminated?
Technique                             Data  Process  Environmental  Safety  Speculative?
Design-time DVS [10][20]              N     N        N              N       N
Correlating VCO [3][7]                N     N        Y              N       N
Delay Line Speed Detector [5][21]     N     N        Y              N       N
Triple-Latch Monitor [11]             N     Y        Y              N       N
Circuit-Level Speculation [12][24]    Y     N        N              N       Y
Razor                                 Y     Y        Y              Y       Y

Circuit-level Speculation employs logic components that operate at two speeds, a fast typical speed and a slower atypical multi-cycle speed. The components are designed with typical usage in mind, which in all published cases resulted in significantly favorable circuit speed due to shorter data-dependent circuit paths. Two prior proposals of this nature are Liu’s fast adder and scheduler designs [12] and Wolrich’s stutter adder [24]. Both fast adder designs were optimized to perform short-distance carry propagation in a single cycle, with longer carry propagations taking an additional cycle. Liu’s circuit-speculative scheduler provided very fast access to a few instructions. If dependencies warranted wake-up of other instructions, multiple cycles were required. Like Razor, circuit-level speculation benefits by exploiting typical-case evaluation latency, which for most workloads is much more favorable than worst-case latency. Unlike Razor, however, circuit-level speculation cannot adapt timing to changing workload or other margin factors such as temperature or process variation. Moreover, it is unclear how circuit-level timing speculation could be adapted to dynamic voltage scaling.

We are aware of three previous proposals that suggest using rate-matched redundant hardware to allow subcritical circuit operation. Uht’s TIMERRTOL design methodology couples an overclocked logic block with multiple safely clocked blocks of the same logic [22]. By using multiple check logic blocks, his approach can check all overclocked computation with hardware blocks that are safely clocked. Uht does not address the possibility of metastability in the fast block’s output latches or the problem of recovering system state after a timing error. Razor addresses both of these issues and utilizes an implementation that is much less expensive. Austin suggested that the DIVA checker could be over-designed to validate computation from an overclocked core processor [1], but the details of how this might be implemented were not explored. Hegde and Shanbhag proposed the use of algorithmic noise tolerance (ANT) to permit the operation of signal processing circuits at subcritical voltages [9]. They couple the signal processor with a rate-matched error predictor that limits the additional noise incurred by errant circuit computations. Using their approach, voltage can be lowered to the extent that the application can tolerate additional noise in the signal processor output.

Our pipeline recovery mechanism is inspired by Sproull’s work on asynchronous counterflow pipelines [19], which was later adapted for synchronous systems by Miller [13]. The basic idea of a counterflow pipeline is that instruction and control signals flow in a direction opposite to data values. As such, global control is not necessary as all control signals will eventually reach the appropriate point in the datapath. We use a counterflow-style pipeline to implement low-complexity recovery of the Razor pipeline in the event of a circuit error.

Razor shares many of the benefits of asynchronous designs, while mitigating many of their drawbacks. Asynchronous systems eliminate the global clock and instead utilize data-driven control to orchestrate system state changes [8],[23]. The approach has long been held up as a promising technique to improve system throughput and power. For example, asynchronous designs readily adapt to data-dependence, ambient and process variation. Unfortunately, the technique is not without drawbacks, including substantial additional design complexity to deal with hazards and ordering of operations, and more complicated system testing. While fundamentally a synchronous system, Razor can also adapt to data-dependence, ambient and process variation. Unlike asynchronous designs, Razor utilizes a traditional synchronous design style using standard tools. An additional drawback of asynchronous logic is its non-deterministic operation. Temperature variation, for instance, can change the order of logic evaluation and state transitions, making functional and electrical validation more challenging. While Razor shares this non-determinism, we feel it will not put undue burden on the verification process for two reasons. First, non-determinism is limited to whether or not a stage of the pipeline will produce an error. Bugs relating to the non-deterministic nature of the Razor pipeline will be confined to the error recovery machinery. Second, it should be possible to provide verification-time buffering of stage error signals, which would permit deterministic replay of non-deterministic executions. This support would address any reproducibility concerns during verification.

5 Conclusions

In this paper, we presented Razor, an error-tolerant dynamic voltage scaling technology. The key advantage of Razor over existing voltage scaling technologies is the use of in-situ timing error detection and correction, permitting increased energy reduction because voltage margins are completely eliminated. The Razor flip-flop was introduced as a mechanism to double-sample pipeline stage values, once with an aggressive fast clock and again with a delayed clock that guarantees a reliable second sample. A metastability-tolerant error detection circuit was described that validates all values latched on the fast Razor clock. In the event of a timing error, a modified pipeline flush mechanism restores the correct stage value into the pipeline, flushes earlier instructions, and restarts the next instruction after the errant computation.

A prototype Razor pipeline was designed and analyzed. We found that during normal (error-free) operation of the pipeline, Razor error detection increases pipeline energy demands by a modest 3.1%, compared to a non-Razor design of the architecture. Energy requirements for error recovery were much greater. We found that the energy required to fully recover the pipeline after an adder timing error was about 18 times more expensive than the errant addition.

The error rates of real and simulated circuits were explored in detail. A full-custom 18x18-bit FPGA multiplier block confirmed
that significant energy reductions are possible for real circuits, if small error rates can be tolerated. When computing on random inputs at room temperature, the multiplier circuit consumed 17% less energy when all process and temperature margins on voltage were eliminated. Continuing to decrease voltage to the point where 1.3% of operations fail consumes 35% less energy. Detailed analysis of a SPICE-level Kogge-Stone adder model reveals that real program data has more favorable error rates than random samples. Compared to random inputs, real program inputs see similar error rates at a voltage that is nearly 400 mV lower.

Architectural simulations were performed to gauge the benefits of Razor DVS in the presence of potentially expensive pipeline recoveries. Simulations at the fixed energy-optimal voltage for each benchmark revealed that even with high pipeline recovery costs (in terms of energy and performance) a Razor adder operated with 42% less energy, while only incurring at most a 2.5% reduction in pipeline throughput. The introduction of a proportional voltage control system performed nearly as well overall, suggesting that near energy-optimal voltage points could be found automatically for individual programs. In some cases, the voltage control system performed better than running with a fixed energy-optimal voltage, suggesting that program energy demands are phasic. It is likely that further improvement to the voltage control system would yield additional savings.

Looking ahead, there is much more ground to explore. In mid-November 2003, we will tape out our prototype Razor pipeline design for MOSIS fabrication. A few months later, we will have the first opportunity to analyze a complete Razor pipeline design. To increase the scope of Razor, we have begun exploring its application to memory structures and pipeline control logic. Finally, there is a great opportunity to “re-think” system design in the context of Razor. In particular, we want to investigate the design of functional units and memory structures optimized for typical-case latency. These new designs should have lower error rates, thereby creating additional opportunity to lower energy demands.

Acknowledgements

This work was supported by ARM, an Intel Graduate Fellowship, the Defense Advanced Research Projects Agency, the Semiconductor Research Corporation, the Gigascale Systems Research Center, the National Science Foundation, and the Sloan Foundation.

References

[1] T. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” 32nd Int’l Symposium on Microarchitecture, Nov. 1999.
[2] T. Austin, E. Larson, D. Ernst, “SimpleScalar: an Infrastructure for Computer System Modeling,” IEEE Computer, 35 (2), February 2002.
[3] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A Dynamic Voltage Scaled Microprocessor System,” Int’l Solid-State Circuits Conf., Feb. 2000.
[4] W. Dally, J. Poulton, “Digital System Engineering,” Cambridge Press, 1998.
[5] S. Dhar, D. Maksimovic, and B. Kranzen, “Closed-Loop Adaptive Voltage Scaling Controller For Standard-Cell ASICs,” 2002 Int’l Symposium on Low Power Electronics and Design (ISLPED-2002), August 2002.
[6] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and Threshold Voltage Scaling for Low Power CMOS,” IEEE JSSC, 32 (8), August 1997.
[7] V. Gutnik and A. Chandrakasan, “An Efficient Controller for Variable Supply-Voltage Low Power Processing,” Symp. on VLSI Circuits, June 1996.
[8] S. Hauck, “Asynchronous Design Methodologies: An Overview,” Proceedings of the IEEE, 83 (1), January 1995.
[9] R. Hegde and N. Shanbhag, “Energy-efficient signal processing via algorithmic noise-tolerance,” 1999 International Symposium on Low-Power Electronics and Design (ISLPED-99), August 1999.
[10] Intel Corp., “Intel SpeedStep Technology,” http://www.intel.com.
[11] T. Kehl, “Hardware Self-Tuning and Circuit Performance Monitoring,” 1993 Int’l Conference on Computer Design (ICCD-93), October 1993.
[12] T. Liu and S. Lu, “Performance Improvement with Circuit-Level Speculation,” 33rd Annual International Symposium on Microarchitecture (MICRO-33), December 2000.
[13] M. Miller, K. Janik and S.-L. Lu, “Non-Stalling Counterflow Microarchitecture,” 4th International Symposium on High Performance Computer Architecture (HPCA-4), February 1998.
[14] T. Mudge, “Power: A first class design constraint,” Computer, vol. 34, no. 4, April 2001, pp. 52-57.
[15] K. Ogata, “Modern Control Engineering,” 4th ed., Prentice Hall, 2002.
[16] T. Pering, T. Burd, and R. Brodersen, “The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms,” Proceedings of Int’l Symposium on Low Power Electronics and Design 1998, pp. 76-81, June 1998.
[17] J. Rabaey, “Digital Integrated Circuits,” Prentice Hall, 1996.
[18] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, “Automatically Characterizing Large Scale Program Behavior,” 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), October 2002.
[19] R. Sproull, I. Sutherland, and C. Molnar, “Counterflow Pipeline Processor Architecture,” Sun Microsystems Report SMLI-TR-94-25, April 1994.
[20] Transmeta Corporation, “LongRun Power Management,” http://www.transmeta.com/technology/architecture/longrun.html.
[21] A. Uht, “Uniprocessor Performance Enhancement Through Adaptive Clock Frequency Control,” 2003 International Conference on Advances in Infrastructure for e-Business, e-Education, e-Science, e-Medicine, and Mobile Technologies on the Internet (SSGRR 2003w), January 2003.
[22] A. Uht, “Achieving Typical Delays in Synchronous Systems via Timing Error Toleration,” University of Rhode Island TR-032000-0100, March 2000.
[23] S. Unger, “Asynchronous Sequential Switching Circuits,” New York: Wiley-Interscience, John Wiley & Sons, Inc., 1969.
[24] G. Wolrich, E. McLellan, L. Harada, J. Montanaro, and R. Yodlowski, “A High Performance Floating Point Coprocessor,” IEEE Journal of Solid-State Circuits, 19 (5), October 1984.
[25] Xilinx Corporation, “Virtex-II Platform FPGA,” http://www.xilinx.com/products/tables/fpga.htm#v2
