Time Base Circuit
Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham,
Conrad Ziesler, David Blaauw, Todd Austin, Krisztian Flautner, and Trevor Mudge
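[Figure 1 graphic (labels only): (a) logic stage L1 feeding a Razor FF (0/1 input multiplexer, main flip-flop on clock, shadow latch on clock_d, comparator producing Error_L) whose output Q1 drives logic stage L2; (b) timing diagram of clock, clock_d, D (instr 1, instr 2), and Error.]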
Figure 1. Pipeline augmented with Razor latches and control lines.
tion of safety margins to the critical voltage. Also, the delay of an inverter chain does not scale with voltage and temperature in the same way as the delays of the critical paths of the actual design, which can contain complex gates and pass-transistor logic, which again necessitate extra voltage safety margins. In future technologies, the local component of environmental and process variation is expected to become more prominent and, as noted in [6], the sensitivity of circuit performance to these variations is higher at lower operating voltages, thereby increasing the necessary margins and reducing the scope for energy savings.

In this paper, we propose a new approach to DVS, referred to as Razor, which is based on dynamic detection and correction of speed path failures in digital designs. The key idea of Razor is to tune the supply voltage by monitoring the error rate during operation. Since this error detection provides in-situ monitoring of the actual circuit delay, it accounts for both global and local delay variations and does not suffer from voltage scaling disparities. It therefore eliminates the need for voltage margins that are necessary for “always-correct” circuit operation in traditional designs. In addition, a key feature of Razor is that operation at sub-critical supply voltages does not constitute a catastrophic failure, but instead represents a trade-off between the power penalty incurred from error correction and the additional power savings obtained from operating at a lower supply voltage.

It was previously observed that circuit delay is strongly data dependent, and only exhibits its worst-case delay for very specific instruction and data sequences [24]. From this it can be conjectured that for moderately sub-critical supply voltages only a few critical instructions will fail, while a majority of instructions will continue to operate correctly. Our hardware measurements and circuit simulation studies support this conjecture and demonstrate that the circuit operation degrades gracefully for sub-critical supply voltages, showing a gradual increase in the error rate. The proposed Razor approach automatically exploits this data-dependence of circuit delay by tuning the supply voltage to obtain a small, but non-zero error rate. It was found that if the error rate is maintained sufficiently low, the power overhead from error correction is minimal, while substantial power savings are obtained due to operating the circuit at a lower supply voltage. Note that as the processor executes different sets of instructions, the supply voltage automatically adjusts to the delay characteristics of the executed instruction sequence, lowering the supply voltage for instruction sequences with many non-critical instructions, and raising the supply voltage for instruction sequences that are more delay intensive.

We propose a combination of circuit and architectural techniques for low-cost in-situ error detection and correction of delay failures. At the circuit level, each delay-critical flip-flop is augmented with a so-called shadow latch which is controlled using a delayed clock. The operating voltage is constrained such that the worst-case delay is guaranteed to meet the shadow latch setup time, even though the main flip-flop could fail. By comparing the values latched by the flip-flop and the shadow latch, a delay error in the main flip-flop is detected. The value in the shadow latch, which is guaranteed to be correct, is then utilized to correct the delay failure. We present several architectural solutions for error correction, ranging from simple clock gating to more sophisticated mechanisms that augment the existing misspeculation recovery infrastructure.

The proposed Razor technique was implemented in a prototype 64-bit Alpha processor design. This prototype implementation was used to obtain a realistic prediction of the power overhead for in-situ error correction and detection. We also studied the error-rate trends for datapath components using both circuit-level simulation and silicon measurements of a full-custom multiplier block. Architectural simulations were then performed to analyze the overall throughput and power characteristics of Razor-based DVS for different benchmark test programs. We demonstrate that on average, Razor reduced simulated power consumption by more than 40%, compared to traditional design-time DVS and delay-chain based approaches.

The remainder of this paper is organized as follows. In Section 2, we present the implementation of Razor, providing a detailed description of both the proposed circuit and architectural techniques. In Section 3, we discuss the simulation framework for Razor-based DVS and present error rate studies and our simulation results. In Section 4, we present a detailed survey of prior work in DVS. Finally, in Section 5, we draw our conclusions.

2 Razor Error Detection/Correction

Razor relies on a combination of architectural and circuit level techniques for efficient error detection and correction of delay path failures. The concept of Razor is illustrated in Figure 1(a) for a pipeline stage. Each flip-flop in the design is augmented with a so-called shadow latch which is controlled by a delayed clock. We illustrate the operation of a Razor flip-flop in Figure 1(b). In clock cycle 1, the combinational logic L1 meets the setup time by the rising edge of the clock and both the main flip-flop and the shadow latch will latch the correct data. In this case, the error signal at the output of the XOR gate remains low and the operation of the pipeline is unaltered.

In cycle 2 in Figure 1(b), we show an example of the operation when the combinational logic exceeds the intended delay due to sub-critical voltage scaling. In this case, the data is not latched by the main flip-flop, but since the shadow latch operates using a delayed clock, it successfully latches the data some time in cycle 3. To guarantee that the shadow latch will always latch the input data correctly, the allowable operating voltage is constrained at design time such that under worst-case conditions, the logic delay does not exceed the setup time of the shadow latch. By comparing the valid data of the shadow latch with the data in the main flip-flop, an error signal is then generated in cycle 3 and in the subsequent cycle, cycle 4, the valid data in the shadow latch is restored into the main flip-flop and becomes available to the next pipeline stage L2. Note that the local error signals Error_L are OR’ed together to ensure that the data in all flip-flops is restored even when only one of the Razor flip-flops generates an error.
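The double-sampling behavior described above can be sketched as a small timing model. This is only an illustration under stated assumptions: the function name `razor_sample`, its parameters, and the simplification that a late main flip-flop captures the stale input value are hypothetical and not taken from the paper; setup and hold times of the individual elements are ignored.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main_value: int    # value captured by the main flip-flop
    shadow_value: int  # value captured by the shadow latch
    error: bool        # comparator (XOR) output for this sample


def razor_sample(new_value, old_value, logic_delay, clock_period, shadow_delay):
    """Model one clock edge of a Razor flip-flop (hypothetical helper).

    new_value    -- value the logic stage is computing this cycle
    old_value    -- stale value still on the input before the logic settles
    logic_delay  -- time the combinational logic needs to settle
    clock_period -- time available before the main flip-flop samples
    shadow_delay -- extra time granted to the shadow latch by the delayed clock
    """
    if logic_delay > clock_period + shadow_delay:
        # The design-time voltage constraint in the text rules this case out.
        raise ValueError("logic delay violates the shadow latch constraint")
    # Assumption: a main flip-flop that samples too early sees the stale value.
    main = new_value if logic_delay <= clock_period else old_value
    shadow = new_value  # correct by the constraint checked above
    return RazorSample(main, shadow, main != shadow)


on_time = razor_sample(1, 0, logic_delay=4.0, clock_period=5.0, shadow_delay=2.5)
late = razor_sample(1, 0, logic_delay=6.0, clock_period=5.0, shadow_delay=2.5)
print(on_time.error, late.error)
```

In the late case the error flag is raised and `shadow_value` holds the value that would be restored into the main flip-flop in the following cycle.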
If an error occurs in pipeline stage L1 in a particular clock cycle, the data in L2 in the following clock cycle is incorrect and must be flushed from the pipeline using one of the pipeline control methods described in Section 2.2. However, since the shadow latch contains the correct output data of pipeline stage L1, the instruction does not need to be re-executed through this failing stage. Thus, a

[Figure 2 graphic (labels only): Razor flip-flop circuit with clock inputs clk, clk_b, and clk_del_b, data input D and output Q, a meta-stability detector built from inverters Inv_n and Inv_p, and the Error_L output.]
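[Figure 4 graphic (labels only): a) pipeline stages IF, ID, EX, MEM with a Razor FF after each stage, a stabilizer FF before the register/memory state (reg/mem), the PC, per-stage error signals, and a clock timing detail labeled clock, clock_del, tdelay, and thold; b) instruction timing with rows "IF ID EX MEM ST stall WB" and "IF ID stall EX MEM ST".]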
Figure 4. Pipeline recovery using global clock gating. Figure a) shows the pipeline organization, Figure b) illustrates the pipeline timing for a failure in the EX stage of the pipeline. The “*” denotes a failed stage computation.

voltage does not resolve to a definite high or low voltage, but instead hovers near Vdd/2 [4]. The danger of meta-stability is that different fan-out gates may interpret this indeterminate voltage level as different logic states, or may even enter a meta-stable state themselves. It is important to note that, since the minimum sub-critical voltage is constrained such that the setup time of the shadow latch is always met, the shadow latch is stable and cannot exhibit meta-stability.

However, if the main flip-flop is meta-stable, it is impossible to determine if its latched value is correct or not using the XOR gate in Figure 2. Hence, we include a meta-stability detector circuit in the Razor flip-flop which detects the presence of meta-stable voltage levels, as shown in Figure 2. A detected meta-stability event is corrected the same way as a regular delay failure, and results in the stable and correct data value from the shadow latch being restored in the main flip-flop. For simplicity, the meta-stability detector in Figure 2 is constructed using two inverter gates with different skewed P/N ratios, such that they switch at different voltage levels. If the two inverters interpret the result differently, the flip-flop voltage is not definitive and may be meta-stable. Note that any suitable comparator circuit could be utilized and that these meta-stability events do not result in a failure of the system but are corrected using the existing Razor error correction infrastructure.

However, it is well known that complete system failure due to meta-stability cannot be completely avoided and only its probability of occurrence can be reduced to negligible levels [4]. In the proposed Razor design, this manifests itself in the small but finite probability that the error signal itself becomes meta-stable. This could occur if the main flip-flop output voltage was near the edge of the meta-stable voltage range and, hence, the meta-stability detector was unable to determine if a meta-stability event occurred or not. In this case, the error signal will not resolve to a definite voltage level and ambiguity will exist in the logic value of the error signal, possibly causing a failure in the error correction mechanism. A standard approach to reduce the probability of such an event to negligible levels is to double latch the signal. However, this would delay the detection of an error in the main flip-flop by one cycle, complicating the error recovery mechanism. We therefore employ at the same time an additional mechanism to detect meta-stable error signals, where the error signal is double latched using two skewed flip-flops. The probability that the outputs of the second set of flip-flops are meta-stable is hence reduced to a negligible level and by comparing their output values, the presence of a meta-stable error signal one cycle earlier can be reliably detected. Under normal operation, the error signal will resolve to a definite voltage level and the output values of the two skewed flip-flops will match, indicating that the performed error correction was executed correctly. However, in the unlikely event that the error signal is meta-stable, the outputs of the skewed latches will differ in the subsequent clock cycle, indicating that the error correction was unsafe and could have failed. In this case, a so-called panic signal is generated, which requires that the entire pipeline is flushed and restarted. In this case, guaranteed forward progress is lost, and the supply voltage level must be raised to avoid possible perpetual failure of the same instruction. However, the possibility of a meta-stable error signal is extremely small and does not constitute a significant burden on the power and performance of the processor. Also, only one set of double latches is needed for each pipeline stage, meaning that the power overhead during error-free operation is negligible.

2.2 Pipeline error recovery mechanisms

The pipeline error recovery mechanism must guarantee that, in the presence of Razor errors, register and memory state is not corrupted with an incorrect value. In this section, we highlight two possible approaches to implementing pipeline error recovery. The first is a simple but slow method based on clock gating, while the second method is a much more scalable technique based on counterflow pipelining.

Recovery using clock gating. Figure 4(a) illustrates a simple approach to pipeline error recovery based on global clock gating. In the event that any stage detects a Razor error, the entire pipeline is stalled for one cycle by gating the next global clock edge. The additional clock period allows every stage to recompute its result using the Razor shadow latch as input. Consequently, any previously forwarded errant values will be replaced with the correct value from the Razor shadow latch. Since all stages re-evaluate their result with the Razor shadow latch input, any number of errors can be tolerated in a single cycle and forward progress is guaranteed. If all stages produce an error each cycle, the pipeline will continue to run, but at 1/2 the normal speed.

It is imperative that errant pipeline results not be written to architected state before they have been validated by Razor. Since validation of Razor values takes two additional cycles (i.e., one for error detection and one for panic detection), there must be two non-speculative stages between the last Razor latch and the writeback (WB) stage. In our design, memory accesses to the data cache are non-speculative; hence, only one additional stage, labeled ST for stabilize, is required before writeback (WB). The ST stage introduces an additional level of register bypass. Since store instructions must execute non-speculatively, they are performed in the WB stage of the pipeline.

Figure 4(b) gives a pipeline timing diagram of a pipeline recovery for an instruction that fails in the EX stage of the pipeline. The first failed stage computation occurs in the 4th cycle, when the second instruction computes an incorrect result in the EX stage of the pipeline. This error is detected in the 5th cycle, but only after the MEM stage has computed an incorrect result using the errant value
[Figure graphic (labels only): counterflow-style recovery. a) pipeline stages IF, ID, EX, MEM, ST, WB with a Razor FF after each stage, a stabilizer FF, the PC, state marked (read-only) and (reg/mem), per-stage error, bubble, and recover signals, flushID signals, and a flush control block; timing panel with axes Time (in cycles) and Instructions, annotated "Razor detects fault, forwards bubble toward WB, initiates flush toward IF" and "Pipeline flush completes".]

instruction to writeback. Panic situations complicate the guarantee of forward progress, as the delay in detecting the situation may result in the correct result being overwritten in the Razor shadow latch. Consequently, after experiencing a panic, the supply voltage is reset to a known-safe operating level, and the pipeline is restarted. Once re-tuned, the errant instruction should complete without errors as long as re-tuning is prohibited until after this instruction completes.

A key requirement of the pipeline recovery control is that it not fail under even the worst operating conditions (e.g., low voltage, high temperature and high process variation). This requirement is met through a conservative design approach that validates the timing

[Figure graphic (labels only): supply voltage control loop with a reference Eref, a subtractor (-), a control function, a regulator, and the pipeline, with error signals and a panic signal coming back from the pipeline.]
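The labels in the control-loop figure (Eref, a control function, a regulator, and the pipeline error signals) suggest a feedback loop that steers the supply voltage toward a target error rate. The following is a minimal sketch of a proportional controller of that kind, not the paper's controller: the gain, the voltage limits, and the `toy_error_rate` plant model are all hypothetical values chosen only so the loop can be run.

```python
def proportional_step(v_now, error_rate, e_ref, gain, v_min, v_max):
    """Return the next supply voltage for one control interval.

    A measured error rate below the target e_ref lowers the voltage,
    an error rate above the target raises it.
    """
    e_diff = e_ref - error_rate
    v_next = v_now - gain * e_diff
    return min(v_max, max(v_min, v_next))  # clamp to the allowed range


def toy_error_rate(v):
    """Hypothetical plant: one decade more errors per 100 mV below 1.0 V."""
    return min(1.0, 10.0 ** (-(v - 1.0) / 0.1))


def settle(v_start=1.8, e_ref=0.01, gain=0.5, steps=300):
    """Run the loop against the toy plant and return the final voltage."""
    v = v_start
    for _ in range(steps):
        v = proportional_step(v, toy_error_rate(v), e_ref, gain, 0.8, 1.8)
    return v


print(f"settled near {settle():.3f} V")
```

With this toy plant a 1% error rate corresponds to 1.2 V, so the loop drifts down from the starting voltage and settles close to that point. A real regulator would also need the panic path described in the text, which resets the voltage to a known-safe level.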
Error free operation
  Total power                                      425 mW
  Standard FF energy (switching/static)            49 fJ / 95 fJ
  Razor FF energy (switching/static)               60 fJ / 160 fJ
  Total delay buffer power overhead                12.2 mW
  % total power overhead                           3.1%
Error correction and recovery overhead
  Energy per Razor FF per error event              210 fJ
  Total energy per error event                     189 pJ
  Razor FF recovery overhead at 10% error rate     1%

[Figure 7(b) graphic (labels only): die photo with IF, ID, EX, MEM, and D-Cache regions marked; dimension label 3.3 mm.]

(a) (b)
Figure 7. Razor prototype implementation details and die photo.
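The recovery overhead row of Figure 7(a) can be cross-checked from the other reported numbers. The sketch below assumes that a "10% error rate" means one error event in 10% of clock cycles and uses the 200 MHz clock frequency quoted in the accompanying text; the function name is hypothetical.

```python
def recovery_overhead(error_rate, clock_hz, energy_per_error_j, total_power_w):
    """Return (recovery power in watts, fraction of total power)."""
    errors_per_second = error_rate * clock_hz
    recovery_power_w = errors_per_second * energy_per_error_j
    return recovery_power_w, recovery_power_w / total_power_w


power_w, fraction = recovery_overhead(
    error_rate=0.10,             # assumed: errors per clock cycle
    clock_hz=200e6,              # 200 MHz clock
    energy_per_error_j=189e-12,  # 189 pJ per error event
    total_power_w=425e-3,        # 425 mW error-free total power
)
print(f"{power_w * 1e3:.2f} mW, {fraction:.2%} of total power")
```

Under that assumption the recovery power comes to about 3.78 mW, or roughly 0.9% of the 425 mW total, which is consistent with the approximately 1% figure in the table.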
flip-flops for their critical paths. Out of a total of 2408 flip-flops in the design, 192 Razor flip-flops were used. The clock for the Razor flip-flops was delayed by 1/2 the clock cycle from the system clock.

Power analysis was performed on the processor design, using both gate level power simulations and SPICE to evaluate the overhead of the error correction and detection circuits. The total power consumption during error free operation is expected to be 425 mW at 1.8 V at a clock frequency of 200 MHz. The energy consumption of the standard and Razor flip-flops over one clock cycle in error free operation is listed in Figure 7(a). Two values are shown for each flip-flop, reflecting the cases when the latched data is changing (switching) and is not changing (static). The total power overhead due to the insertion of delay buffers to meet short-path constraints in the design was simulated and is expected to be 12.2 mW. The total power overhead due to the presence of the Razor error detection and correction circuitry in error-free operation is expected to be 3.1% of the total power. The final three rows of the table show the power overhead due to error detection and recovery. The energy required to detect an error and restore the correct shadow latch data into the main flip-flop was 210 fJ per error event for each Razor flip-flop. The total energy to perform a single error detection and correction event in the Alpha pipeline was 189 pJ, resulting in a power overhead of approximately 1% of total power when operating at a 10% error rate. Note that this error detection and correction power overhead does not include the overhead due to re-execution of instructions that were flushed from the pipeline. This additional power overhead is accounted for in the architectural simulations discussed in Sections 3.4 and 3.5.

3.2 Error rate analysis

Razor permits a microprocessor to tolerate circuit timing errors, thereby permitting operation at a lower voltage at the expense of decreased instruction throughput. As an initial step in gauging the benefits of Razor technology, we empirically examined the error rate of an 18x18-bit multiplier block contained within a high-density FPGA. In addition, we used SPICE-level models to measure the error rates of an adder over a range of voltages and workloads.

FPGA-based analysis. The multiplier experiments were performed using a Xilinx XC2V250-F456-5 FPGA [25]. This part was selected because it contains full-custom 18x18-bit multiplier blocks, which permit the measurement of error rates for a multiplier with minimal impact due to the overhead of the FPGA routing fabric. Figure 8 illustrates the multiplier circuit under test (shaded in the schematic) and accompanying test harness. The multiplier circuit implements an 18-bit by 18-bit multiplier, producing a 36-bit result each clock cycle. During placement, synthesis was directed to foremost optimize the performance of the fast multiplier pipeline. The resulting placement is fairly efficient with the Xilinx static timing
[Figure 8 graphic (labels only): multiplier test harness with a fast pipeline (18x18 multiplier, 36-bit result, stabilize stage) and two slow pipelines A and B (18x18 multipliers clocked at clk/2), 48-bit LFSRs supplying 18-bit operands, a != comparator, a 40-bit error counter, and a 40-bit counter.]
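The harness labels above (LFSR operand generators, a fast multiplier checked against slow reference pipelines, a != comparator, and an error counter) can be mirrored in software to show how an error rate is accumulated. This is a sketch only: it uses a 16-bit LFSR as a stand-in for the 48-bit ones in the figure, and the fast multiplier's failures come from a hypothetical random bit-flip model rather than from real timing behavior.

```python
import random


def lfsr16(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def fast_multiply(a, b, fail_prob, rng):
    """Hypothetical fault model: sometimes flip one bit of the 36-bit product."""
    product = a * b
    if rng.random() < fail_prob:
        product ^= 1 << rng.randrange(36)
    return product


def measure_error_rate(cycles, fail_prob, seed=1):
    """Compare the fast result with a reference product and count mismatches."""
    rng = random.Random(seed)
    a_state, b_state = 0xACE1, 0x1D2C  # arbitrary non-zero LFSR seeds
    errors = 0
    for _ in range(cycles):
        a_state, b_state = lfsr16(a_state), lfsr16(b_state)
        reference = a_state * b_state  # slow pipeline, assumed always correct
        if fast_multiply(a_state, b_state, fail_prob, rng) != reference:
            errors += 1  # plays the role of the error counter
    return errors / cycles


print(measure_error_rate(cycles=10_000, fail_prob=0.013))
```

In the real experiment the failure probability is not a parameter but a consequence of the supply voltage; sweeping the voltage and recording the counter ratio yields curves like those in Figure 9.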
Figure 9. Measured Error Rates for an 18x18-bit FPGA Multiplier Block at 90 MHz and 27 C.
[Figure graphic (labels only): error rate on a logarithmic axis (100.00% down to 0.00%) versus supply voltage (2 V down to 0.6 V) for the random, bzip, and ammp inputs.]

[Figure graphic (labels only): energy and pipeline throughput (IPC) versus supply voltage, with curves for total adder energy E_adder = E_additions + E_recovery, energy of adder operations E_additions, energy of pipeline recovery E_recovery, energy of the adder without Razor support, and the optimal E_adder point.]
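The relation labeled in the figure, E_adder = E_additions + E_recovery, implies an energy-optimal voltage: lowering the voltage shrinks E_additions but grows E_recovery as errors become frequent. The sketch below illustrates that trade-off with made-up models. The quadratic voltage dependence of operation energy is the usual dynamic-energy approximation, the exponential `error_rate` curve is purely hypothetical, and the recovery cost of 18 additions per error is borrowed from the figure quoted in the conclusions.

```python
def error_rate(v, v_fail=1.0, decade_mv=100.0):
    """Hypothetical error model: 100% at v_fail, 10x fewer errors per 100 mV above."""
    exponent = -(v - v_fail) * 1000.0 / decade_mv
    return min(1.0, 10.0 ** exponent)


def adder_energy(v, recovery_ratio=18.0):
    """Relative energy per addition: E_adder = E_additions + E_recovery."""
    e_additions = v * v  # dynamic energy assumed to scale with V^2
    e_recovery = recovery_ratio * error_rate(v) * e_additions
    return e_additions + e_recovery


def energy_optimal_voltage(v_min=0.8, v_max=1.8, step=0.025):
    """Sweep a voltage grid and return the point with the lowest energy."""
    count = int(round((v_max - v_min) / step)) + 1
    voltages = [v_min + i * step for i in range(count)]
    return min(voltages, key=adder_energy)


v_opt = energy_optimal_voltage()
print(f"optimum near {v_opt:.3f} V, relative energy {adder_energy(v_opt):.3f}")
```

With these toy parameters the minimum falls where the error rate is a fraction of a percent; below that point each further voltage step saves less energy than the extra recoveries cost.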
Table 1. Energy-Optimal Characteristics

gcc      1.375   1.62%   23.7%   1.47%
gzip     1.3     1.03%   35.6%   0.41%
mcf      1.175   0.67%   48.7%   0.00%
parser   1.2     0.61%   47.9%   0.29%

[Figure graphic (labels only): relative energy (Rel Energy) and relative performance (Rel Performance) plotted against voltage; one panel is titled GCC.]

results in an energy savings that is smaller than the extra energy cost incurred by more pipeline recoveries. The energy-optimal voltage varies from program to program (and even within the phases of a program) because pipeline error rate is heavily dependent on the data
[Figure graphic (labels only): supply voltage (axis from 0.8 to 2) and error rate (axis from 0.00% to 40.00%) plotted against time.]

Table 2. Simulated DVS Energy Savings

Program   % Energy Reduced   % IPC Reduced
bzip      54.5%              4.13%
crafty    54.8%              1.78%
gcc       31.3%              5.88%
gzip      44.6%              1.27%
vortex    49.1%              1.07%
Circuit-level Speculation employs logic components that operate at two speeds, a fast typical speed and a slower atypical multi-cycle speed. The components are designed with typical usage in mind, which in all published cases resulted in significantly favorable circuit speed due to shorter data-dependent circuit paths. Two prior proposals of this nature include Liu’s fast adder and scheduler designs [12] and Wolrich’s stutter adder [24]. Both fast adder designs were optimized to perform short-distance carry propagation in a single cycle, with longer carry propagations taking an additional cycle. Liu’s circuit-speculative scheduler provided very fast access to a few instructions. If dependencies warranted wake-up of other instructions, multiple cycles were required. Like Razor, circuit-level speculation benefits by exploiting typical-case evaluation latency, which for most workloads is much more favorable than worst-case latency. Unlike Razor, however, circuit-level speculation cannot adapt timing to changing workload or other margin factors such as temperature or process variation. Moreover, it is unclear how circuit-level timing speculation could be adapted to dynamic voltage scaling.

We are aware of three previous proposals that suggest using rate-matched redundant hardware to allow subcritical circuit operation. Uht’s TIMERRTOL design methodology couples an overclocked logic block with multiple safely clocked blocks of the same logic [22]. By using multiple check logic blocks, his approach can check all overclocked computation with hardware blocks that are safely clocked. Uht does not address the possibility of metastability in the fast block’s output latches or the problem of recovering system state after a timing error. Razor addresses both of these issues and utilizes an implementation that is much less expensive. Austin suggested that the DIVA checker could be over-designed to validate computation from an overclocked core processor [1], but the details of how this might be implemented were not explored. Hegde and Shanbhag proposed the use of algorithmic noise tolerance (ANT) to permit the operation of signal processing circuits at subcritical voltages [9]. They couple the signal processor with a rate-matched error predictor that limits the additional noise incurred by errant circuit computations. Using their approach, voltage can be lowered to the extent that the application can tolerate additional noise in the signal processor output.

Our pipeline recovery mechanism is inspired by Sproull’s work on asynchronous counterflow pipelines [19], which was later adapted for synchronous systems by Miller [13]. The basic idea of a counterflow pipeline is that instruction and control signals flow in a direction opposite to data values. As such, global control is not necessary as all control signals will eventually reach the appropriate point in the datapath. We use a counterflow-style pipeline to implement low-complexity recovery of the Razor pipeline in the event of a circuit error.

Razor shares many of the benefits of asynchronous designs, while mitigating many of their drawbacks. Asynchronous systems eliminate the global clock and instead utilize data-driven control to orchestrate system state changes [8],[23]. The approach has long been held up as a promising technique to improve system throughput and power. For example, asynchronous designs readily adapt to data-dependence, ambient and process variation. Unfortunately, the technique is not without drawbacks, including substantial additional design complexity to deal with hazards and ordering of operations, and more complicated system testing. While fundamentally a synchronous system, Razor can also adapt to data-dependence, ambient and process variation. Unlike asynchronous designs, Razor utilizes a traditional synchronous design style using standard tools. An additional detractor for the use of asynchronous logic is its non-deterministic operation. Temperature variation, for instance, can change the order of logic evaluation and state transitions, making functional and electrical validation more challenging. While Razor shares this non-determinism, we feel it will not put undue burden on the verification process for two reasons. First, non-determinism is limited to whether or not a stage of the pipeline will produce an error. Bugs relating to the non-deterministic nature of the Razor pipeline will be confined to the error recovery machinery. Second, it should be possible to provide verification-time buffering of stage error signals, which would permit deterministic replay of non-deterministic executions. This support would address any reproducibility concerns during verification.

5 Conclusions

In this paper, we presented Razor, an error-tolerant dynamic voltage scaling technology. The key advantage of Razor over existing voltage scaling technologies is the use of in-situ timing error detection and correction, permitting increased energy reduction because voltage margins are completely eliminated. The Razor flip-flop was introduced as a mechanism to double-sample pipeline stage values, once with an aggressive fast clock and again with a delayed clock that guarantees a reliable second sample. A metastability-tolerant error detection circuit was described that validates all values latched on the fast Razor clock. In the event of a timing error, a modified pipeline flush mechanism restores the correct stage value into the pipeline, flushes earlier instructions, and restarts the next instruction after the errant computation.

A prototype Razor pipeline was designed and analyzed. We found that during normal (error-free) operation of the pipeline, Razor error detection increases pipeline energy demands by a modest 3.1%, compared to a non-Razor design of the architecture. Energy requirements for error recovery were much greater. We found that the energy required to fully recover the pipeline after an adder timing error was about 18 times more expensive than the errant addition.

The error rates of real and simulated circuits were explored in detail. A full-custom 18x18-bit FPGA multiplier block confirmed
that significant energy reductions are possible for real circuits, if small error rates can be tolerated. When computing on random inputs at room temperature, the multiplier circuit consumed 17% less energy when all process and temperature margins on voltage were eliminated. Continuing to decrease voltage to the point where 1.3% of operations fail consumes 35% less energy. Detailed analysis of a SPICE-level Kogge-Stone adder model reveals that real program data has more favorable error rates than random samples. Compared to random inputs, real program inputs see similar error rates at a voltage that is nearly 400 mV lower.

Architectural simulations were performed to gauge the benefits of Razor DVS in the presence of potentially expensive pipeline recoveries. Simulations at the fixed energy-optimal voltage for each benchmark revealed that even with high pipeline recovery costs (in terms of energy and performance) a Razor adder operated with 42% less energy, while only incurring at most a 2.5% reduction in pipeline throughput. The introduction of a proportional voltage control system performed nearly as well overall, suggesting that near energy-optimal voltage points could be found automatically for individual programs. In some cases, the voltage control system performed better than running with a fixed energy-optimal voltage, suggesting that program energy demands are phasic. It is likely that further improvement to the voltage control system would render additional savings.

Looking ahead, there is much more ground to explore. In mid-November 2003, we will tape out our prototype Razor pipeline design for MOSIS fabrication. A few months later, we will have the first opportunity to analyze a complete Razor pipeline design. To increase the scope of Razor, we have begun exploring its application to memory structures and pipeline control logic. Finally, there is a great opportunity to “re-think” system design in the context of Razor. In particular, we want to investigate the design of functional units and memory structures optimized for typical-case latency. These new designs should have lower error rates, thereby creating additional opportunity to lower energy demands.

Acknowledgements

This work was supported by ARM, an Intel Graduate Fellowship, the Defense Advanced Research Projects Agency, the Semiconductor Research Corporation, the Gigascale Systems Research Center, the National Science Foundation, and the Sloan Foundation.

References

[1] T. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” 32nd Int’l Symposium on Microarchitecture, Nov. 1999.
[2] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: an Infrastructure for Computer System Modeling,” IEEE Computer, 35 (2), February 2002.
[3] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A Dynamic Voltage Scaled Microprocessor System,” Int’l Solid-State Circuits Conf., Feb. 2000.
[4] W. Dally and J. Poulton, “Digital Systems Engineering,” Cambridge University Press, 1998.
[5] S. Dhar, D. Maksimovic, and B. Kranzen, “Closed-Loop Adaptive Voltage Scaling Controller For Standard-Cell ASICs,” 2002 Int’l Symposium on Low Power Electronics and Design (ISLPED-2002), August 2002.
[6] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and Threshold Voltage Scaling for Low Power CMOS,” IEEE JSSC, 32 (8), August 1997.
[7] V. Gutnik and A. Chandrakasan, “An Efficient Controller for Variable Supply-Voltage Low Power Processing,” Symp. on VLSI Circuits, June 1996.
[8] S. Hauck, “Asynchronous Design Methodologies: An Overview,” Proceedings of the IEEE, 83 (1), January 1995.
[9] R. Hegde and N. Shanbhag, “Energy-efficient signal processing via algorithmic noise-tolerance,” 1999 International Symposium on Low-Power Electronics and Design (ISLPED-99), August 1999.
[10] Intel Corp., “Intel SpeedStep Technology,” http://www.intel.com.
[11] T. Kehl, “Hardware Self-Tuning and Circuit Performance Monitoring,” 1993 Int’l Conference on Computer Design (ICCD-93), October 1993.
[12] T. Liu and S. Lu, “Performance Improvement with Circuit-Level Speculation,” 33rd Annual International Symposium on Microarchitecture (MICRO-33), December 2000.
[13] M. Miller, K. Janik, and S.-L. Lu, “Non-Stalling Counterflow Microarchitecture,” 4th International Symposium on High Performance Computer Architecture (HPCA-4), February 1998.
[14] T. Mudge, “Power: A first class design constraint,” Computer, vol. 34, no. 4, April 2001, pp. 52-57.
[15] K. Ogata, “Modern Control Engineering,” 4th ed., Prentice Hall, 2002.
[16] T. Pering, T. Burd, and R. Brodersen, “The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms,” Proceedings of Int’l Symposium on Low Power Electronics and Design 1998, pp. 76-81, June 1998.
[17] J. Rabaey, “Digital Integrated Circuits,” Prentice Hall, 1996.
[18] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically Characterizing Large Scale Program Behavior,” 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), October 2002.
[19] R. Sproull, I. Sutherland, and C. Molnar, “Counterflow Pipeline Processor Architecture,” Sun Microsystems Report SMLI-TR-94-25, April 1994.
[20] Transmeta Corporation, “LongRun Power Management,” http://www.transmeta.com/technology/architecture/longrun.html.
[21] A. Uht, “Uniprocessor Performance Enhancement Through Adaptive Clock Frequency Control,” 2003 International Conference on Advances in Infrastructure for e-Business, e-Education, e-Science, e-Medicine, and Mobile Technologies on the Internet (SSGRR 2003w), January 2003.
[22] A. Uht, “Achieving Typical Delays in Synchronous Systems via Timing Error Toleration,” University of Rhode Island TR-032000-0100, March 2000.
[23] S. Unger, “Asynchronous Sequential Switching Circuits,” New York: Wiley-Interscience, John Wiley & Sons, Inc., 1969.
[24] G. Wolrich, E. McLellan, L. Harada, J. Montanaro, and R. Yodlowski, “A High Performance Floating Point Coprocessor,” IEEE Journal of Solid-State Circuits, 19 (5), October 1984.
[25] Xilinx Corporation, “Virtex-II Platform FPGA,” http://www.xilinx.com/products/tables/fpga.htm#v2