
2012 SC Companion: High Performance Computing, Networking Storage and Analysis

Improving Energy Efficiency through Parallelization and Vectorization on Intel® Core™ i5 and i7 Processors

Juan M. Cebrián and Lasse Natvig
Dept. of Computer and Information Science (IDI)
NTNU, Trondheim, NO-7491, Norway
Email: [email protected], [email protected]

Jan Christian Meyer
High Performance Computing Section, IT Dept.
NTNU, Trondheim, NO-7491, Norway
Email: [email protected]

Abstract—Driven by the utilization wall and the Dark Silicon effect, energy efficiency has become a key research area in microprocessor design. Vectorization, parallelization, specialization and heterogeneity are the key design points to deal with the utilization wall. Heterogeneous architectures are enhanced with architectural optimizations, such as vectorization, to further increase the energy efficiency of the processor, reducing the number of instructions that go through the pipeline and leveraging the usage of the memory hierarchy. AMD® Fusion™ or Intel® Core™ i5 and i7 are commercial examples of this new generation of microprocessors. Still, there is a question to be answered: how can software developers maximize the energy efficiency of these architectures?

In this paper, we evaluate the energy efficiency of different processors from the Intel Core i5 and i7 family, using selected benchmarks from the PARSEC suite with variable core counts and vectorization techniques to quantify energy efficiency under the Thermal Design Power (TDP). Results show that software developers should prioritize vectorization over parallelization whenever possible, as it is much better in terms of energy efficiency. When using vectorization and parallelization simultaneously, scalability of the application can be reduced drastically, and may require different development strategies to maximize resource utilization in order to increase energy efficiency. This is especially true in the server market, where we can find more than one processor per board. Finally, when comparing on-chip and "at the wall" energy savings, we can see variations from 5 to 20%, depending on the benchmark and system. This high variability shows the need to develop a more detailed model to predict system power based on on-chip power information.

I. INTRODUCTION

Recent years have seen an increasing focus on energy efficiency in new computer designs. Its importance ranges from improving sensor networks where environmental factors prohibit battery replacement, through increased operation time of mobile devices, to reducing the levels of heat, noise, and cost of operations in general purpose and server systems.

In high performance processors, doubling the number of processing cores also approximately doubles power dissipation. While technology scaling trends partly limited dynamic power increase for several hardware generations, the increasing complexity of interconnects and cache memory results in higher power dissipation as more cores are integrated on a die. Furthermore, since process technologies under 22nm will benefit less from voltage scaling than previously, future many-core designs will be severely constrained by power and temperature.

The increasing number of on-chip components is leading to the "Dark Silicon" effect, as the resulting power density does not permit them all to be in simultaneous use. This utilization wall [1] creates a design space of chips which are collections of heterogeneous components, each carefully tuned to execute particular tasks efficiently.

Heterogeneous parallel computing increases the complexity faced by software developers, as they must not only develop parallel applications, but also decide which type of processor to apply to each type of calculation.

Several companies in the field have introduced features to provide green computing. Intel introduced Model Specific Registers (MSRs) in the Sandy Bridge™ and Ivy Bridge™ processor families to enable processor energy estimates [2]. This represents an improvement over earlier techniques, which relied on performance counters. These processors also implement Turbo Boost Technology™ (TBT) [3], which can overclock a single core and turn off the rest, improving single-thread performance as well as reducing idle power. Finally, power-related tools such as the Intel Energy Checker [4] provide energy information to software developers, enabling them to tune applications with respect to it. AMD has introduced per-core voltage regulators to increase energy efficiency. NVIDIA® has recently enhanced power control mechanisms and implemented power capping in the Kepler™ GPU family. IBM® has simplified core designs to increase the performance per watt ratio of the BlueGene/Q™. ARM® targets the desktop market with the A15™ and their big.LITTLE technology, which switches between simple and complex cores depending on the workload. Finally, Calxeda® promises 5-Watt servers for cloud applications, using ARM based processors.

The importance of energy efficiency in High Performance Computing (HPC) is creating a convergence with embedded systems, in that both market segments now have it as a major design goal.
The Mont Blanc project, which is part of the European Exascale Software Initiative (EESI), exemplifies this with the goal of developing a European scalable and power efficient HPC platform based on low-power embedded technology [5]. Their objective is to deploy a prototype HPC system based on currently available energy-efficient embedded technology that scales to 50 PFLOPS on 7 MWatt, to compete with Green500 leaders by 2014 [6]. Green500 ranks supercomputers by their energy efficiency using the FLOPS/W metric for the LINPACK benchmark. Based on simplified IBM Power processors, IBM BlueGene/Q has achieved 2.1 GFLOPS/W, topping the June 2012 list.

This paper presents on-chip and external power and energy analyses of application benchmarks on Intel hardware, describing energy efficiency effects of resource utilization, vectorization, and parallel scaling. Its main contributions are:

• a power and energy analysis of the PARSEC 2.1 benchmark suite,
• a comparison of "on-chip" and "at the wall" energy measurements,
• an analysis of temperature effects on power measurements,
• a resource utilization study with respect to TDP, and
• an analysis of the energy efficiency impacts of vectorization and threading.

The paper is organized as follows: Section II describes some of the Core i5 and i7 optimizations, motivates and defines the selection of energy efficiency metrics, and introduces the selected benchmarks. Section III explains the energy evaluation environment, and provides details of the analyzed platforms. It also discusses the energy results. Section IV describes the related work, before Section V concludes the evaluation and suggests directions for further research.

II. BACKGROUND

A. Hyper-Threading, Turbo Boost, SSE and AVX

Our energy efficiency analysis is performed using direct hardware measurements. Three versions of the Core i7 processor are tested: Sandy Bridge, Sandy Bridge-EP and Ivy Bridge. A mobile processor from the i5 family is also tested. Processor details are provided in Section III-B.

All processor models implement Intel Hyper-Threading™ technology, which presents the operating system with two logical cores that share the workload when possible. This is a form of simultaneous multi-threading which exploits the multiple instruction issue capability of super-scalar architectures. Logical cores appear as independent processors to software, permitting processes or threads to be scheduled for simultaneous execution.

Intel Turbo Boost Technology (TBT) is activated when the operating system requests the highest processor performance state (P0). It monitors the number of active cores, estimated power dissipation, and processor temperature. When the processor is operating below limits and the workload demands additional performance, the processor frequency dynamically increases until its upper limit is reached. Multiple algorithms operate in parallel, managing current, power, and temperature to optimize performance and energy efficiency. Intel Turbo Boost technology allows the processor to operate at a power level above its rated upper power limit (TDP) for brief intervals.

Intel Core i5 and i7 processors feature SSE and AVX vector instructions. SSE (Streaming SIMD Extensions) was introduced in 1999, and is implemented in most contemporary x86 CPUs. Our experiments use SSE4.2, which is the latest version. Apart from a small number of instructions to rearrange consecutive values, these SIMD instructions provide similar operations to scalar instructions, but they work with dedicated, enlarged registers that fit multiple data elements. Advanced Vector Extensions (AVX) were introduced in 2008. AVX registers extend the 128 bit SSE registers with an additional 128 bits, theoretically doubling the throughput [7]. Both SSE and AVX are programmed using intrinsics, assembly, or automatic compiler vectorization. Intel claims that use of AVX improves energy efficiency [8], and we see it quantified in our results.

The total energy spent executing an AVX instruction over 256 bits is lower than that of executing several simpler instructions over 32-64 bits. Vectorization also reduces instruction and data cache pressure, and the number of instructions occupying the pipeline, further improving energy efficiency. On the other hand, vectorization has the inherent drawback of increasing latency, and potentially reduces programmability and software portability.

B. Performance and Energy Metrics

Power divides into dynamic and static dissipation, with dynamic dissipation being proportional to usage due to charge and discharge every time a structure is accessed, and static (leakage) dissipation deriving from gate and subthreshold leakage currents which flow even when the transistor is not in use. As process technology advances toward deep submicron, the static power component becomes a serious problem, especially for large on-chip array structures such as caches or prediction tables. For current technologies (under 32nm), even with gate leakage under control by using high-k dielectrics, subthreshold leakage has a great impact on the total power dissipated by processors [9].

There is a trade-off between the partly conflicting goals of high performance and low energy consumption. Comparing systems based on energy consumption alone can motivate the use of simple cores with low frequency, since energy is the product of power and execution time. The Energy-Delay Product (EDP) places greater emphasis on performance, and corresponds to the reciprocal of performance per energy unit. Different measures are appropriate to different cases when studying energy efficiency, and Rivoire et al. [10] give a readable introduction to the pros and cons of various metrics. $Performance^N/Watt$ is among the most general, as it allows varying the emphasis of high performance versus low energy consumption. N = 0 implies a focus on power alone, while N = 2 corresponds to EDP.
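To make the relationship between these metrics concrete, the following sketch derives energy, EDP and the Performance^N/Watt family from a measured runtime and average power. It is an illustration of the definitions above, not code from our measurement infrastructure, and the convention of taking performance as the reciprocal of runtime is one common choice.

```c
#include <math.h>
#include <stdio.h>

/* Energy is the product of average power and execution time. */
static double energy_joules(double avg_power_w, double runtime_s) {
    return avg_power_w * runtime_s;
}

/* EDP = energy * delay; lower is better. */
static double edp(double avg_power_w, double runtime_s) {
    return energy_joules(avg_power_w, runtime_s) * runtime_s;
}

/* Performance^N / Watt, taking performance as 1/runtime.
 * N = 0 compares power alone; N = 2 ranks systems by the
 * reciprocal of EDP.                                       */
static double perf_n_per_watt(double avg_power_w, double runtime_s, int n) {
    return pow(1.0 / runtime_s, n) / avg_power_w;
}

int main(void) {
    double p = 31.0, t = 12.5;  /* hypothetical sample: 31 W for 12.5 s */
    printf("E = %.1f J, EDP = %.1f Js, perf^2/W = %.6f\n",
           energy_joules(p, t), edp(p, t), perf_n_per_watt(p, t, 2));
    return 0;
}
```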

There is no substantial difference in the architecture and operating frequency of the different processor versions in our study, so we show both performance and EDP numbers. Energy measurements are obtained both from the energy consumption fields of the non-architectural Model Specific Registers (MSRs) made available by the Running Average Power Limit (RAPL) interface, and with an external power meter (Yokogawa WT210) described in Section III-A.

C. Power Control Mechanisms

Power control is part of the Advanced Configuration and Power Interface (ACPI), which has the following states for microprocessors:

• G0: Working (enables processor power states, or C-states)
  – C0: normal execution (enables performance states, or P-states)
    ∗ P0: highest performance, highest power
    ∗ P1
    ∗ Pn
  – C1: idle
  – C2: lower power but longer resume latency than C1
  – C3: lower power but longer resume latency than C2
• G1: Sleeping (e.g., suspend, hibernate)
  – Sleep States (S-states): S0, S1, S2, S3, S4
• G2: Soft off (S5)
• G3: Mechanical off

A performance state (P-state) defines the current pair of voltage and frequency the processor cores operate at. Among the energy saving features of the Core i5 and i7 processors we evaluate in this paper, we wish to highlight the following:

• Enhanced Intel SpeedStep™. This is Intel's implementation of dynamic voltage and frequency scaling (DVFS).
• Clock gating. As discussed in Section I, resources must be gated to prevent the processor from burning out, due to the utilization wall.
• Turbo Boost Technology. This is Intel's implementation of core power gating.

The next subsections provide greater detail on how the different power saving mechanisms work.

1) Dynamic Voltage and Frequency Scaling: Dynamic Voltage and Frequency Scaling (DVFS) has been widely used to reduce energy consumption in microprocessors since the early 1990s [11]. DVFS relies on how dynamic power dissipation $P_D \approx V_{DD}^2 \cdot f$ depends on both voltage and frequency, scaling these terms to save dynamic power [12], [13].
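As a concrete illustration of this quadratic dependence (a worked example of our own, using round numbers rather than measured values): if DVFS lowers the clock by 20% and the supply voltage can be scaled down by the same factor, dynamic power drops to roughly half,

$$P_D' \approx (0.8\,V_{DD})^2 \cdot (0.8\,f) = 0.512\,V_{DD}^2 \cdot f \approx 0.51\,P_D,$$

while the frequency reduction costs at most 20% in runtime. This is why DVFS can trade a modest slowdown for a large dynamic power reduction.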
As process technology goes into deep submicron, the margin between the supply voltage $V_{DD}$ and the threshold voltage $V_T$ is lowered. Among other undesirable effects, processor reliability is reduced as this margin decreases. Moreover, transistor delay (switching speed) is given by $\delta \approx 1/(V_{DD} - V_T)^\alpha$, with $\alpha > 1$. This means that $V_{DD}$ can be lowered for DVFS as long as the margin between $V_{DD}$ and $V_T$ remains constant, in order to obtain the desired speed. In addition to further reducing reliability, however, reducing $V_T$ also causes growth in leakage power, which depends exponentially on $V_T$. This makes leakage an important source of power dissipation at scales below 65nm [14]–[16].

In terms of power and performance, multicore architectures exhibit some peculiarities when running parallel workloads. It is typical for such workloads to feature periodic thread synchronization (e.g. for communication purposes), causing delayed threads to impede the progress of an entire application.

2) Clock Gating and Power Gating: Clock gating [17] adds an AND gate to the clock signal of a specific unit or structure, augmenting it with a control signal to disable clocking of units which are not needed for one or more cycles.

Power gating [18] is a hardware mechanism which turns off the supply voltage of a circuit block, lowering leakage power dissipation by applying a sleep signal to the gate of its header or footer transistor. Power gating can lower the leakage power dissipation of the circuit to almost zero [19]. Power gating is not free, however, as it implies area and energy overheads. The main source of area overhead is the header or footer transistor, which depends on the overall switching current of the target circuit; Iyer [20] estimates this dependency to a factor of 3. To compensate for this overhead, the circuit block must stay in sleep mode long enough to amortize the energy used in switching the header or footer transistor. This time is determined by the circuit design limits [21].

D. Selection of Benchmarks

For our energy efficiency evaluation we use the PARSEC 2.1 benchmark suite. PARSEC workloads were selected to include different combinations of parallel models, machine requirements and runtime behaviors. At the time of writing, the PARSEC distribution has been downloaded over 4,000 times. Total usage of PARSEC has consistently risen and reached 36% of all analyzed publications in 2011, and the PARSEC summary paper is cited in more than 280 publications.

We started our research by performing automatic vectorization on the whole PARSEC benchmark suite, using both Intel ICC 12.1.0 and GCC 4.7. Results showed almost no improvement for most of the benchmarks. This is normal if the code is not written in a way that lets the compiler find code sections that can be vectorized [22].
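As an illustration of the kind of source-level property this refers to (a minimal example of our own, not code taken from PARSEC): compilers typically cannot prove that arrays accessed through plain pointers are independent, and that alone can block vectorization or force it behind runtime checks.

```c
#include <stddef.h>

/* Hard to auto-vectorize: 'out' may alias 'a' or 'b', so the compiler
 * must assume possible loop-carried dependences, either keeping scalar
 * code or guarding the vector path with runtime overlap checks.       */
void saxpy_plain(float *out, const float *a, const float *b,
                 float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = s * a[i] + b[i];
}

/* Auto-vectorizable: 'restrict' promises that the arrays do not
 * overlap, so a compiler with vectorization enabled (e.g. the
 * -ftree-vectorize flag used in Section III-D) can emit SSE/AVX
 * code for the loop body directly.                               */
void saxpy_restrict(float *restrict out, const float *restrict a,
                    const float *restrict b, float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = s * a[i] + b[i];
}
```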
In the end, the BlackScholes and Streamcluster codes were selected for further study (manual vectorization), because of their amenability to vectorization. These benchmarks will be our case study for the energy benefits of vectorization and parallelization. Our goal is to perform manual vectorization and extend our energy evaluation to the whole benchmark suite once PARSEC 3.0 is released.

The PARSEC 2.1 benchmark suite includes an "SSE" version of the Blackscholes benchmark. After running and testing this implementation we experienced a small reduction in runtime (around 10%) over the sequential code, still too low for the potential benefits of SSE. We decided to profile the application using HPCToolkit [23] to identify potential bottlenecks. This identified several code sections which serialized the SSE implementation, and we decided to perform manual vectorization of the suitable code sections, for both SSE and AVX. This new version obtained a reduction in runtime of 70% for SSE and 77% for AVX.
For the Streamcluster benchmark we directly profiled the code using HPCToolkit, and found that 95% of the execution time was spent in a vectorizable function. The result is similar to that obtained in [24]. This function calculates the squared Euclidean distance between two points. Our SSE and AVX implementations reduce runtime by 40 and 56% respectively. We repeated the profiling to further improve the vectorization results, and found that most of the execution time was now spent on reading vector data (47%). Few operations are performed on the loaded data, so the benchmark becomes memory bound. No additional adjustments were made, as further optimizations would not closely relate to vectorization.
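For reference, a distance kernel of this shape can be vectorized with SSE intrinsics roughly as follows. This is a simplified sketch of the idea, assuming a point dimension that is a multiple of 4 and 16-byte aligned inputs; it is not the exact code used in our experiments (which, as noted in Section IV, relies on SSE3-specific horizontal adds for the final reduction).

```c
#include <xmmintrin.h>  /* SSE */

/* Scalar version, as found in Streamcluster: squared Euclidean
 * distance between two points of dimension dim.                */
float dist_scalar(const float *a, const float *b, int dim) {
    float result = 0.0f;
    for (int i = 0; i < dim; i++) {
        float d = a[i] - b[i];
        result += d * d;
    }
    return result;
}

/* SSE sketch: processes 4 floats per iteration, keeping four partial
 * sums that are combined after the loop. Assumes dim % 4 == 0 and
 * 16-byte aligned inputs.                                           */
float dist_sse(const float *a, const float *b, int dim) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < dim; i += 4) {
        __m128 d = _mm_sub_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The AVX variant follows the same pattern with 256-bit registers, processing 8 floats per iteration, which is where the additional runtime reduction reported above comes from.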
Benchmark details:

• Blackscholes is an Intel RMS (recognition, mining and synthesis) benchmark [25]. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). The Black-Scholes equation must be computed numerically.
• Streamcluster is an Intel RMS kernel developed by Princeton University. It solves the online clustering problem. This kernel represents data mining algorithms and the prevalence of problems with streaming characteristics.

III. EXPERIMENTAL RESULTS

A. Measuring Power: Processor Power vs. "At the Wall" Power

Until recently, empirical power estimates were based on performance counters provided by the processor, but the second generation of Core i5 and i7 extends the MSRs with energy information, enabling applications to examine their energy efficiency without the use of external devices. Instrumenting this introduces an overhead proportional to the sampling rate. It may also distort measurements of actual energy consumption if, e.g., the measures of CPU and GPU power are isolated: every other machine component also dissipates power, and needs to be accounted for. A straightforward solution to this problem is to estimate full machine power as the sum of the processor power and a general power estimate. Such estimates are particularly useful in scenarios involving HPC clusters, where scaling the cost of power measuring equipment to hundreds or thousands of machines quickly becomes prohibitive. We wish to address this issue in future work.

In order to retrieve per-processor energy information we use the RAPL MSR interface. Bits 12:8 of the MSR_RAPL_POWER_UNIT register describe the granularity of the energy values, denoted ESU (Energy Status Units). The default value is $2^{-16}$ J ≈ 15.3µJ. Consumed energy is read from bits 31:0 of the MSR_PKG_ENERGY_STATUS register, which has a wrap-around time of about 60 seconds under high processor load [2]. To ensure we never overflow the energy counters we performed tests on both "native" and "simlarge" inputs, and compared the proportional correctness of the results. We could only find one benchmark that actually overflowed the counters: the sequential version of Raytrace, running on the SB-EP machine.
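On Linux, these registers can be sampled from user space through the msr driver. The sketch below illustrates the unit conversion and wrap-around handling described above; the register offsets (0x606 and 0x611) are the documented Sandy Bridge family addresses, but treat the snippet as an illustration rather than production measurement code.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606
#define MSR_PKG_ENERGY_STATUS 0x611

/* Read one 64-bit MSR of cpu0 via the Linux msr driver
 * (requires root and 'modprobe msr').                   */
static uint64_t rdmsr(int fd, uint32_t reg) {
    uint64_t val = 0;
    pread(fd, &val, sizeof(val), reg);
    return val;
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    /* Bits 12:8 give the Energy Status Unit: one count equals
     * 1/2^ESU joules, 2^-16 J by default.                      */
    unsigned esu = (rdmsr(fd, MSR_RAPL_POWER_UNIT) >> 8) & 0x1f;
    double joules_per_unit = 1.0 / (double)(1u << esu);

    /* The package energy counter is 32 bits wide and wraps in about
     * a minute under load, so sample often and accumulate deltas;
     * unsigned subtraction makes each delta wrap-safe.             */
    uint32_t prev = (uint32_t)rdmsr(fd, MSR_PKG_ENERGY_STATUS);
    double total = 0.0;
    for (int i = 0; i < 10; i++) {
        sleep(1);
        uint32_t now = (uint32_t)rdmsr(fd, MSR_PKG_ENERGY_STATUS);
        total += (uint32_t)(now - prev) * joules_per_unit;
        prev = now;
    }
    printf("package energy over 10 s: %.2f J\n", total);
    close(fd);
    return 0;
}
```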
Fig. 1. Yokogawa setup (Source: Intel Energy Checker Device Kit User Guide).

The device selected to measure power "at the wall" was Yokogawa's WT210. This device is connected directly between the power supply of the machine we want to measure and the power source (the setup can be seen in Figure 1). By means of a custom serial cable, the Yokogawa WT210 can communicate with a logging machine (called Yokogawa-Interface in our setup), using proprietary software from Yokogawa or third party software like Intel Energy Checker. Due to the lack of Linux support we couldn't test Yokogawa's proprietary software, but according to the documentation, it should provide measurements similar to those obtained by Intel Energy Checker.

Intel Energy Checker provides an interface between applications and power and temperature measuring devices. This is done by exchanging data using "productivity links" (PLs). These PLs can be seen as objects that store different data provided by external devices, and can be accessed by applications with minimal effort.

The computer running the benchmarks communicates with Yokogawa-Interface using ssh to start/stop the Intel Energy Checker data retrieval software (Intel Energy Server), and to start/stop data logging to disk (Intel's pl_csv_logger).

B. Intel Core i5 and i7 Architecture

In this section, we present the main architectural aspects of the Core i5 and i7 processors. Figures 2 and 3 show the core layout for the analyzed architectures. The analyzed i7-2600K Sandy Bridge (SB) and Core i7-3770K Ivy Bridge (IB) processors essentially share the same architecture, number of cores (4 physical cores with 8 threads) and memory hierarchy (Table I), but the latter has an improved GPU (Intel HD 4000 vs 3000), and is scaled down from 32nm to 22nm. Intel's 22nm process introduces tri-gate transistors, to reduce leakage power and improve overall energy efficiency. Representing the mobile segment, we will be testing a Core i5 2430M processor. Built on the same technology as the i7-2600K, the i5 trades two cores for area reduction, simplifies the architecture to minimize power and works at a lower frequency. The final analyzed processor is a dual socket Xeon E5-2670 Sandy Bridge EP (SB-EP).
It is a version of the Sandy Bridge processor targeted towards the server market, so the GPU has been replaced by 4 additional cores, and the L3 has been increased from 8MB to 20MB. A total of 16 cores, 32 threads and 40MB of L3 is available when using both processors. All cores were clocked at their maximum rate: 3.4GHz for the SB and IB machines, 2.4GHz for the Core i5 and 2.6GHz for the SB-EP machine. TBT is disabled for comparative purposes. Cache latencies are taken from [26]. TDP is 77W for the IB processor, 95W for the SB, 35W for the i5 and 115W for the SB-EP.

TABLE I
CACHE INFORMATION FOR INTEL CORE i5 AND i7 PLATFORMS, 3.4GHZ FOR SB AND IB, 2.6GHZ FOR SB-EP AND 2.4GHZ FOR THE i5.

Cache                 Size         Sharing   Ways of associativity   Line size   Latency (cycles)
Level 1 Instruction   32KB         Private   8                       64B         4
Level 1 Data          32KB         Private   8                       64B         4
Level 2               256KB        Private   8                       64B         12
Level 3               8MB (20MB)   Shared    16                      64B         23-40

Fig. 2. Sandy Bridge (bottom) vs Ivy Bridge (top) (Source: Intel).

Fig. 3. Sandy Bridge EP (Source: Intel).

C. Temperature Effects on Power

This section provides insights on how temperature affects power measurements, and why the processor should be heated up before measurements start, especially when reading from the MSRs.

While dynamic power dissipation has been the predominant factor in CMOS power dissipation for many years, leakage power has been increasingly prominent in recent technologies. Representing roughly 20-36% or more of the power dissipation in current designs, its proportion is expected to increase in the future. Leakage power can come from several sources, including gate leakage and sub-threshold leakage.

Butts and Sohi [27] created a leakage power model based on the BSIM3v3.2 MOSFET transistor model:

$$I_{Dsub} = I_{s0} \cdot \frac{W}{L} \cdot \left(1 - e^{-\frac{V_{DS}}{v_t}}\right) \cdot e^{\frac{V_{GS} - V_T - V_{off}}{n \cdot v_t}} \quad (1)$$

In this equation $I_{Dsub}$ is the sub-threshold drain current, $V_{DS}$ is the voltage across the drain and the source, and $V_{GS}$ the voltage across the gate and the source terminal. $V_{off}$ is an empirically determined model parameter and $v_t$ is a physical parameter in exponential proportion to temperature. The term $n$ encapsulates various device parameters. The term $I_{s0}$ depends on the transistor geometry, width $W$ and length $L$.
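Transcribing Equation (1) into code makes its temperature behavior easy to explore. The sketch below is our own illustration: the device parameters are placeholders, not values fitted to any processor in this study, and we read $v_t$ as the thermal voltage $kT/q$, one common interpretation in this model family.

```c
#include <math.h>
#include <stdio.h>

/* Sketch of Equation (1), the Butts-Sohi sub-threshold leakage model.
 * All device parameters are illustrative placeholders.               */
static double i_dsub(double temp_kelvin) {
    const double k_over_q = 8.617e-5;     /* Boltzmann const / charge, V/K */
    double vt = k_over_q * temp_kelvin;   /* thermal voltage kT/q          */

    const double is0  = 1e-6;   /* geometry-dependent prefactor (A)  */
    const double w_l  = 10.0;   /* width/length ratio W/L            */
    const double vds  = 1.0;    /* drain-source voltage (V)          */
    const double vgs  = 0.0;    /* gate-source voltage: device "off" */
    const double vth  = 0.3;    /* threshold voltage V_T (V)         */
    const double voff = -0.08;  /* empirical offset parameter (V)    */
    const double n    = 1.5;    /* sub-threshold swing parameter     */

    return is0 * w_l * (1.0 - exp(-vds / vt))
               * exp((vgs - vth - voff) / (n * vt));
}

int main(void) {
    /* Leakage grows quickly with temperature: compare the "cold" (30C)
     * and "warm" (55C) operating points discussed in this section.    */
    printf("I_Dsub(303 K) = %.3e A\n", i_dsub(303.15));
    printf("I_Dsub(328 K) = %.3e A\n", i_dsub(328.15));
    return 0;
}
```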
According to this model, leakage power has an exponential dependence on temperature. This reduces the energy efficiency of the processor as the temperature increases, due to the extra leakage power dissipated. In our experiments, we discovered a 3 to 5% variation in the power measurements between a "cold" (30C) processor and a "warm" (50-55C) processor. This is especially important when running a big set of benchmarks, as they will be running at different temperatures as the processor heats up, adding unwanted variability to the results. In our setup, we heat up the processor before starting the benchmark to prevent this problem. However, this is not critical when testing "native" inputs that run for several minutes on the processor. These inputs reduce variability in the results, as the processor has enough time to heat up during program execution. For other inputs, such as "simsmall" or "simlarge", it is recommended to always heat up the processor before running the benchmark.

D. Evaluation Methodology

In this section we describe some details of our evaluation infrastructure. The PARSEC 2.1 benchmark suite has been compiled on all platforms using GCC 4.7 and the -O2 flag, on an Ubuntu 12.04 distribution running Kernel 3.2.30. The -ftree-vectorize flag was only used for the tests related to auto vectorization. Runs were performed sequentially 5 times, using both the native and simlarge input sizes of the PARSEC benchmark suite. The results are reproducible and stable, with a relative standard deviation of less than 3% for both problem sizes. The processor is heated up to 50 degrees Celsius before running the benchmarks, to minimize variability in the energy measurements. Systems are running with minimal console configurations, to reduce OS influence on the results.
[Figure: two panels; Y axis: Normalized Runtime.]
Fig. 4. Normalized runtime to Seq. Ivy Bridge at 1.6GHz for different combinations of processor, optimization and frequencies. Benchmarks: Blackscholes (left), Streamcluster (right). The different bars represent the number of threads (1 to 32).

[Figure: two panels; bars for System and Processor power; Y axis: Average Total Power (W).]
Fig. 5. Average total power for different combinations of processor, optimization and frequencies. Benchmarks: Blackscholes (top), Streamcluster (bottom). The dashed line represents the TDP of each processor (black bar).

E. Energy Evaluation and Discussion

In this section we show the main results obtained using our experimental setup. Due to space limitations we only report results for the BlackScholes and Streamcluster benchmarks. We are especially interested in these two benchmarks as they are our case studies for (manual) vectorization effects on power dissipation (Section II-D). Still, we have executed all PARSEC benchmarks on all the analyzed platforms, both with and without automatic vectorization, to analyze the effects of auto-vectorization and parallelization on these platforms (results can be provided on demand).

Figure 4 shows the normalized runtime (Y axis) of Blackscholes and Streamcluster on different pairs of platform and optimization (X axis), and for different numbers of threads (bars). Results are normalized to the runtime of the Ivy Bridge running on a single thread at 1.6GHz. We selected this machine as it is the newest, and theoretically the most energy efficient. We also tested the Sandy Bridge processor running at 1.6GHz, but it is omitted from all figures, as it did not provide significantly different information compared to the IB at 1.6GHz.

The different combinations of threads, vectorization, processor and frequency clearly show the benefits of vectorization over threading. For the same operating frequency, the Ivy Bridge processor using SSE reduces execution time to 30% of the original. Using AVX further reduces the execution time (22%), while their "equivalent" in threading (4-8 threads) reduces runtime to 35% and 28% respectively. When feasible, vectorization not only makes the processor faster than threading, but also has almost no impact on power. Figure 5 shows how the processor power variations are minimal when using SSE or AVX compared to the sequential version (< 0.3%). In fact, average power is slightly reduced due to the reduction of total instruction count and cache misses (i.e., Blackscholes - Ivy 3.4 - Seq. with 4 threads uses 31W vs. Ivy 3.4 - AVX with 4 threads, which uses 22W).

[Figure: two panels; Y axis: Normalized EDP, logarithmic scale.]
Fig. 6. Normalized on-chip EDP to Seq. Ivy Bridge at 1.6GHz for different combinations of processor, optimization and frequencies. Benchmarks: Blackscholes (left), Streamcluster (right). The different bars represent the number of threads (1 to 32).

We can also see good scalability of both benchmarks up to 4-8 threads. This number varies depending on whether we use real physical cores or Hyper-Threading (all processors have Hyper-Threading enabled, so the last pair of processor/thread combinations do not map to physical cores). After that point, even "native" inputs have scalability problems, as sequential code dominates over parallel. However, we have identified two interesting findings during our research that require further investigation.

Dashed lines in Figure 5 represent the TDP of each processor (black bar). It can be seen that none of the processors gets close to its TDP (77W for IB, 95W for SB, 35W for i5 and 115W x 2 for SB-EP), except for the i5 machine. This information is important, as it tells us that further optimizations are possible without forcing the processor to throttle down due to temperature constraints.

The first thing to mention is that there was a 5-10% slowdown between the Sandy Bridge and the Ivy Bridge architectures, favoring the older processor (Sandy Bridge). All components on these machines are different, but the main characteristics are similar (same operating frequency, same memory and memory speed, etc.), so we investigated the cause of this slowdown on the Ivy Bridge. The first thing we focused on was TBT. We had set the frequency manually to 3.4GHz and 1.6GHz before the experiments, so TBT should never fire up; however, to make sure this wasn't the problem, we tried again after disabling TBT on both machines at the BIOS level. Results didn't change. We then decided to focus on the software side. The GCC version was the same on both machines (4.7), but the OS of the Sandy Bridge (OpenSuse 12.1, Kernel 3.1.0) was different from that of the Ivy Bridge (Ubuntu 12.04, Kernel 3.2.30). We tried the Sandy Bridge machine with the exact same Ubuntu version and found that the runtime difference disappeared. However, even with this performance slowdown, Ubuntu obtained clear energy savings compared to OpenSuse (between 2-10% less energy despite the longer execution time). Final results are for all machines running Ubuntu and Kernel 3.2.30. We will have to analyze in more detail the differences between both OS/kernel combinations to find out what parameters are causing this slowdown in execution runtime, but that is out of the scope of this paper.

Another noteworthy result is that obtained by the Sandy Bridge EP running Streamcluster (Figure 4, right). We can see an increment in execution time of almost 30% when compared with the 1.6GHz Ivy Bridge. After some research we found out that binding application threads to specific cores solves the problem. The SB-EP then obtains speedups somewhere in between the 1.6 and 3.2GHz Sandy Bridge. We don't think that binding threads in the source code is the correct solution to this problem, so we decided to leave the results with the default configuration and code. We will try to find out what is causing this behavior at an OS/kernel level, and solve it in our future work.
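For completeness, the kind of explicit binding we tested looks roughly like the following minimal sketch using the Linux affinity API; it illustrates the workaround, and is not code from the benchmarks themselves.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU. Used only to verify that
 * explicit placement removes the SB-EP slowdown; the published results
 * keep the default scheduler behavior.                                */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg) {
    pin_to_cpu(*(int *)arg);   /* e.g. thread i -> core i */
    /* ... benchmark work ... */
    return NULL;
}

int main(void) {
    enum { NTHREADS = 4 };
    pthread_t th[NTHREADS];
    int cpu[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        cpu[i] = i;
        pthread_create(&th[i], NULL, worker, &cpu[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}
```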
In order to see if the vectorization/threading speedup pays off for the extra power dissipation, we now present some EDP (Energy Delay Product) numbers. Figure 6 shows the on-chip EDP (obtained from the processor MSRs) normalized against the on-chip EDP of the sequential implementation running on the Ivy Bridge at 1.6GHz (logarithmic scale). For the Blackscholes benchmark, all the processor configurations benefit from threading and vectorization in terms of EDP. This is mainly because of two reasons. First, the Blackscholes benchmark experiences speedups on all the analyzed processors when increasing threads or using vectorization, and this reduction in runtime benefits the EDP metric quadratically. Second, the power saving mechanisms included in the analyzed processors minimize power dissipation when resources are not used. For the Streamcluster benchmark (Figure 6, right), the SB and IB running at full speed suffer from data starvation, so using many threads does not benefit performance when using SSE or AVX, and consequently does not benefit EDP. Again, it is important to note that due to the power saving mechanisms implemented in these architectures, even if the extra threads do not provide performance, they do not affect power dissipation enough to cause an increment in EDP. We can also conclude that when measuring processor power only, the most energy efficient processor is the IB (not only in EDP but also in terms of energy), outperforming even the low power Core i5 processor. This is mainly because the extra hardware complexity and frequency of the IB pay off in terms of runtime. It is also important to notice that the SB-EP obtains really bad results in terms of EDP. This is mainly due to the low scalability of the applications when running on more than 4-8 threads.

[Figure: two panels; Y axis: Normalized EDP, logarithmic scale.]
Fig. 7. Normalized total EDP ("at the wall") to Seq. Ivy Bridge at 1.6GHz for different combinations of processor, optimization and frequencies. Benchmarks: Blackscholes (left), Streamcluster (right). The different bars represent the number of threads (1 to 32).

[Figure: two panels; Y axis: Energy Saving Difference (MSR - System) in %.]
Fig. 8. Difference between normalized on-chip energy and total energy. Benchmarks: Blackscholes (left), Streamcluster (right). The different bars represent the number of threads (2 to 32).

The SB-EP machine is running two SB-EP processors on the same board, thus doubling the base power dissipation of the application when running (Figure 5). In order to overcome this additional power dissipation we require high resource utilization, something that does not happen for the analyzed benchmarks and inputs. If we look at the whole PARSEC benchmark suite, we can identify 3 benchmarks that could benefit from running on this platform: Swaptions, Vips and x264 using 16 threads. In addition, when running 32 threads, Fluidanimate can also obtain significant energy savings on this platform.

When evaluating the energy efficiency of a system we have to take into account all hardware, not only the processor (i.e., motherboard, memory, etc.). This additional energy can alter the conclusions obtained from looking at the processor energy readings on some platforms. Figure 7 shows the EDP at a system level ("at the wall") normalized against the EDP of the sequential implementation running on the Ivy Bridge at 1.6GHz (logarithmic scale). For both benchmarks, the general trends follow those of the on-chip power values, obtaining substantial EDP reductions when using SSE, AVX and threading, as long as the input size allows it. However, the IB and the Core i5 processors switch places when measuring system power. The i5 is running on a laptop, built mainly of low power components, with a total system idle power of around 9W, while the idle system power of the IB is around 40W. This huge difference between the processor power and the system power makes the Core i5 the overall winner, not only in terms of EDP, but especially in peak power dissipation and energy consumption. As when measuring only processor power, the SB-EP machine has a high hardware overhead (idle power around 125W) that doesn't pay off for the analyzed benchmarks and input sizes.

Finally, we are interested in the possibility of estimating system power without the need for external hardware, based on the energy estimations provided by the on-chip energy counters. This can be especially useful in computing centers, due to the system complexity and the additional cost of the power measuring devices. Figure 8 shows the normalized processor energy (against one core on the same processor and frequency) minus the normalized total energy (against one core on the same system and frequency). This metric can give us an idea of how processor energy savings relate to total energy savings.
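Stated as code, the Figure 8 quantity is a difference of two independently normalized energies. The sketch below is our own restatement with hypothetical sample values, reading the figure's axis label ("MSR - System" energy saving difference) as a difference of savings; it is meant only to pin down the definition, not to reproduce any measured data point.

```c
#include <stdio.h>

/* The Figure 8 metric: energy saving seen by the on-chip (MSR)
 * counters minus the saving seen "at the wall", each relative to the
 * single-threaded run on the same machine and frequency.            */
static double saving_difference(double cpu_e, double cpu_e_1t,
                                double sys_e, double sys_e_1t) {
    double cpu_saving = 1.0 - cpu_e / cpu_e_1t;  /* on-chip saving     */
    double sys_saving = 1.0 - sys_e / sys_e_1t;  /* at-the-wall saving */
    return (cpu_saving - sys_saving) * 100.0;    /* percentage points  */
}

int main(void) {
    /* Hypothetical sample: a run uses 40% of the 1-thread package
     * energy but 55% of the 1-thread system energy, so the processor
     * saves 15 percentage points more than the full system.          */
    printf("difference = %.1f%%\n",
           saving_difference(40.0, 100.0, 55.0, 100.0));
    return 0;
}
```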
There are two key elements that seem to affect this relationship: runtime, and the processor to system power ratio. Both the Core i5 machine and the SB-EP have a processor to system power ratio between 40 and 60%, according to Figure 5. When this ratio gets close to 50%, the difference between system and processor savings is relatively small, ±5%. Runtime also has an impact on this ratio. When runtime reduction comes from additional power, it increases the difference between processor energy and total energy. However, when the reduction in runtime comes "for free" in terms of power (i.e., SSE or AVX), the difference is reduced.

We can conclude that the relationship between processor and total energy savings is not straightforward, and depends a lot on how resources are used and on the ratio between processor and total power. As part of our future work we want to develop and validate a model that will help developers optimize the energy efficiency of their applications without relying on expensive external power measuring hardware, and this paper represents a first step in that direction.

IV. RELATED WORK

There has been much research on energy efficiency and software performance optimizations in the past years.

Ge et al. [28] show how the PowerPack framework can be used to study in depth the energy efficiency of parallel applications on clusters with multi-core nodes. The framework is measurement based, and can be used to identify the energy consumption of all major system components. Molka et al. [29] discuss weaknesses of the Green500 list with respect to ranking HPC system energy efficiency. They introduce their own benchmark, using a parallel workload generator to stress the main components in an HPC system. Anzt et al. [30] present an energy performance analysis of different iterative solver implementations on a hybrid CPU-GPU system. The study is based on empirical measurements, and energy is saved by using DVFS (Dynamic Voltage and Frequency Scaling) to lower the CPU clock frequency while computations are offloaded to the GPU. Ferdman et al. [31] introduce a benchmark suite (CloudSuite) for emerging scale-out workloads, and evaluate and compare it against state of the art benchmark suites such as PARSEC and SPLASH-2 in terms of performance, cache misses, branch prediction misses, etc. However, none of these approaches analyzes how vectorization and parallelization affect the energy efficiency of the different applications.

Li and Martinez [32] develop and use an analytical model of the power-performance implications of degree of parallelism and voltage/frequency scaling. They confirm their analytical results by detailed simulation. This analytical model does not take into account the effects of vectorization. Bienia et al. [33] characterize the PARSEC suite's working sets and the communication patterns among the threads. Ghose et al. [24] evaluate different vectorization options for the Streamcluster PARSEC kernel. The main difference with our work is that none of them perform any energy evaluation, neither at a core level nor "at the wall". Also, the vectorization proposed by Ghose is slightly different from our code (we use SSE3-specific instructions for horizontal adds).

Finally, Totoni et al. [34] compare the power and performance of several Intel CPUs and GPUs, but do not analyze the effects of vectorization on the energy efficiency of the systems. Kim et al. [22] show how blocking, vectorization and minor algorithmic changes can speed up applications close to the best tuned version known; however, they do not perform an energy evaluation of their experiments.

V. CONCLUSIONS AND FUTURE WORK

Energy efficiency has become one of the main design factors in modern microprocessor design. Processor manufacturers from all market segments, from embedded to HPC, are increasing their focus on this aspect of computer system design. There are four base pillars in these more energy efficient systems: vectorization, parallelization, specialization and heterogeneity. The Intel Sandy Bridge and Ivy Bridge families of processors are examples of this design paradigm. However, several questions arise: How should software be designed in order to obtain the best energy efficiency? What is the optimal platform for my software from those available in the market?

As a first step towards possible answers to these questions, we performed a case study of two benchmarks from the PARSEC 2.1 suite, analyzing the energy efficiency effects of vectorization and parallelization on different processors in Intel's Sandy and Ivy Bridge families, from mobile to server. When measuring energy at a core level (using the MSRs provided by the different Intel processors) we found that the Ivy Bridge machine running at default speed was the most energy efficient, for all possible combinations of threads and vectorization techniques. The 22nm technology and the aggressive power saving mechanisms make this processor the best in both power and performance. The mobile processor couldn't match the Ivy Bridge in terms of performance, and the Sandy Bridge EP system could not take advantage of all of its resources for the analyzed benchmarks and inputs.

However, accounting for total system power (or "at the wall" power), results favor the Core i5 processor and the SB-EP. For these platforms, the processor power represents around 50-60% of the total power, while for the SB and IB platforms it is around 20-40%. The smaller system power overhead should make these platforms more energy efficient, and it does for the Core i5. Still, due to low resource utilization on the SB-EP machine, the SB and IB machines perform better in overall energy reduction. As future work it would be interesting to provide some feedback to the programmer, to improve energy efficiency by making better use of the system resources, e.g. running several tasks simultaneously when the problem size is too small to make use of threading.

We have also shown that processor energy savings do not translate directly into total energy savings on all platforms. This relationship depends mainly on resource utilization and the processor to system power ratio. Creating a better model to estimate total energy savings based on on-chip energy data makes an interesting direction for further work. In addition, we found some difficulties for the default Linux kernel to deal with threading on the dual-socket (2 x 8 multicore) system. The problem can be solved by affinity control, but this appears to be a poor solution, so more research in this area is needed to obtain better results without programmer intervention.

Our ultimate goal is to understand the energy implications of every subsystem, and to use that knowledge to develop methods for more energy efficient computers, both through software and hardware improvements, and their interactions.

Hardware will provide more means for energy readouts supplementing the current performance counters, but there will also be a need for abstractions and techniques so the programmer may control energy efficiency without losing too much portability and programmability.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the support of the PRACE 2IP project, the NOTUR project, and the HiPEAC Network of Excellence.

REFERENCES

[1] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011, pp. 365–376. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000108
[2] Intel, Intel® 64 and IA-32 Architecture Software Development Manual, http://download.intel.com/products/processor/manual/325462.pdf, Aug. 2012.
[3] Intel Corporation, "White paper: Intel Turbo Boost technology in Intel Core microarchitecture (Nehalem) based processors," 2008.
[4] Intel, "White paper: Intel Energy Checker, how green is your software?" 2010.
[5] "Mont Blanc project website," http://www.montblanc-project.eu/.
[6] "The Green 500 - Ranking the World's Most Energy Efficient Supercomputers," http://www.green500.org.
[7] Intel, Avoiding AVX-SSE Transition Penalties, http://software.intel.com/file/39798, Nov. 2011.
[8] N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo, "White paper: Intel AVX: New frontiers in performance improvements and energy efficiency," 2008.
[9] S. Kaxiras and M. Martonosi, Computer Architecture Techniques for Power-Efficiency, 1st ed. Morgan and Claypool Publishers, 2008.
[10] S. Rivoire, M. Shah, P. Ranganathan, C. Kozyrakis, and J. Meza, "Models and metrics to enable energy-efficiency optimizations," Computer, vol. 40, no. 12, pp. 39–48, Dec. 2007.
[11] P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey, "A voltage reduction technique for digital systems," in Proceedings of the 37th IEEE International Solid-State Circuits Conference, Digest of Technical Papers, 1990, pp. 238–239.
[12] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli, "Dynamic voltage scaling and power management for portable systems," in Proceedings of the Design Automation Conference, 2001, pp. 524–529.
[13] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark, "Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005, pp. 178–189.
[14] M. J. Flynn and P. Hung, "Microprocessor design issues: Thoughts on the road ahead," IEEE Micro, vol. 25, no. 3, pp. 16–31, 2005.
[15] A. Keshavarzi et al., "Intrinsic IDDQ: Origins, reduction, and applications in deep sub-µ low-power CMOS ICs," in Proceedings of the IEEE International Test Conference, 1997.
[16] N. S. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan, "Leakage current: Moore's law meets static power," Computer, vol. 36, no. 12, pp. 68–75, 2003.
[17] A. Chandrakasan and R. Brodersen, Low Power CMOS Design. IEEE Press, 1998.
[18] A. P. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High-Performance Microprocessor Circuits, 1st ed. Wiley-IEEE Press, 2000.
[19] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories," in Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 90–95.
[20] A. Iyer, "Demystify power gating and stop leakage cold," http://www.powermanagementdesignline.com/howto/181500691. [Online]. Available: http://www.phoronix.com/scan.php?page=news_item&px=OTgxNQ
[21] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in Proceedings of the 2004 International Symposium on Low Power Electronics and Design, ser. ISLPED '04. New York, NY, USA: ACM, 2004, pp. 32–37. [Online]. Available: http://doi.acm.org/10.1145/1013235.1013249
[22] C. Kim, N. Satish, J. Chhugani, H. Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, and P. Dubey, "Technical report: Closing the ninja performance gap through traditional programming and compiler technology," 2012.
[23] L. Adhianto, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. Tallent, "HPCToolkit: Performance measurement and analysis for supercomputers with node-level parallelism," 2010.
[24] S. Ghose, S. Srinath, and J. Tse, "Accelerating a PARSEC benchmark using portable subword SIMD," CS 5220: Final Project Report, 2011.
[25] S. Kumar, C. J. Hughes, and A. Nguyen, "Intel Technology Journal 11(3): Architectural support for fine-grained parallelism on multi-core architectures," 2007.
[26] Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/Assets/en_US/PDF/manual/248966.pdf, Jun. 2011.
[27] J. A. Butts and G. S. Sohi, "A static power model for architects," in Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000, pp. 191–201.
[28] R. Ge, X. Feng, S. Song, H.-C. Chang, D. Li, and K. Cameron, "PowerPack: Energy profiling and analysis of high-performance systems and applications," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp. 658–671, May 2010.
[29] D. Molka, D. Hackenberg, R. Schöne, T. Minartz, and W. Nagel, "Flexible workload generation for HPC cluster efficiency benchmarking," Computer Science - Research and Development, pp. 1–9, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00450-011-0194-9
[30] H. Anzt, M. Castillo, J. Fernández, V. Heuveline, F. Igual, R. Mayo, and E. Quintana-Ortí, "Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors," Computer Science - Research and Development, pp. 1–9, 2011.
[31] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, "Clearing the Clouds: A study of emerging scale-out workloads on modern hardware," in 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012. Recognized as Best Paper by the program committee.
[32] J. Li and J. F. Martínez, "Power-performance considerations of parallel computing on chip multiprocessors," ACM Transactions on Architecture and Code Optimization, vol. 2, no. 4, pp. 397–422, Dec. 2005. [Online]. Available: http://doi.acm.org/10.1145/1113841.1113844
[33] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '08, 2008, pp. 72–81.
[34] E. Totoni, B. Behzad, S. Ghike, and J. Torrellas, "Comparing the power and performance of Intel's SCC to state-of-the-art CPUs and GPUs," in IEEE International Symposium on Performance Analysis of Systems and Software, 2012, pp. 78–87.
