
The Design of a Practical System for Fault-Tolerant Virtual Machines

Daniel J. Scales, Mike Nelson, and Ganesh Venkitachalam
VMware, Inc.
{scales,mnelson,ganesh}@vmware.com
ABSTRACT

We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. In addition, the data bandwidth needed to keep the primary and secondary VM executing in lockstep is less than 20 Mbit/s for several real applications, which allows for the possibility of implementing fault tolerance over longer distances. An easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide performance results for both micro-benchmarks and real applications.

1. INTRODUCTION

A common approach to implementing fault-tolerant servers is the primary/backup approach [1], where a backup server is always available to take over if the primary server fails. The state of the backup server must be kept nearly identical to the primary server at all times, so that the backup server can take over immediately when the primary fails, and in such a way that the failure is hidden to external clients and no data is lost. One way of replicating the state on the backup server is to ship changes to all state of the primary, including CPU, memory, and I/O devices, to the backup nearly continuously. However, the bandwidth needed to send this state, particularly changes in memory, can be very large.

A different method for replicating servers that can use much less bandwidth is sometimes referred to as the state-machine approach [13]. The idea is to model the servers as deterministic state machines that are kept in sync by starting them from the same initial state and ensuring that they receive the same input requests in the same order. Since most servers or services have some operations that are not deterministic, extra coordination must be used to ensure that a primary and backup are kept in sync. However, the amount of extra information needed to keep the primary and backup in sync is far less than the amount of state (mainly memory updates) that is changing in the primary.

Implementing coordination to ensure deterministic execution of physical servers [14] is difficult, particularly as processor frequencies increase. In contrast, a virtual machine (VM) running on top of a hypervisor is an excellent platform for implementing the state-machine approach. A VM can be considered a well-defined state machine whose operations are the operations of the machine being virtualized (including all its devices). As with physical servers, VMs have some non-deterministic operations (e.g. reading a time-of-day clock or delivery of an interrupt), and so extra information must be sent to the backup to ensure that it is kept in sync. Since the hypervisor has full control over the execution of a VM, including delivery of all inputs, the hypervisor is able to capture all the necessary information about non-deterministic operations on the primary VM and to replay these operations correctly on the backup VM. Hence, the state-machine approach can be implemented for virtual machines on commodity hardware, with no hardware modifications, allowing fault tolerance to be implemented immediately for the newest microprocessors. In addition, the low bandwidth required for the state-machine approach allows for the possibility of greater physical separation of the primary and the backup. For example, replicated virtual machines can be run on physical machines distributed across a campus, which provides more reliability than VMs running in the same building.

We have implemented fault-tolerant VMs using the primary/backup approach on the VMware vSphere 4.0 platform, which runs fully virtualized x86 virtual machines in a highly-efficient manner. Since VMware vSphere implements a complete x86 virtual machine, we are automatically able to provide fault tolerance for any x86 operating systems and applications. The base technology that allows us to record the execution of a primary and ensure that the backup executes identically is known as deterministic replay [15]. VMware vSphere Fault Tolerance (FT) is based on deterministic replay, but adds in the necessary extra protocols and functionality to build a complete fault-tolerant system. In addition to providing hardware fault tolerance, our system automatically restores redundancy after a failure by starting a new backup virtual machine on any available server in the local cluster. At this time, the production versions of both deterministic replay and VMware FT support only uni-processor VMs. Recording and replaying the execution of a multi-processor VM is still work in progress, with significant performance issues because nearly every access to shared memory can be a non-deterministic operation.
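The state-machine approach can be illustrated with a toy sketch (plain Python invented for this illustration, not VMware's code): two deterministic replicas that start from the same initial state and consume the same inputs in the same order never diverge, and a non-deterministic read stays replicable only if the primary logs the value it observed and ships that value to the backup.

```python
import random

class CounterMachine:
    """A toy deterministic state machine: the next state depends only
    on the current state and the next input."""
    def __init__(self):
        self.state = 0

    def apply(self, op, arg):
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg
        return self.state

# Same initial state + same inputs in the same order => no divergence.
primary, backup = CounterMachine(), CounterMachine()
for op, arg in [("add", 5), ("mul", 3), ("add", -2)]:
    primary.apply(op, arg)
    backup.apply(op, arg)
assert primary.state == backup.state

# A non-deterministic operation (standing in for a time-of-day read)
# breaks replication unless the primary logs the value it observed
# and the backup replays the logged value instead of sampling its own.
log = []
def nondet_read_primary():
    v = random.randrange(100)   # non-deterministic value
    log.append(v)               # the "extra information" sent to the backup
    return v

def nondet_read_backup():
    return log.pop(0)           # replay the primary's observed value

primary.apply("add", nondet_read_primary())
backup.apply("add", nondet_read_backup())
assert primary.state == backup.state
```

Note that the log carries only the small observed value, not the machine's whole state — the intuition behind the bandwidth advantage claimed above.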
Bressoud and Schneider [3] describe a prototype implementation of fault-tolerant VMs for the HP PA-RISC platform. Our approach is similar, but we have made some fundamental changes for performance reasons and investigated a number of design alternatives. In addition, we have had to design and implement many additional components in the system and deal with a number of practical issues to build a complete system that is efficient and usable by customers running enterprise applications. Similar to most other practical systems discussed, we only attempt to deal with fail-stop failures [12], which are server failures that can be detected before the failing server causes an incorrect externally visible action.

The rest of the paper is organized as follows. First, we describe our basic design and detail our fundamental protocols that ensure that no data is lost if a backup VM takes over after a primary VM fails. Then, we describe in detail many of the practical issues that must be addressed to build a robust, complete, and automated system. We also describe several design choices that arise for implementing fault-tolerant VMs and discuss the tradeoffs in these choices. Next, we give performance results for our implementation for some benchmarks and some real enterprise applications. Finally, we describe related work and conclude.

2. BASIC FT DESIGN

[Figure 1: Basic FT Configuration. The primary VM and backup VM run on separate servers, connected by the logging channel and sharing a disk.]

Figure 1 shows the basic setup of our system for fault-tolerant VMs. For a given VM for which we desire to provide fault tolerance (the primary VM), we run a backup VM on a different physical server that is kept in sync and executes identically to the primary virtual machine, though with a small time lag. We say that the two VMs are in virtual lockstep. The virtual disks for the VMs are on shared storage (such as a Fibre Channel or iSCSI disk array), and therefore accessible to the primary and backup VM for input and output. (We will discuss a design in which the primary and backup VM have separate non-shared virtual disks in Section 4.1.) Only the primary VM advertises its presence on the network, so all network inputs come to the primary VM. Similarly, all other inputs (such as keyboard and mouse) go only to the primary VM.

All input that the primary VM receives is sent to the backup VM via a network connection known as the logging channel. For server workloads, the dominant input traffic is network and disk. Additional information, as discussed below in Section 2.1, is transmitted as necessary to ensure that the backup VM executes non-deterministic operations in the same way as the primary VM. The result is that the backup VM always executes identically to the primary VM. However, the outputs of the backup VM are dropped by the hypervisor, so only the primary produces actual outputs that are returned to clients. As described in Section 2.2, the primary and backup VM follow a specific protocol, including explicit acknowledgments by the backup VM, in order to ensure that no data is lost if the primary fails.

To detect if a primary or backup VM has failed, our system uses a combination of heartbeating between the relevant servers and monitoring of the traffic on the logging channel. In addition, we must ensure that only one of the primary or backup VM takes over execution, even if there is a split-brain situation where the primary and backup servers have lost communication with each other.

In the following sections, we provide more details on several important areas. In Section 2.1, we give some details on the deterministic replay technology that ensures that primary and backup VMs are kept in sync via the information sent over the logging channel. In Section 2.2, we describe a fundamental rule of our FT protocol that ensures that no data is lost if the primary fails. In Section 2.3, we describe our methods for detecting and responding to a failure in a correct fashion.

2.1 Deterministic Replay Implementation

As we have mentioned, replicating server (or VM) execution can be modeled as the replication of a deterministic state machine. If two deterministic state machines are started in the same initial state and provided the exact same inputs in the same order, then they will go through the same sequences of states and produce the same outputs. A virtual machine has a broad set of inputs, including incoming network packets, disk reads, and input from the keyboard and mouse. Non-deterministic events (such as virtual interrupts) and non-deterministic operations (such as reading the clock cycle counter of the processor) also affect the VM's state. This presents three challenges for replicating execution of any VM running any operating system and workload: (1) correctly capturing all the input and non-determinism necessary to ensure deterministic execution of a backup virtual machine, (2) correctly applying the inputs and non-determinism to the backup virtual machine, and (3) doing so in a manner that doesn't degrade performance. In addition, many complex operations in x86 microprocessors have undefined, hence non-deterministic, side effects. Capturing these undefined side effects and replaying them to produce the same state presents an additional challenge.

VMware deterministic replay [15] provides exactly this functionality for x86 virtual machines on the VMware vSphere platform. Deterministic replay records the inputs of a VM and all possible non-determinism associated with the VM execution in a stream of log entries written to a log file. The VM execution may be exactly replayed later by reading the log entries from the file. For non-deterministic operations, sufficient information is logged to allow the operation to be reproduced with the same state change and output. For non-deterministic events such as timer or IO completion interrupts, the exact instruction at which the event occurred is also recorded. During replay, the event is delivered at the same point in the instruction stream.
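The log-and-deliver mechanism just described can be sketched as follows (a toy illustration with invented entry kinds and fields, not vSphere's actual log format): each non-deterministic event is recorded together with the instruction position at which it occurred, and replay delivers it at exactly that position in the instruction stream.

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    kind: str         # e.g. "timer_irq", "rdtsc" -- names invented for illustration
    instr_count: int  # instruction position where the event must be delivered
    payload: int      # logged value needed to reproduce the operation

@dataclass
class ReplayLog:
    entries: list = field(default_factory=list)

    # --- recording on the primary ---
    def record(self, kind, instr_count, payload=0):
        self.entries.append(LogEntry(kind, instr_count, payload))

    # --- replaying on the backup ---
    def replay(self, deliver, total_instrs):
        """Step through the instruction stream, delivering each logged
        event at exactly the instruction count where it was recorded."""
        pending = iter(self.entries)
        nxt = next(pending, None)
        for ic in range(total_instrs + 1):
            while nxt is not None and nxt.instr_count == ic:
                deliver(nxt)            # same point in the instruction stream
                nxt = next(pending, None)

# Record a timer interrupt at instruction 1000 and a cycle-counter
# read at instruction 1500 whose observed value was 123456.
log = ReplayLog()
log.record("timer_irq", 1000)
log.record("rdtsc", 1500, 123456)

# Replay delivers the events at identical instruction positions.
delivered = []
log.replay(lambda e: delivered.append((e.kind, e.instr_count)), total_instrs=2000)
assert delivered == [("timer_irq", 1000), ("rdtsc", 1500)]
```

The sketch loops instruction by instruction only for clarity; the production system instead uses hardware performance counters to interrupt execution precisely at the recorded position.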
VMware deterministic replay implements an efficient event recording and event delivery mechanism that employs various techniques, including the use of hardware performance counters developed in conjunction with AMD [2] and Intel [8].

Bressoud and Schneider [3] mention dividing the execution of a VM into epochs, where non-deterministic events such as interrupts are only delivered at the end of an epoch. The notion of epoch seems to be used as a batching mechanism because it is too expensive to deliver each interrupt separately at the exact instruction where it occurred. However, our event delivery mechanism is efficient enough that VMware deterministic replay has no need to use epochs. Each interrupt is recorded as it occurs and efficiently delivered at the appropriate instruction while being replayed.

2.2 FT Protocol

For VMware FT, we use deterministic replay to produce the necessary log entries to record the execution of the primary VM, but instead of writing the log entries to disk, we send them to the backup VM via the logging channel. The backup VM replays the entries in real time, and hence executes identically to the primary VM. However, we must augment the logging entries with a strict FT protocol on the logging channel in order to ensure that we achieve fault tolerance. Our fundamental requirement is the following:

Output Requirement: if the backup VM ever takes over after a failure of the primary, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to the external world.

Note that after a failover occurs (i.e. the backup VM takes over after the failure of the primary VM), the backup VM will likely start executing quite differently from the way the primary VM would have continued executing, because of the many non-deterministic events happening during execution. However, as long as the backup VM satisfies the Output Requirement, no externally visible state or data is lost during a failover to the backup VM, and the clients will notice no interruption or inconsistency in their service.

The Output Requirement can be ensured by delaying any external output (typically a network packet) until the backup VM has received all information that will allow it to replay execution at least to the point of that output operation. One necessary condition is that the backup VM must have received all log entries generated prior to the output operation. These log entries will allow it to execute up to the point of the last log entry. However, suppose a failure were to happen immediately after the primary executed the output operation. The backup VM must know that it must keep replaying up to the point of the output operation and only "go live" (stop replaying and take over as the primary VM, as described in Section 2.3) at that point. If the backup were to go live at the point of the last log entry before the output operation, some non-deterministic event (e.g. timer interrupt delivered to the VM) might change its execution path before it executed the output operation.

The Output Requirement may then be enforced by this specific rule:

Output Rule: the primary VM may not send an output to the external world, until the backup VM has received and acknowledged the log entry associated with the operation producing the output.

If the backup VM has received all the log entries, including the log entry for the output-producing operation, then the backup VM will be able to exactly reproduce the state of the primary VM at that output point, and so if the primary dies, the backup will correctly reach a state that is consistent with that output. Conversely, if the backup takes over without receiving all necessary log entries, then its state may quickly diverge such that it is inconsistent with the primary's output. The Output Rule is in some ways analogous to the approach described in [11], where an "externally synchronous" IO can actually be buffered, as long as it is actually written to disk before the next external communication.

Note that the Output Rule does not say anything about stopping the execution of the primary VM. We need only delay the sending of the output, but the VM itself can continue execution. Since operating systems do non-blocking network and disk outputs with asynchronous interrupts to indicate completion, the VM can easily continue execution and will not necessarily be immediately affected by the delay in the output. In contrast, previous work [3, 9] has typically indicated that the primary VM must be completely stopped prior to doing an output until the backup VM has acknowledged all necessary information from the primary VM.

As an example, we show a chart illustrating the requirements of the FT protocol in Figure 2. This figure shows a timeline of events on the primary and backup VMs. The arrows going from the primary line to the backup line represent the transfer of log entries, and the arrows going from the backup line to the primary line represent acknowledgments. Information on asynchronous events, inputs, and output operations must be sent to the backup as log entries and acknowledged. As illustrated in the figure, an output to the external world is delayed until the primary VM has received an acknowledgment from the backup VM that it has received the log entry associated with the output operation.

[Figure 2: FT Protocol. A timeline of the primary and backup VMs: async events, an input, and an output operation occur on the primary line; log entries flow to the backup and acknowledgments flow back; the output is delayed until acknowledged; after a primary failure, the backup takes over.]

Given that the Output Rule is followed, the backup
the primary VMVM haswill
re-
Givenoperation,
output the abovesome constraints, the easiestevent
non-deterministic way (e.g.
to enforce
timer be ablean
ceived to acknowledgment
take over in a state fromconsistent
the backup withVM thethat
primary’s
it has
the Output
interrupt Requirement
delivered to the isVM) to create
mightachange specialits logexecution
entry at last output.
received the log entry associated with the output operation.
path before it executed the output operation. Given that the Output Rule is followed, the backup VM will
Given the above constraints, the easiest way to enforce be able to take over in a state consistent with the primary’s
the Output Requirement is to create a special log entry at last output.
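The delaying behavior of the Output Rule can be sketched in a few lines of Python. This is a minimal illustrative model, not VMware's implementation: the `OutputGate` class, the `on_output`/`on_ack` names, and the per-log-entry sequence numbers are all hypothetical. The primary tags each external output with the sequence number of the log entry that produced it, and releases the output only once the backup has acknowledged at least that entry; execution itself is never paused.

```python
from collections import deque

class OutputGate:
    """Delays external outputs until the backup has acknowledged the
    log entry associated with each output (the Output Rule)."""

    def __init__(self):
        self.acked_seq = 0      # highest log-entry seq acknowledged by backup
        self.pending = deque()  # (required_seq, packet) awaiting release

    def on_output(self, log_seq, packet, send):
        # The primary keeps executing; only the send itself is delayed.
        if log_seq <= self.acked_seq:
            send(packet)
        else:
            self.pending.append((log_seq, packet))

    def on_ack(self, acked_seq, send):
        # Backup has acknowledged all entries up to acked_seq:
        # release every output whose log entry is now covered.
        self.acked_seq = max(self.acked_seq, acked_seq)
        while self.pending and self.pending[0][0] <= self.acked_seq:
            _, packet = self.pending.popleft()
            send(packet)
```

Note that `on_output` returns immediately in either case, mirroring the paper's point that only the sending of the output is delayed while the VM continues to run.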
We cannot guarantee that all outputs are produced exactly once in a failover situation. Without the use of transactions with two-phase commit when the primary intends to send an output, there is no way for the backup to determine if the primary crashed immediately before or after sending its last output. Fortunately, the network infrastructure (including the common use of TCP) is designed to deal with lost packets and identical (duplicate) packets. Note that incoming packets to the primary may also be lost during a failure of the primary and therefore won't be delivered to the backup. However, incoming packets may be dropped for any number of reasons unrelated to server failure, so the network infrastructure, operating systems, and applications are all written to ensure that they can compensate for lost packets.

2.3 Detecting and Responding to Failure

As mentioned above, the primary and backup VMs must respond quickly if the other VM appears to have failed. If the backup VM fails, the primary VM will go live; that is, it leaves recording mode (and hence stops sending entries on the logging channel) and starts executing normally. If the primary VM fails, the backup VM should similarly go live, but the process is a bit more complex. Because of its lag in execution, the backup VM will likely have a number of log entries that it has received and acknowledged but not yet consumed, because it has not yet reached the appropriate point in its execution. The backup VM must continue replaying its execution from the log entries until it has consumed the last log entry. At that point, the backup VM stops replaying and starts executing as a normal VM. In essence, the backup VM has been promoted to the primary VM (and is now missing a backup VM). Since it is no longer a backup VM, the new primary VM will now produce output to the external world when the guest OS does output operations. During the transition to normal mode, there may be some device-specific operations needed to allow this output to occur properly. In particular, for the purposes of networking, VMware FT automatically advertises the MAC address of the new primary VM on the network, so that physical network switches will know on what server the new primary VM is located. In addition, the newly promoted primary VM may need to reissue some disk IOs (as described in Section 3.4).

There are many possible ways to attempt to detect failure of the primary and backup VMs. VMware FT uses UDP heartbeating between servers that are running fault-tolerant VMs to detect when a server may have crashed. In addition, VMware FT monitors the logging traffic that is sent from the primary to the backup VM and the acknowledgments sent from the backup VM to the primary VM. Because of regular timer interrupts, the logging traffic should be regular and should never stop for a functioning guest OS. Therefore, a halt in the flow of log entries or acknowledgments could indicate the failure of a VM. A failure is declared if heartbeating or logging traffic has stopped for longer than a specific timeout (on the order of a few seconds).

However, any such failure detection method is susceptible to a split-brain problem. If the backup server stops receiving heartbeats from the primary server, that may indicate that the primary server has failed, or it may just mean that all network connectivity has been lost between still-functioning servers. If the backup VM then goes live while the primary VM is actually still running, there will likely be data corruption and problems for the clients communicating with the VM. Hence, we must ensure that only one of the primary or backup VM goes live when a failure is detected. To avoid split-brain problems, we make use of the shared storage that stores the virtual disks of the VM. When either a primary or backup VM wants to go live, it executes an atomic test-and-set operation on the shared storage. If the operation succeeds, the VM is allowed to go live. If the operation fails, then the other VM must have already gone live, so the current VM halts itself ("commits suicide"). If the VM cannot access the shared storage when trying to do the atomic operation, then it just waits until it can. Note that if shared storage is not accessible because of some failure in the storage network, then the VM would likely not be able to do useful work anyway, because the virtual disks reside on the same shared storage. Thus, using shared storage to resolve split-brain situations does not introduce any extra unavailability.

One final aspect of the design is that once a failure has occurred and one of the VMs has gone live, VMware FT automatically restores redundancy by starting a new backup VM on another host. Though this process is not covered in most previous work, it is fundamental to making fault-tolerant VMs useful, and it requires careful design. More details are given in Section 3.1.

3. PRACTICAL IMPLEMENTATION OF FT

Section 2 described our fundamental design and protocols for FT. However, to create a usable, robust, and automatic system, many other components must be designed and implemented.

3.1 Starting and Restarting FT VMs

One of the biggest additional components that must be designed is the mechanism for starting a backup VM in the same state as a primary VM. This mechanism will also be used when restarting a backup VM after a failure has occurred. Hence, the mechanism must be usable for a running primary VM that is in an arbitrary state (i.e. not just starting up). In addition, we would prefer that the mechanism not significantly disrupt the execution of the primary VM, since that will affect any current clients of the VM.

For VMware FT, we adapted the existing VMotion functionality of VMware vSphere. VMware VMotion [10] allows the migration of a running VM from one server to another with minimal disruption; VM pause times are typically less than a second. We created a modified form of VMotion that creates an exact running copy of a VM on a remote server, but without destroying the VM on the local server. That is, our modified FT VMotion clones a VM to a remote host rather than migrating it. The FT VMotion also sets up a logging channel, causes the source VM to enter logging mode as the primary, and causes the destination VM to enter replay mode as the new backup. Like normal VMotion, FT VMotion typically interrupts the execution of the primary VM by less than a second. Hence, enabling FT on a running VM is an easy, non-disruptive operation.

Another aspect of starting a backup VM is choosing a server on which to run it. Fault-tolerant VMs run in a cluster of servers that have access to shared storage, so all VMs can typically run on any server in the cluster. This flexibility allows VMware vSphere to restore FT redundancy even when one or more servers have failed.
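The shared-storage test-and-set that resolves split-brain (Section 2.3) can be sketched as follows. This is an illustrative stand-in, not VMware's implementation: the atomic test-and-set on shared storage is modeled here by exclusive file creation (`O_CREAT | O_EXCL`) on a lock path that plays the role of the shared-storage flag, and the function name `try_go_live` is hypothetical.

```python
import os

def try_go_live(lock_path):
    """Model of the go-live race: exclusive creation of lock_path stands
    in for an atomic test-and-set on shared storage. Returns True if this
    VM won the race and may go live; False if the other VM has already
    gone live, in which case this VM should halt itself."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True   # test-and-set succeeded: go live
    except FileExistsError:
        return False  # other VM already went live: "commit suicide"
```

If the storage itself is unreachable, the paper's policy is simply to wait and retry, since the VM cannot do useful work anyway while its virtual disks are inaccessible.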
VMware vSphere implements a clustering service that maintains management and resource information. When a failure happens and a primary VM needs a new backup VM to re-establish redundancy, the primary VM informs the clustering service that it needs a new backup. The clustering service determines the best server on which to run the backup VM, based on resource usage and other constraints, and invokes an FT VMotion to create the new backup VM. The result is that VMware FT typically can re-establish VM redundancy within minutes of a server failure, all without any noticeable interruption in the execution of a fault-tolerant VM.

Figure 3: FT Logging Buffers and Channel. (The figure shows the primary and backup VMs, each with a log buffer in its hypervisor, connected by the logging channel, with acknowledgments flowing back from the backup to the primary.)

3.2 Managing the Logging Channel

There are a number of interesting implementation details in managing the traffic on the logging channel. In our implementation, the hypervisors maintain a large buffer for logging entries for the primary and backup VMs. As the primary VM executes, it produces log entries into its log buffer, and similarly, the backup VM consumes log entries from its log buffer. The contents of the primary's log buffer are flushed out to the logging channel as soon as possible, and log entries are read into the backup's log buffer from the logging channel as soon as they arrive. The backup sends acknowledgments back to the primary each time it reads some log entries from the network into its log buffer. These acknowledgments allow VMware FT to determine when an output that is delayed by the Output Rule can be sent. Figure 3 illustrates this process.

If the backup VM encounters an empty log buffer when it needs to read the next log entry, it will stop execution until a new log entry is available. Since the backup VM is not communicating externally, this pause will not affect any clients of the VM. Similarly, if the primary VM encounters a full log buffer when it needs to write a log entry, it must stop execution until log entries can be flushed out. This stop in execution is a natural flow-control mechanism that slows down the primary VM when it is producing log entries at too fast a rate. However, this pause can affect clients of the VM, since the primary VM will be completely stopped and unresponsive until it can log its entry and continue execution. Therefore, our implementation must be designed to minimize the possibility that the primary log buffer fills up.

One reason that the primary log buffer may fill up is that the backup VM is executing too slowly and therefore consuming log entries too slowly. In general, the backup VM must be able to replay an execution at roughly the same speed as the primary VM is recording it. Fortunately, the overhead of recording and replaying in VMware deterministic replay is roughly the same. However, if the server hosting the backup VM is heavily loaded with other VMs (and hence overcommitted on resources), the backup VM may not be able to get enough CPU and memory resources to execute as fast as the primary VM, despite the best efforts of the backup hypervisor's VM scheduler.

Beyond avoiding unexpected pauses when the log buffers fill up, there is another reason why we don't wish the execution lag to become too large. If the primary VM fails, the backup VM must "catch up" by replaying all the log entries that it has already acknowledged before it goes live and starts communicating with the external world. The time to finish replaying is basically the execution lag time at the point of the failure, so the time for the backup to go live is roughly equal to the failure detection time plus the current execution lag time. Hence, we don't wish the execution lag time to be large (more than a second), since that would add significant time to the failover time.

Therefore, we have an additional mechanism to slow down the primary VM and prevent the backup VM from getting too far behind. In our protocol for sending and acknowledging log entries, we send additional information to determine the real-time execution lag between the primary and backup VMs. Typically the execution lag is less than 100 milliseconds. If the backup VM starts having a significant execution lag (say, more than 1 second), VMware FT starts slowing down the primary VM by informing the scheduler to give it a slightly smaller amount of the CPU (initially by just a few percent). We use a slow feedback loop, which will try to gradually pinpoint the appropriate CPU limit for the primary VM that will allow the backup VM to match its execution. If the backup VM continues to lag behind, we continue to gradually reduce the primary VM's CPU limit. Conversely, if the backup VM catches up, we gradually increase the primary VM's CPU limit until the backup VM returns to having a slight lag.

Note that such slowdowns of the primary VM are very rare, and typically happen only when the system is under extreme stress. All the performance numbers of Section 5 include the cost of any such slowdowns.

3.3 Operation on FT VMs

Another practical matter is dealing with the various control operations that may be applied to the primary VM. For example, if the primary VM is explicitly powered off, the backup VM should be stopped as well, and should not attempt to go live. As another example, any resource management change on the primary (such as an increased CPU share) should also be applied to the backup. For these kinds of operations, special control entries are sent on the logging channel from the primary to the backup, in order to effect the appropriate operation on the backup.

In general, most operations on the VM should be initiated only on the primary VM. VMware FT then sends any necessary control entry to cause the appropriate change on the backup VM. The only operation that can be done independently on the primary and backup VMs is VMotion. That is, the primary and backup VMs can be VMotioned independently to other hosts. Note that VMware FT ensures that neither VM is moved to the server where the other VM is, since that situation would no longer provide fault tolerance.
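The slow feedback loop of Section 3.2 can be sketched as a simple controller. This is an illustrative model only: the 1-second lag threshold and the few-percent step come from the text, but the function name `adjust_cpu_limit`, the low-lag threshold, the step size, and the bounds are assumptions made for the sketch.

```python
# Sketch of the Section 3.2 feedback loop: if the backup's execution lag
# grows too large, shave a few percent off the primary's CPU limit; once
# the backup catches up, gradually give the CPU back.
LAG_HIGH_S = 1.0   # start slowing the primary above this lag (from the text)
LAG_LOW_S = 0.1    # typical lag; stop compensating below this (assumed)
STEP_PCT = 2       # "initially by just a few percent" (assumed value)

def adjust_cpu_limit(cpu_limit_pct, execution_lag_s):
    """Return the primary VM's new CPU limit (percent) given the
    current real-time execution lag of the backup VM (seconds)."""
    if execution_lag_s > LAG_HIGH_S:
        # Backup is too far behind: gradually reduce the primary's CPU.
        return max(10, cpu_limit_pct - STEP_PCT)
    if execution_lag_s < LAG_LOW_S and cpu_limit_pct < 100:
        # Backup has caught up: gradually restore the primary's CPU.
        return min(100, cpu_limit_pct + STEP_PCT)
    return cpu_limit_pct
```

A real hypervisor would apply the returned limit through its scheduler; the point here is only the slow, symmetric converge-and-restore behavior that keeps the lag small without abrupt swings.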
since that situation would no longer provide fault tolerance. for the newly-promoted primary VM to be sure if the disk
VMotion of a primary VM adds some complexity over IOs were issued to the disk or completed successfully. In
a normal VMotion, since the backup VM must disconnect addition, because the disk IOs were not issued externally on
from that
since the source primary
situation wouldand re-connect
no longer to the
provide faultdestination
tolerance. the the
for backup VM, there will
newly-promoted be no VM
primary explicit
to beIOsure
completion for
if the disk
primary
VMotion VMofatathe appropriate
primary VM addstime.some
VMotion of a backup
complexity over them as the
IOs were newly-promoted
issued to the disk primary VM continues
or completed to run,
successfully. In
VM
a has a VMotion,
normal similar issue,
sincebuttheadds an additional
backup VM mustcomplexity.
disconnect which would
addition, eventually
because the diskcause
IOsthewereguest operating
not issued system on
externally in
For a the
from normal VMotion,
source primaryweand require that alltooutstanding
re-connect the destinationdisk the VM
backupto start
VM, an abort
there willorbe
reset procedure.
no explicit We could send
IO completion for
IOs be quiesced
primary VM at the(i.e.appropriate
completed)time.just as the finalofswitchover
VMotion a backup an error
them completion
as the that indicates
newly-promoted primary thatVM
each IO failed,
continues tosince
run,
on the
VM hasVMotion
a similaroccurs.
issue, butForadds
a primary VM, this
an additional quiescing
complexity. it is acceptable
which to returncause
would eventually an error even if
the guest the IO completed
operating system in
is easily
For handled
a normal by waiting
VMotion, until the
we require thatphysical IOs complete
all outstanding disk successfully.
the VM to start However,
an aborttheorguest
resetOS might not
procedure. We respond well
could send
and be
IOs delivering
quiescedthese completionsjust
(i.e. completed) to the VM.
as the However,
final switchoverfor to errors
an from its local
error completion disk.
that Instead,
indicates we each
that re-issue
IO the pending
failed, since
a backup
on VM, there
the VMotion is noFor
occurs. easy way to cause
a primary VM, thisall IOs to be
quiescing IOs
it is during the go-live
acceptable process
to return of theeven
an error backup VM.
if the IOBecause
completedwe
completed
is at any by
easily handled required
waitingpoint,
until since the backup
the physical IOs VM must
complete have eliminated
successfully. all races
However, theand
guestall OS
IOsmight
specifynotdirectly
respond which
well
replay
and the primary
delivering theseVM’s executiontoand
completions thecomplete IOs at the
VM. However, for memory
to errors and
fromdisk blocks
its local areInstead,
disk. accessed, wethese diskthe
re-issue operations
pending
same
a execution
backup point.is The
VM, there primary
no easy way VM may all
to cause be running
IOs to be a can during
IOs be re-issued evenprocess
the go-live if theyofhave alreadyVM.
the backup completed
Becausesuc-
we
workload inatwhich
completed there are
any required always
point, diskthe
since IOs in flight
backup VMduring
must cessfully
have (i.e. they
eliminated allare idempotent).
races and all IOs specify directly which
normalthe
replay execution.
primaryVMware FT has aand
VM’s execution unique method
complete IOstoatsolve
the memory and disk blocks are accessed, these disk operations
this
sameproblem.
executionWhenpoint.a backup VM is VM
The primary at the
may final
be switchover
running a 3.5
can beImplementation
re-issued even if they Issues for Network
have already completed IOsuc-
point for in
workload a which
VMotion,thereitare
requests
always via
diskthe
IOslogging
in flightchannel
during cessfully (i.e. they are idempotent).
3.4 Implementation Issues for Disk IOs

There are a number of subtle implementation issues related to disk IO.
First, given that disk operations are non-blocking and so can execute
in parallel, simultaneous disk operations that access the same disk
location can lead to non-determinism. Also, our implementation of disk
IO uses DMA directly to/from the memory of the virtual machines, so
simultaneous disk operations that access the same memory pages can also
lead to non-determinism. Our solution is generally to detect any such
IO races (which are rare), and force such racing disk operations to
execute sequentially in the same way on the primary and backup.
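The race rule above can be sketched as a small scheduler. This is a toy
model under assumed interfaces, not VMware's code: operations carry
half-open `(start, end)` disk and memory ranges, and an operation that
overlaps one already in flight is deferred until the earlier one
completes, so primary and backup order racing IOs identically.

```python
class DiskScheduler:
    def __init__(self):
        self.in_flight = []   # list of (disk_range, mem_range) tuples
        self.queue = []       # racing operations, serialized behind earlier IOs

    @staticmethod
    def _overlaps(a, b):
        # Half-open interval overlap test.
        return a[0] < b[1] and b[0] < a[1]

    def submit(self, op):
        """op = {'disk': (start, end), 'mem': (start, end)}"""
        races = any(self._overlaps(op['disk'], d) or self._overlaps(op['mem'], m)
                    for d, m in self.in_flight)
        if races:
            self.queue.append(op)   # execute sequentially, in submission order
            return 'deferred'
        self.in_flight.append((op['disk'], op['mem']))
        return 'issued'

    def complete(self, op):
        self.in_flight.remove((op['disk'], op['mem']))
        # Re-submit deferred operations now that the racing IO is done.
        pending, self.queue = self.queue, []
        return [self.submit(p) for p in pending]
```

Because both VMs run the same deterministic scheduler over the same
sequence of IOs, the (rare) serialization decisions come out the same
on the primary and the backup.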
Second, a disk operation can also race with a memory access by an
application (or OS) in a VM, because the disk operations directly
access the memory of a VM via DMA. For example, there could be a
non-deterministic result if an application/OS in a VM is reading a
memory block at the same time a disk read is occurring to that block.
This situation is also unlikely, but we must detect it and deal with it
if it happens. One solution is to set up page protection temporarily on
pages that are targets of disk operations. The page protections result
in a trap if the VM happens to make an access to a page that is also
the target of an outstanding disk operation, and the VM can be paused
until the disk operation completes. Because changing MMU protections on
pages is an expensive operation, we choose instead to use bounce
buffers. A bounce buffer is a temporary buffer that has the same size
as the memory being accessed by a disk operation. A disk read operation
is modified to read the specified data to the bounce buffer, and the
data is copied to guest memory only as the IO completion is delivered.
Similarly, for a disk write operation, the data to be sent is first
copied to the bounce buffer, and the disk write is modified to write
data from the bounce buffer. The use of the bounce buffer can slow down
disk operations, but we have not seen it cause any noticeable
performance loss.

Third, there are some issues associated with disk IOs that are
outstanding (i.e. not completed) on the primary when a failure happens,
and the backup takes over. There is no way for the newly-promoted
primary VM to be sure if the disk IOs were issued to the disk or
completed successfully. In addition, because the disk IOs were not
issued externally on the backup VM, there will be no explicit IO
completion for them as the newly-promoted primary VM continues to run,
which would eventually cause the guest operating system in the VM to
start an abort or reset procedure. We could send an error completion
that indicates that each IO failed, since it is acceptable to return an
error even if the IO completed successfully. However, the guest OS
might not respond well to errors from its local disk. Instead, we
re-issue the pending IOs during the go-live process of the backup VM.
Because we have eliminated all races and all IOs specify directly which
memory and disk blocks are accessed, these disk operations can be
re-issued even if they have already completed successfully (i.e. they
are idempotent).

3.5 Implementation Issues for Network IO

VMware vSphere provides many performance optimizations for VM
networking. Some of these optimizations are based on the hypervisor
asynchronously updating the state of the virtual machine's network
device. For example, receive buffers can be updated directly by the
hypervisor while the VM is executing. Unfortunately these asynchronous
updates to a VM's state add non-determinism. Unless we can guarantee
that all updates happen at the same point in the instruction stream on
the primary and the backup, the backup's execution can diverge from
that of the primary.

The biggest change to the networking emulation code for FT is the
disabling of the asynchronous network optimizations. The code that
asynchronously updates VM ring buffers with incoming packets has been
modified to force the guest to trap to the hypervisor, where it can log
the updates and then apply them to the VM. Similarly, code that
normally pulls packets out of transmit queues asynchronously is
disabled for FT, and instead transmits are done through a trap to the
hypervisor (except as noted below).

The elimination of the asynchronous updates of the network device
combined with the delaying of sending packets described in Section 2.2
has provided some performance challenges for networking. We've taken
two approaches to improving VM network performance while running FT.
First, we implemented clustering optimizations to reduce VM traps and
interrupts. When the VM is streaming data at a sufficient bit rate, the
hypervisor can do one transmit trap per group of packets and, in the
best case, zero traps, since it can transmit the packets as part of
receiving new packets. Likewise, the hypervisor can reduce the number
of interrupts to the VM for incoming packets by only posting the
interrupt for a group of packets.

Our second performance optimization for networking involves reducing
the delay for transmitted packets. As noted earlier, the hypervisor
must delay all transmitted packets until it gets an acknowledgment from
the backup for the appropriate log entries. The key to reducing the
transmit delay is to reduce the time required to send a log message to
the backup and get an acknowledgment. Our primary optimizations in this
area involve ensuring that sending and receiving log entries and
acknowledgments can all be done without any thread context switch. The
VMware vSphere hypervisor allows functions to be registered with the
TCP stack that will be called from a deferred-execution context
(similar to a tasklet in Linux) whenever TCP data is received. This
allows us to quickly handle any incoming log messages on the backup and
any acknowledgments received by the primary without any thread context
switches. In addition, when the primary VM enqueues a packet to be
transmitted, we force an immediate log flush of the associated output
log entry (as described in Section 2.2) by scheduling a
deferred-execution context to do the flush.
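The transmit path described above — hold each outgoing packet until the
backup acknowledges its log entry, and flush the log entry immediately
rather than batching it — can be sketched as follows. This is a hedged
toy model, not vSphere code; `PrimaryNet` and its sequence numbers are
invented for illustration.

```python
class PrimaryNet:
    """Toy model of the Output Rule applied to network transmits."""
    def __init__(self, send_log):
        self.send_log = send_log   # flushes a log entry to the backup at once
        self.held = {}             # log sequence number -> packets awaiting ack
        self.next_seq = 0

    def transmit(self, packet):
        seq = self.next_seq
        self.next_seq += 1
        self.held.setdefault(seq, []).append(packet)
        self.send_log(seq)         # immediate flush: no batching delay
        return seq

    def on_ack(self, seq):
        # The backup has acknowledged log entry `seq`; release its packets
        # to the wire.
        return self.held.pop(seq, [])


flushed = []
net = PrimaryNet(send_log=flushed.append)
seq = net.transmit(b'payload')
assert flushed == [seq]                  # flushed as soon as it is enqueued
assert net.on_ack(seq) == [b'payload']   # released only after the ack
```

The measured transmit delay in this scheme is exactly one log-send plus
one acknowledgment, which is why the paper's optimizations focus on
making that round trip context-switch free.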
4. DESIGN ALTERNATIVES

In our implementation of VMware FT, we have explored a number of
interesting design alternatives. In this section, we explore some of
these alternatives.

4.1 Shared vs. Non-shared Disk

In our default design, the primary and backup VMs share the same
virtual disks. Therefore, the content of the shared disks is naturally
correct and available if a failover occurs. Essentially, the shared
disk is considered external to the primary and backup VMs, so any write
to the shared disk is considered a communication to the external world.
Therefore, only the primary VM does actual writes to the disk, and
writes to the shared disk must be delayed in accordance with the Output
Rule.

An alternative design is for the primary and backup VMs to have
separate (non-shared) virtual disks. In this design, the backup VM does
do all disk writes to its virtual disks, and in doing so, it naturally
keeps the contents of its virtual disks in sync with the contents of
the primary VM's virtual disks. Figure 4 illustrates this
configuration. In the case of non-shared disks, the virtual disks are
essentially considered part of the internal state of each VM.
Therefore, disk writes of the primary do not have to be delayed
according to the Output Rule. The non-shared design is quite useful in
cases where shared storage is not accessible to the primary and backup
VMs. This may be the case because shared storage is unavailable or too
expensive, or because the servers running the primary and backup VMs
are far apart ("long-distance FT"). One disadvantage of the non-shared
design is that the two copies of the virtual disks must be explicitly
synced up in some manner when fault tolerance is first enabled. In
addition, the disks can get out of sync after a failure, so they must
be explicitly resynced when the backup VM is restarted after a failure.
That is, FT VMotion must not only sync the running state of the primary
and backup VMs, but also their disk state.

Figure 4: FT Non-shared Disk Configuration. [The figure shows the
primary and backup VMs, each on its own hypervisor, connected by the
logging channel; each VM performs disk reads and writes to its own,
initially synced, virtual disks.]

In the non-shared-disk configuration, there may be no shared storage to
use for dealing with a split-brain situation. In this case, the system
could use some other external tiebreaker, such as a third-party server
that both servers can talk to. If the servers are part of a cluster
with more than two nodes, the system could alternatively use a majority
algorithm based on cluster membership. In this case, a VM would only be
allowed to go live if it is running on a server that is part of a
communicating sub-cluster that contains a majority of the original
nodes.
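The majority rule sketched in the preceding paragraph can be written
down in a few lines. The paper gives no code for this; the function
below is a hypothetical illustration of the go-live decision under a
simple strict-majority quorum.

```python
def may_go_live(original_nodes, reachable_nodes):
    """Return True if this server's sub-cluster may promote its VM.

    reachable_nodes should include the server itself; only members of
    the original cluster count toward the quorum.
    """
    quorum = len(original_nodes) // 2 + 1
    return len(set(reachable_nodes) & set(original_nodes)) >= quorum


# A 5-node cluster split 3/2: only the 3-node side may promote its VM,
# so at most one side ever goes live and split-brain is avoided.
cluster = ['a', 'b', 'c', 'd', 'e']
assert may_go_live(cluster, ['a', 'b', 'c']) is True
assert may_go_live(cluster, ['d', 'e']) is False
```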
4.2 Executing Disk Reads on the Backup VM

In our default design, the backup VM never reads from its virtual disk
(whether shared or non-shared). Since the disk read is considered an
input, it is natural to send the results of the disk read to the backup
VM via the logging channel.

An alternate design is to have the backup VM execute disk reads and
therefore eliminate the logging of disk read data. This approach can
greatly reduce the traffic on the logging channel for workloads that do
a lot of disk reads. However, this approach has a number of subtleties.
It may slow down the backup VM's execution, since the backup VM must
execute all disk reads and wait if they are not physically completed
when it reaches the point in the VM execution where they completed on
the primary.

Also, some extra work must be done to deal with failed disk read
operations. If a disk read by the primary succeeds but the
corresponding disk read by the backup fails, then the disk read by the
backup must be retried until it succeeds, since the backup must get the
same data in memory that the primary has. Conversely, if a disk read by
the primary fails, then the contents of the target memory must be sent
to the backup via the logging channel, since the contents of memory
will be undetermined and not necessarily replicated by a successful
disk read by the backup VM.

Finally, there is a subtlety if this disk-read alternative is used with
the shared disk configuration. If the primary VM does a read to a
particular disk location, followed fairly soon by a write to the same
disk location, then the disk write must be delayed until the backup VM
has executed the first disk read. This dependence can be detected and
handled correctly, but adds extra complexity to the implementation.

In Section 5.1, we give some performance results indicating that
executing disk reads on the backup can cause some slightly reduced
throughput (1-4%) for real applications, but can also reduce the
logging bandwidth noticeably. Hence, executing disk reads on the backup
VM may be useful in cases where the bandwidth of the logging channel is
quite limited.
disk writes In this section, we do a basic evaluation of the performance
backup
of VMs. This
the primary mayhave
do not be thetocase because according
be delayed shared storage
to theis of VMware FT for a number of application workloads and
unavailable
Output Rule.or The
too expensive,
non-sharedordesign
because the servers
is quite useful running
in cases 5. PERFORMANCE
networking benchmarks. For EVALUATION these results, we run the pri-
the primary
where sharedand backup
storage VMsaccessible
is not are far apart
to the(“long-distance
primary and mary andsection,
In this backupwe VMs do aon identical
basic servers,
evaluation each
of the with eight
performance
FT”). One
backup VMs.disadvantage
This may be of the
the case
non-shared
becausedesign
sharedisstorage
that the is Intel
of Xeon 2.8
VMware FTGhz for CPUs
a number and 8ofGbytes of RAM.
application The servers
workloads and
two copies oforthe
unavailable toovirtual disksormust
expensive, be explicitly
because the servers synced
runningup are connected
networking via a 10 Gbit/s
benchmarks. crossover
For these network,
results, we run though
the pri-as
in some
the primary manner when fault
and backup VMstolerance is first
are far apart enabled. In
(“long-distance will beand
mary seen in all cases,
backup VMs on much less than
identical 1 Gbit/s
servers, each of withnetwork
eight
addition,
FT”). Onethe disks can of
disadvantage gettheout of sync after
non-shared designa is
failure,
that theso bandwidth
Intel Xeon 2.8 is used.
Ghz CPUs Both and servers accessoftheir
8 Gbytes RAM. shared virtual
The servers
two copies of the virtual disks must be explicitly synced up are connected via a 10 Gbit/s crossover network, though as
in some manner when fault tolerance is first enabled. In will be seen in all cases, much less than 1 Gbit/s of network
addition, the disks can get out of sync after a failure, so bandwidth is used. Both servers access their shared virtual
                        performance      logging
                        (FT / non-FT)    bandwidth
    SPECJbb2005             0.98         1.5 Mbits/sec
    Kernel Compile          0.95         3.0 Mbits/sec
    Oracle Swingbench       0.99          12 Mbits/sec
    MS-SQL DVD Store        0.94          18 Mbits/sec

          Table 1: Basic Performance Results

                       base    FT    logging bandwidth
    Receive (1Gb)       940    604          730
    Transmit (1Gb)      940    855           42
    Receive (10Gb)      940    860          990
    Transmit (10Gb)     940    935           60

Table 2: Performance of Network Transmit and Receive to a Client (all
in Mbit/s) for 1Gb and 10Gb Logging Channels
The applications that we evaluate in our performance results are as
follows. SPECJbb2005 is an industry-standard Java application benchmark
that is very CPU- and memory-intensive and does very little IO. Kernel
Compile is a workload that runs a compilation of the Linux kernel. This
workload does some disk reads and writes, and is very CPU- and
MMU-intensive, because of the creation and destruction of many
compilation processes. Oracle Swingbench is a workload in which an
Oracle 11g database is driven by the Swingbench OLTP (online
transaction processing) workload. This workload does substantial disk
and networking IO, and has eighty simultaneous database sessions.
MS-SQL DVD Store is a workload in which a Microsoft SQL Server 2005
database is driven by the DVD Store benchmark, which has sixteen
simultaneous clients.

5.1 Basic Performance Results

Table 1 gives basic performance results. For each of the applications
listed, the second column gives the ratio of the performance of the
application when FT is enabled on the VM running the server workload
vs. the performance when FT is not enabled on the same VM. The
performance ratios are calculated so that a value less than 1 indicates
that the FT workload is slower. Clearly, the overhead for enabling FT
on these representative workloads is less than 10%. SPECJbb2005 is
completely compute-bound and has no idle time, but performs well
because it has minimal non-deterministic events beyond timer
interrupts. The other workloads do disk IO and have some idle time, so
some of the FT overhead may be hidden by the fact that the FT VMs have
less idle time. However, the general conclusion is that VMware FT is
able to support fault-tolerant VMs with a quite low performance
overhead.

In the third column of the table, we give the average bandwidth of data
sent on the logging channel when these applications are run. For these
applications, the logging bandwidth is quite reasonable and easily
satisfied by a 1 Gbit/s network. In fact, the low bandwidth
requirements indicate that multiple FT workloads can share the same 1
Gbit/s network without any negative performance effects.

For VMs that run common guest operating systems like Linux and Windows,
we have found that the typical logging bandwidth while the guest OS is
idle is 0.5-1.5 Mbits/sec. The "idle" bandwidth is largely the result
of recording the delivery of timer interrupts. For a VM with an active
workload, the logging bandwidth is dominated by the network and disk
inputs that must be sent to the backup – the network packets that are
received and the disk blocks that are read from disk. Hence, the
logging bandwidth can be much higher than those measured in Table 1 for
applications that have very high network receive or disk read
bandwidth. For these kinds of applications, the bandwidth of the
logging channel could be a bottleneck, especially if there are other
uses of the logging channel.

The relatively low bandwidth needed over the logging channel for many
real applications makes replay-based fault tolerance very attractive
for a long-distance configuration using non-shared disks. For
long-distance configurations where the primary and backup might be
separated by 1-100 kilometers, optical fiber can easily support
bandwidths of 100-1000 Mbit/s with latencies of less than 10
milliseconds. For the applications in Table 1, a bandwidth of 100-1000
Mbit/s should be sufficient for good performance. Note, however, that
the extra round-trip latency between the primary and backup may cause
network and disk outputs to be delayed by up to 20 milliseconds. The
long-distance configuration will only be appropriate for applications
whose clients can tolerate such an additional latency on each request.

For the two most disk-intensive applications, we have measured the
performance impact of executing disk reads on the backup VM (as
described in Section 4.2) vs. sending disk read data over the logging
channel. For Oracle Swingbench, throughput is about 4% lower when
executing disk reads on the backup VM; for MS-SQL DVD Store, throughput
is about 1% lower. Meanwhile, the logging bandwidth is decreased from
12 Mbits/sec to 3 Mbits/sec for Oracle Swingbench, and from 18
Mbits/sec to 8 Mbits/sec for MS-SQL DVD Store. Clearly, the bandwidth
savings could be much greater for applications with much greater disk
read bandwidth. As mentioned in Section 4.2, it is expected that the
performance might be somewhat worse when disk reads are executed on the
backup VM. However, for cases where the bandwidth of the logging
channel is limited (for example, a long-distance configuration),
executing disk reads on the backup VM may be useful.

5.2 Network Benchmarks

Networking benchmarks can be quite challenging for our system for a
number of reasons. First, high-speed networking can have a very high
interrupt rate, which requires the logging and replaying of
asynchronous events at a very high rate. Second, benchmarks that
receive packets at a high rate will cause a high rate of logging
traffic, since all such packets must be sent to the backup via the
logging channel. Third, benchmarks that send packets will be subject to
the Output Rule, which delays the sending of network packets until the
appropriate acknowledgment from the backup is received. This delay will
increase the measured latency to a client. This delay could also
decrease network bandwidth to a client, since network protocols (such
as TCP) may have

37

37
to decrease the network transmission rate as the round-trip inputs and non-deterministic operations on a logging chan-
latency increases. nel. Like Bressoud, they do not appear to focus on detecting
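The bookkeeping behind the Output Rule can be sketched with a small simulation of the primary's side. This is our illustration only; the class and method names (`OutputRulePrimary` and so on) are invented and are not VMware FT's actual interfaces:

```python
from collections import deque

class OutputRulePrimary:
    """Sketch of primary-side Output Rule state: an outgoing packet may be
    released only after the backup has acknowledged every log entry that
    was generated before that packet was produced."""

    def __init__(self):
        self.next_seq = 0     # sequence number of the next log entry
        self.acked = -1       # highest log entry acknowledged by the backup
        self.held = deque()   # (prerequisite_seq, packet) pairs awaiting release

    def log_event(self):
        """Record one non-deterministic event on the logging channel."""
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def output_packet(self, packet):
        """Queue an outgoing packet; the Output Rule holds it until the
        last log entry preceding it has been acknowledged."""
        self.held.append((self.next_seq - 1, packet))
        return self._release()

    def ack_from_backup(self, seq):
        """Backup acknowledges receipt of all log entries up to seq."""
        self.acked = max(self.acked, seq)
        return self._release()

    def _release(self):
        """Release every held packet whose prerequisite entry is acked."""
        released = []
        while self.held and self.held[0][0] <= self.acked:
            released.append(self.held.popleft()[1])
        return released

primary = OutputRulePrimary()
primary.log_event()                                # entry 0, e.g. an interrupt
assert primary.output_packet("reply") == []        # held: entry 0 not yet acked
assert primary.ack_from_backup(0) == ["reply"]     # the ack releases the packet
```

Note that only the output is held back: the primary VM keeps executing while the packet waits for the acknowledgment, which is why the cost appears as added round-trip latency rather than stalled execution.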
Table 2 gives our results for a number of measurements made by the standard netperf benchmark. In all these measurements, the client VM and primary VM are connected via a 1 Gbit/s network. The first two rows give send and receive performance when the primary and backup hosts are connected by a 1 Gbit/s logging channel. The third and fourth rows give the send and receive performance when the primary and backup servers are connected by a 10 Gbit/s logging channel, which not only has higher bandwidth, but also lower latency than the 1 Gbit/s network. As a rough measure, the ping time between hypervisors for the 1 Gbit/s connection is about 150 microseconds, while the ping time for a 10 Gbit/s connection is about 90 microseconds.

When FT is not enabled, the primary VM can achieve close (940 Mbit/s) to the 1 Gbit/s line rate for transmits and receives. When FT is enabled for receive workloads, the logging bandwidth is very large, since all incoming network packets must be sent on the logging channel. The logging channel can therefore become a bottleneck, as shown by the results for the 1 Gbit/s logging network. The effect is much less for the 10 Gbit/s logging network. When FT is enabled for transmit workloads, the data of the transmitted packets is not logged, but network interrupts must still be logged. The logging bandwidth is much lower, so the achievable network transmit bandwidths are higher than the network receive bandwidths. Overall, we see that FT can limit network bandwidths significantly at very high transmit and receive rates, but high absolute rates are still achievable.

6. RELATED WORK
Bressoud and Schneider [3] described the initial idea of implementing fault tolerance for virtual machines via software contained completely at the hypervisor level. They demonstrated the feasibility of keeping a backup virtual machine in sync with a primary virtual machine via a prototype for servers with HP PA-RISC processors. However, due to limitations of the PA-RISC architecture, they could not implement fully secure, isolated virtual machines. Also, they did not implement any method of failure detection or attempt to address any of the practical issues described in Section 3. More importantly, they imposed a number of constraints on their FT protocol that were unnecessary. First, they imposed a notion of epochs, where asynchronous events are delayed until the end of a set interval. The notion of an epoch is unnecessary – they may have imposed it because they could not replay individual asynchronous events efficiently enough. Second, they required that the primary VM stop execution essentially until the backup has received and acknowledged all previous log entries. However, only the output itself (such as a network packet) must be delayed – the primary VM itself may continue executing.

Bressoud [4] describes a system that implements fault tolerance in the operating system (Unixware), and therefore provides fault tolerance for all applications that run on that operating system. The system call interface becomes the set of operations that must be replicated deterministically. This work has similar limitations and design choices as the hypervisor-based work.

Napper et al. [9] and Friedman and Kama [7] describe implementations of fault-tolerant Java virtual machines. They follow a similar design to ours in sending information about inputs and non-deterministic operations on a logging channel. Like Bressoud, they do not appear to focus on detecting failure and re-establishing fault tolerance after a failure. In addition, their implementation is limited to providing fault tolerance for applications that run in a Java virtual machine. These systems attempt to deal with issues of multi-threaded Java applications, but require either that all data is correctly protected by locks or enforce a serialization on access to shared memory.

Dunlap et al. [6] describe an implementation of deterministic replay targeted towards debugging application software on a paravirtualized system. Our work supports arbitrary operating systems running inside virtual machines and implements fault tolerance support for these VMs, which requires much higher levels of stability and performance.

Cully et al. [5] describe an alternative approach for supporting fault-tolerant VMs and its implementation in a project called Remus. With this approach, the state of a primary VM is repeatedly checkpointed during execution and sent to a backup server, which collects the checkpoint information. The checkpoints must be executed very frequently (many times per second), since external outputs must be delayed until a following checkpoint has been sent and acknowledged. The advantage of this approach is that it applies equally well to uni-processor and multi-processor VMs. The main issue is that this approach has very high network bandwidth requirements to send the incremental changes to memory state at each checkpoint. The results for Remus presented in [5] show 100% to 225% slowdown for kernel compile and SPECweb benchmarks, when attempting to do 40 checkpoints per second using a 1 Gbit/s network connection for transmitting changes in memory state. There are a number of optimizations that may be useful in decreasing the required network bandwidth, but it is not clear that reasonable performance can be achieved with a 1 Gbit/s connection. In contrast, our approach based on deterministic replay can achieve less than 10% overhead, with less than 20 Mbit/s bandwidth required between the primary and backup hosts for several real applications.
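The bandwidth contrast between checkpointing and deterministic replay can be made concrete with back-of-the-envelope arithmetic. Only the 40 checkpoints/s, 1 Gbit/s, and ~20 Mbit/s figures come from the text; the amount of memory dirtied per checkpoint is our illustrative assumption:

```python
def checkpoint_traffic_mbps(checkpoints_per_sec, dirty_mbytes_per_checkpoint):
    """Network traffic (Mbit/s) needed to ship the incremental memory
    changes for each checkpoint; *8 converts megabytes to megabits."""
    return checkpoints_per_sec * dirty_mbytes_per_checkpoint * 8

# Even a modest, assumed 4 MB of dirtied memory per checkpoint, at the
# 40 checkpoints/s cited for Remus, already exceeds a 1 Gbit/s link:
assert checkpoint_traffic_mbps(40, 4) == 1280   # Mbit/s, vs. a 1000 Mbit/s link

# A replay-based logging channel at ~20 Mbit/s is then roughly two orders
# of magnitude cheaper for such a workload:
assert checkpoint_traffic_mbps(40, 4) / 20 == 64.0
```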

7. CONCLUSION AND FUTURE WORK

We have designed and implemented an efficient and complete system in VMware vSphere that provides fault tolerance (FT) for virtual machines running on servers in a cluster. Our design is based on replicating the execution of a primary VM via a backup VM on another host using VMware deterministic replay. If the server running the primary VM fails, the backup VM takes over immediately with no interruption or loss of data.

Overall, the performance of fault-tolerant VMs under VMware FT on commodity hardware is excellent, and shows less than 10% overhead for some typical applications. Most of the performance cost of VMware FT comes from the overhead of using VMware deterministic replay to keep the primary and backup VMs in sync. The low overhead of VMware FT therefore derives from the efficiency of VMware deterministic replay. In addition, the logging bandwidth required to keep the primary and backup in sync is typically quite small, often less than 20 Mbit/s. Because the logging bandwidth is quite small in most cases, it seems feasible to implement configurations where the primary and backup VMs are separated by long distances (1-100 kilometers). Thus, VMware FT could be used in scenarios that also protect against disasters in which entire sites fail. It is worthwhile to note that the log stream is typically quite compressible, and simple compression techniques can decrease the logging bandwidth significantly with a small amount of extra CPU overhead.
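The compressibility of such a log stream is easy to demonstrate in miniature: entries for recurring events like timer interrupts share most of their bytes, so even a low-effort general-purpose compressor shrinks them substantially. A sketch with synthetic entries (the record format here is invented for illustration, not VMware FT's actual log format):

```python
import zlib

# Synthetic stand-in for a replay log: repetitive records marking
# timer-interrupt delivery points (field names are ours, for illustration).
log = b"".join(
    b"TIMER_IRQ seq=%08d brcnt=%016d\n" % (i, i * 12345) for i in range(2000)
)

compressed = zlib.compress(log, level=1)   # lowest effort: small CPU cost
ratio = len(compressed) / len(log)
assert ratio < 0.5   # highly repetitive entries compress well
```

Compression level 1 is deliberately chosen in this sketch to mirror the trade-off described above: a large bandwidth reduction for a small amount of extra CPU.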
Our results with VMware FT have shown that an efficient implementation of fault-tolerant VMs can be built upon deterministic replay. Such a system can transparently provide fault tolerance for VMs running any operating systems and applications with minimal overhead. However, for a system of fault-tolerant VMs to be useful for customers, it must also be robust, easy-to-use, and highly automated. A usable system requires many other components beyond replicated execution of VMs. In particular, VMware FT automatically restores redundancy after a failure, by finding an appropriate server in the local cluster and creating a new backup VM on that server. By addressing all the necessary issues, we have demonstrated a system that is usable for real applications in customers' datacenters.

One of the tradeoffs with implementing fault tolerance via deterministic replay is that currently deterministic replay has only been implemented efficiently for uni-processor VMs. However, uni-processor VMs are more than sufficient for a wide variety of workloads, especially since physical processors are continually getting more powerful. In addition, many workloads can be scaled out by using many uni-processor VMs instead of scaling up by using one larger multi-processor VM. High-performance replay for multi-processor VMs is an active area of research, and can potentially be enabled with some extra hardware support in microprocessors. One interesting direction might be to extend transactional memory models to facilitate multi-processor replay.

In the future, we are also interested in extending our system to deal with partial hardware failure. By partial hardware failure, we mean a partial loss of functionality or redundancy in a server that doesn't cause corruption or loss of data. An example would be the loss of all network connectivity to the VM, or the loss of a redundant power supply in the physical server. If a partial hardware failure occurs on a server running a primary VM, in many cases (but not all) it would be advantageous to fail over to the backup VM immediately. Such a failover could immediately restore full service for a critical VM, and ensure that the VM is quickly moved off of a potentially unreliable server.

Acknowledgments

We would like to thank Krishna Raja, who generated many of the performance results. There were numerous people involved in the implementation of VMware FT. Core implementors of deterministic replay (including support for a variety of virtual devices) and the base FT functionality included Lan Huang, Eric Lowe, Slava Malyugin, Alex Mirgorodskiy, Kaustubh Patil, Boris Weissman, Petr Vandrovec, and Min Xu. In addition, there are many other people involved in the higher-level management of FT in VMware vCenter. Karyn Ritter did an excellent job managing much of the work.

8. REFERENCES

[1] Alsberg, P., and Day, J. A Principle for Resilient Sharing of Distributed Resources. In Proceedings of the Second International Conference on Software Engineering (1976), pp. 627–644.

[2] AMD Corporation. AMD64 Architecture Programmer's Manual. Sunnyvale, CA.

[3] Bressoud, T., and Schneider, F. Hypervisor-based Fault Tolerance. In Proceedings of SOSP 15 (Dec. 1995).

[4] Bressoud, T. C. TFT: A Software System for Application-Transparent Fault Tolerance. In Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (June 1998), pp. 128–137.

[5] Cully, B., Lefebvre, G., Meyer, D., Feeley, M., Hutchinson, N., and Warfield, A. Remus: High Availability via Asynchronous Virtual Machine Replication. In Proceedings of the Fifth USENIX Symposium on Networked Systems Design and Implementation (Apr. 2008), pp. 161–174.

[6] Dunlap, G. W., King, S. T., Cinar, S., Basrai, M., and Chen, P. M. ReVirt: Enabling Intrusion Analysis through Virtual Machine Logging and Replay. In Proceedings of the 2002 Symposium on Operating Systems Design and Implementation (Dec. 2002).

[7] Friedman, R., and Kama, A. Transparent Fault-Tolerant Java Virtual Machine. In Proceedings of Reliable Distributed Systems (Oct. 2003), pp. 319–328.

[8] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manuals. Santa Clara, CA.

[9] Napper, J., Alvisi, L., and Vin, H. A Fault-Tolerant Java Virtual Machine. In Proceedings of the International Conference on Dependable Systems and Networks (June 2002), pp. 425–434.

[10] Nelson, M., Lim, B.-H., and Hutchins, G. Fast Transparent Migration for Virtual Machines. In Proceedings of the 2005 Annual USENIX Technical Conference (Apr. 2005).

[11] Nightingale, E. B., Veeraraghavan, K., Chen, P. M., and Flinn, J. Rethink the Sync. In Proceedings of the 2006 Symposium on Operating Systems Design and Implementation (Nov. 2006).

[12] Schlichting, R., and Schneider, F. B. Fail-stop Processors: An Approach to Designing Fault-tolerant Computing Systems. ACM Transactions on Computer Systems 1, 3 (Aug. 1983), 222–238.

[13] Schneider, F. B. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22, 4 (Dec. 1990), 299–319.

[14] Stratus Technologies. Benefit from Stratus Continuous Processing Technology: Automatic 99.999% Uptime for Microsoft Windows Server Environments. At http://www.stratus.com/~/media/Stratus/Files/Resources/WhitePapers/continuous-processing-for-windows.pdf, June 2009.

[15] Xu, M., Malyugin, V., Sheldon, J., Venkitachalam, G., and Weissman, B. ReTrace: Collecting Execution Traces with Virtual Machine Deterministic Replay. In Proceedings of the 2007 Workshop on Modeling, Benchmarking, and Simulation (June 2007).