Fault Tolerant Systems
Checkpointing and
Rollback Recovery Protocols
Introduction
• Rollback recovery treats a distributed system as a
collection of processes that communicate through a
network
• Fault tolerance is achieved by periodically using stable
storage to save the processes’ states during the
failure-free execution.
• Upon a failure, a failed process restarts from one of its
saved states, thereby reducing the amount of lost
computation.
• Each of the saved states is called a checkpoint
Different Rollback Recovery Schemes
• Checkpoint-based:
  – Uncoordinated checkpointing
  – Coordinated checkpointing
  – Communication-induced checkpointing
• Log-based:
  – Pessimistic logging
  – Optimistic logging
  – Causal logging
Checkpoint based Recovery: Overview
• Uncoordinated checkpointing: Each process takes its
checkpoints independently
• Coordinated checkpointing: Processes coordinate their
checkpoints in order to save a system-wide consistent
state. This consistent set of checkpoints can be used
to bound the rollback
• Communication-induced checkpointing: It forces each
process to take checkpoints based on information
piggybacked on the application messages it receives
from other processes.
System Model
• The system consists of a fixed number (N) of processes
which communicate only through messages.
• Processes cooperate to execute a distributed
application program and interact with the outside world by
receiving and sending input and output messages,
respectively.
[Figure: processes P0, P1 and P2 in a message-passing system exchanging messages m1 and m2, receiving input messages from and sending output messages to the outside world]
Consistent System State
• A consistent system state is one in which if a
process’s state reflects a message receipt, then the
state of the corresponding sender reflects sending
that message.
• A fundamental goal of any rollback-recovery protocol
is to bring the system into a consistent state when
inconsistencies occur because of a failure.
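• A minimal sketch of this consistency test, using an assumed representation in which each process's saved state records the identifiers of the messages it has sent and received:

    def is_consistent(cut):
        """cut: dict pid -> {"sent": set of msg ids, "received": set of msg ids}."""
        all_sent = set()
        for state in cut.values():
            all_sent |= state["sent"]
        for state in cut.values():
            for m in state["received"]:
                if m not in all_sent:      # a receipt with no matching send
                    return False           # m is an orphan message
        return True

    # P1's state records receiving m2, but no state records sending it,
    # so m2 is an orphan message and the cut is inconsistent.
    cut = {"P0": {"sent": {"m1"}, "received": set()},
           "P1": {"sent": set(),  "received": {"m1", "m2"}},
           "P2": {"sent": set(),  "received": set()}}
    print(is_consistent(cut))              # False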
Example
[Figure: two cuts of processes P0, P1 and P2 exchanging messages m1 and m2. In the consistent state, every recorded receipt has its send recorded as well; in the inconsistent state, the receipt of m2 is recorded but its send is not, so m2 becomes an orphan message]
Checkpointing protocols
• Each process periodically saves its state on stable storage.
• The saved state contains sufficient information to restart
process execution.
• A consistent global checkpoint is a set of N local
checkpoints, one from each process, forming a consistent
system state.
• Any consistent global checkpoint can be used to restart
process execution upon a failure.
• The most recent consistent global checkpoint is termed
the recovery line.
• In the uncoordinated checkpointing paradigm, the search
for a consistent state might lead to the domino effect.
Domino effect: example
[Figure: domino effect example — processes P0, P1 and P2 exchanging messages m0–m7 with several checkpoints each; cascading rollbacks push the recovery line back toward the start of the computation]
Domino Effect: Cascading rollback that causes the
system to roll back too far in the computation (even
to the beginning), in spite of all the checkpoints.
Interactions with outside world
• A message passing system often interacts with the
outside world to receive input data or show the
outcome of a computation. If a failure occurs, the
outside world cannot be relied on to roll back.
• For example, a printer cannot roll back the effects of
printing a character, and an automatic teller machine
cannot recover the money that it dispensed to a
customer.
• It is therefore necessary that the outside world
perceive a consistent behavior of the system despite
failures.
Interactions with outside world (contd.)
• Thus, before sending output to the outside world, the
system must ensure that the state from which the
output is sent will be recovered despite any future
failure.
• Similarly, input messages from the outside world may
not be regenerated; thus the recovery protocols must
arrange to save these input messages so that they
can be retrieved when needed.
Garbage Collection
• Checkpoints and event logs consume storage
resources.
• As the application progresses and more recovery
information is collected, a subset of the stored
information may become useless for recovery.
• Garbage collection is the deletion of such useless
recovery information.
• A common approach to garbage collection is to
identify the recovery line and discard all information
relating to events that occurred before that line.
Checkpoint-Based Protocols
• Uncoordinated Checkpointing
– Allows each process maximum autonomy in
deciding when to take checkpoints
– Advantage: each process may take a checkpoint
when it is most convenient
– Disadvantages:
• Domino effect
• Possible useless checkpoints
• Need to maintain multiple checkpoints
• Garbage collection is needed
• Not suitable for applications with outside world
interaction (output commit)
Recovery Line Calculation
• The recovery line can be calculated from the checkpoint
schedule using either a rollback-dependency graph or
a checkpoint graph.
• Rollback Dependency Graph:
– Construct the graph
– Perform reachability analysis from the states of the failed
processes
– The most recent checkpoints that are unreachable from the
failed states form the recovery line.
Rollback Dependency Graph
[Figure: rollback-dependency graph for processes P0–P3 with checkpoints C00, C01, C10, C11, ...; the last states of the failed processes are the initially marked nodes, and the checkpoints not reachable from them form the recovery line]
Recovery Line Calculation with checkpoint graph
• Checkpoint graphs are very similar to rollback-
dependency graphs except that, when a message is sent
in I(i,x) and received in I(j,y), a directed edge is drawn
from C(i,x-1) to C(j,y) (instead of from C(i,x) to C(j,y))
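• A small sketch of the two edge rules (the data layout is assumed): each message is described by the interval it was sent in and the interval it was received in, where interval I(i,x) lies between C(i,x-1) and C(i,x).

    def build_edges(messages, kind="rollback-dependency"):
        """messages: list of ((i, x), (j, y)) pairs, meaning a message was
        sent in interval I(i,x) and received in interval I(j,y)."""
        edges = set()
        for (i, x), (j, y) in messages:
            if kind == "rollback-dependency":
                edges.add(((i, x), (j, y)))       # edge C(i,x)   -> C(j,y)
            else:                                 # checkpoint graph
                edges.add(((i, x - 1), (j, y)))   # edge C(i,x-1) -> C(j,y)
        return edges

    msgs = [((0, 1), (1, 1))]                     # a message from I(0,1) to I(1,1)
    print(build_edges(msgs))                      # {((0, 1), (1, 1))}  i.e. C01 -> C11
    print(build_edges(msgs, "checkpoint"))        # {((0, 0), (1, 1))}  i.e. C00 -> C11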
[Figure: the same two-process execution (checkpoints C00, C01 on P0 and C10, C11 on P1) drawn both as a rollback-dependency graph and as a checkpoint graph, showing the different edge placement]
The Rollback Propagation Algorithm
• Include the last checkpoint of each failed process
as an element in the set “RootSet”;
• Include the current state of each surviving process
as an element in “RootSet”;
• Mark all checkpoints reachable by following at least one
edge from any member of RootSet;
• While (at least one member of RootSet is marked)
– Replace each marked element in RootSet by the last
unmarked checkpoint of the same process;
– Mark all checkpoints reachable by following at least one edge
from any member of RootSet;
• End While
• RootSet is the recovery line
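• A minimal sketch of this algorithm over a checkpoint graph (the graph representation and the per-process bookkeeping are assumptions for illustration, not part of the protocol):

    def reachable(starts, edges):
        """All checkpoints reachable by following at least one edge from starts."""
        seen, stack = set(), list(starts)
        while stack:
            for nxt in edges.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def rollback_propagation(checkpoints, edges):
        """checkpoints[p]: ordered checkpoint ids of process p; the last entry is
        the current volatile state for a surviving process, or the last saved
        checkpoint for a failed one.  edges: checkpoint -> list of checkpoints.
        Returns the recovery line as {process: checkpoint}."""
        root = {p: cps[-1] for p, cps in checkpoints.items()}   # initial RootSet
        marked = reachable(root.values(), edges)
        while any(c in marked for c in root.values()):
            for p, cps in checkpoints.items():
                if root[p] in marked:
                    # last unmarked checkpoint of the same process
                    root[p] = next(c for c in reversed(cps) if c not in marked)
            marked = reachable(root.values(), edges)
        return root   # RootSet is the recovery line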
Checkpoint Graph
[Figure: checkpoint graph for processes P0–P3 with checkpoints C00, C01, C10, C11, ...; applying the rollback propagation algorithm yields the recovery line]
Garbage Collection
• Any checkpoint that precedes the recovery
lines for all possible combinations of process
failures can be garbage-collected.
• The garbage collection algorithm based on a
rollback dependency graph works as follows:
– Mark all volatile checkpoints and remove all edges
ending in a marked checkpoint, producing a non-
volatile rollback dependency graph.
– Use reachability analysis to determine the worst-
case recovery line for this graph, called the global
recovery line.
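• A small sketch of the final discard step, assuming the worst-case (global) recovery line has already been computed, e.g. with the rollback-propagation routine above while treating every process as failed:

    def obsolete_checkpoints(checkpoints, global_recovery_line):
        """checkpoints[p]: ordered checkpoint ids of process p.
        global_recovery_line[p]: p's checkpoint on the global recovery line.
        Returns, per process, the checkpoints that may be garbage-collected."""
        obsolete = {}
        for p, cps in checkpoints.items():
            cut = cps.index(global_recovery_line[p])
            obsolete[p] = cps[:cut]            # everything strictly before the line
        return obsolete

    cps  = {"P0": ["C00", "C01"], "P1": ["C10", "C11"]}
    line = {"P0": "C01", "P1": "C11"}
    print(obsolete_checkpoints(cps, line))     # {'P0': ['C00'], 'P1': ['C10']}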
Example
[Figure: rollback-dependency graph for processes P0–P3 showing the global recovery line; the checkpoints that precede it (e.g. C00, C10) are obsolete and can be garbage-collected]
Coordinated Checkpointing
• Coordinated checkpointing requires processes to
orchestrate their checkpoints in order to form a consistent
global state.
• It simplifies recovery and is not susceptible to the domino
effect, since every process always restarts from its most
recent checkpoint.
• Only one checkpoint per process needs to be maintained,
hence lower storage overhead.
• No need for garbage collection.
• The disadvantage is that a large latency is involved in
committing output, since a global checkpoint is needed
before output can be committed to the outside world.
Different Coordinated Checkpointing Schemes
• Blocking coordinated checkpointing
• Non-blocking coordinated checkpointing
Blocking Coordinated Checkpointing
• Phase 1: A coordinator takes a checkpoint and broadcasts a
request message to all processes, asking them to take a
checkpoint.
• When a process receives this message, it stops its execution and
flushes all the communication channels, takes a tentative
checkpoint, and sends an acknowledgement back to the
coordinator.
• Phase 2: After the coordinator receives all the acknowledgements
from all processes, it broadcasts a commit message that
completes the two-phase checkpointing protocol.
• After receiving the commit message, all the processes remove
their old permanent checkpoint and make the tentative checkpoint
permanent.
• Disadvantage: large overhead due to the long blocking time
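• A toy, single-machine sketch of the two-phase exchange (real protocols use network messages, channel flushing and stable storage, all of which are stubbed out here):

    class Process:
        def __init__(self, pid):
            self.pid, self.tentative, self.permanent = pid, None, None

        def on_checkpoint_request(self, state):
            # stop execution and flush channels (omitted), then take a
            # tentative checkpoint and acknowledge
            self.tentative = state
            return "ack"

        def on_commit(self):
            # discard the old permanent checkpoint, promote the tentative one
            self.permanent, self.tentative = self.tentative, None

    def coordinate(coordinator, others, states):
        # Phase 1: coordinator checkpoints itself and broadcasts the request
        coordinator.on_checkpoint_request(states[coordinator.pid])
        acks = [p.on_checkpoint_request(states[p.pid]) for p in others]
        # Phase 2: all acknowledgements received -> broadcast commit
        if len(acks) == len(others):
            for p in [coordinator] + others:
                p.on_commit()

    procs = [Process(i) for i in range(3)]
    coordinate(procs[0], procs[1:], {0: "s0", 1: "s1", 2: "s2"})
    print([p.permanent for p in procs])        # ['s0', 's1', 's2']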
Non-blocking Checkpoint Coordination
• The objective of non-blocking checkpoint coordination
is to prevent a process from receiving
application messages that could make the
checkpoint inconsistent.
• General framework:
– Checkpoint coordinator / initiator broadcasts the
checkpointing message to every other node.
– Each node upon receiving this message should
take a checkpoint.
• However, this approach could lead to
checkpoint inconsistency
Non-blocking coordinated checkpointing
[Figure: three scenarios in which initiator P0 takes checkpoint C0,x and then sends message m to P1. Without precautions, P1 takes C1,x after delivering m, so m is orphaned and the checkpoints are inconsistent; with FIFO channels, the checkpoint request reaches P1 before m, so C1,x is taken first; with non-FIFO channels, the checkpoint request is piggybacked on m (dashed line) and forces C1,x before m is processed]
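• A minimal sketch of the receiver-side rule suggested by the figure (the message format and names are illustrative): the checkpoint request reaches a process before, or piggybacked on, any message sent after the initiator's checkpoint, so the receiver checkpoints before delivering such a message and m can never become an orphan.

    class Node:
        def __init__(self):
            self.ckpt_round = 0         # index of the last checkpoint taken
            self.delivered = []

        def on_request(self, rnd):
            # explicit request (FIFO channels) or piggybacked request (non-FIFO)
            if rnd > self.ckpt_round:
                self.ckpt_round = rnd   # take the checkpoint (stable write omitted)

        def on_message(self, payload, piggybacked_round=None):
            if piggybacked_round is not None:
                self.on_request(piggybacked_round)   # checkpoint before delivery
            self.delivered.append(payload)

    n = Node()
    n.on_message("m", piggybacked_round=1)   # forces C1,x before m is processed
    print(n.ckpt_round, n.delivered)         # 1 ['m']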
Synchronized Checkpoint Clocks
• Loosely synchronized clocks can facilitate checkpoint
coordination.
• Loosely synchronized clocks can trigger the local checkpointing
actions of all participating processes at approximately the same
time without a checkpoint initiator.
• A process takes a checkpoint and waits for a period that equals
the sum of the maximum deviation between clocks and the
maximum time to detect a failure in another process in the
system.
• The process can be assured that all checkpoints belonging to the
same coordination session have been taken without the
need to exchange any messages.
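• A tiny sketch of the resulting rule (the bounds are illustrative placeholders, not values from the slides):

    MAX_CLOCK_DEVIATION   = 0.050   # s, worst-case skew between any two clocks (assumed)
    MAX_FAILURE_DETECTION = 0.200   # s, worst-case time to detect a failed process (assumed)

    def silence_period():
        """How long a process waits after its clock-triggered checkpoint before it
        may assume every other process has checkpointed or been detected as failed,
        without exchanging any coordination messages."""
        return MAX_CLOCK_DEVIATION + MAX_FAILURE_DETECTION

    print(silence_period())   # 0.25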
Minimal checkpoint coordination
• Coordinated checkpointing requires all processes to
participate in every checkpoint. This approach is not
scalable.
• Basic Idea: Only those processes which
communicated with the initiator directly or indirectly
since the last checkpoint need to take new
checkpoints.
Minimal checkpoint coordination (contd.)
• Two Phase Protocol:
• Phase 1:
– Initiator sends a checkpoint request to all the processes which
communicated with it since the last checkpoint.
– Each process (Pi) which received this request forwards it to all
the processes which communicated with (Pi) since the last
checkpoint, and so on until no more processes can be
identified.
• Phase 2:
– All processes identified in the first phase take a checkpoint
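• A minimal sketch of the phase-1 expansion (the per-process communication record is an assumed data structure):

    def processes_to_checkpoint(initiator, comm_since_ckpt):
        """comm_since_ckpt[p]: processes that p exchanged messages with since its
        last checkpoint.  Returns the set that must checkpoint in phase 2."""
        need, frontier = {initiator}, [initiator]
        while frontier:
            p = frontier.pop()
            for q in comm_since_ckpt.get(p, set()):
                if q not in need:           # forward the checkpoint request to q
                    need.add(q)
                    frontier.append(q)
        return need

    comm = {"P0": {"P1"}, "P1": {"P0", "P2"}, "P2": {"P1"}, "P3": set()}
    print(processes_to_checkpoint("P0", comm))   # {'P0', 'P1', 'P2'} -- P3 is left out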
Communication-induced checkpointing
• Avoids the domino effect while allowing processes to
take some of their checkpoints independently.
• However, process independence is constrained to
guarantee the eventual progress of the recovery line,
and therefore processes may be forced to take
additional checkpoints.
• The checkpoints that a process takes independently are
called local checkpoints, while those that a process is
forced to take are called forced checkpoints.
Communication-induced checkpoint (contd.)
• Protocol-related information is piggybacked on the
application messages.
• The receiver of each application message uses the
piggybacked information to determine if it has to take a
forced checkpoint to advance the global recovery line.
• The forced checkpoint must be taken before the
application may process the contents of the message,
possibly incurring high latency and overhead.
• Therefore, reducing the number of forced checkpoints
is important.
• No special coordination messages are exchanged.
Communication-induced Checkpointing Schemes
• Model-based checkpointing
• Index-based coordination
Model based checkpointing
• Model-based checkpointing relies on preventing
patterns of communications and checkpoints that
could result in inconsistent states among the existing
checkpoints.
• A model is set up to detect the possibility that such
patterns could be forming within the system, according
to some heuristic.
• A checkpoint is usually forced to prevent the
undesirable patterns from occurring.
Index-based communication-induced checkpointing
• Index-based checkpointing works by assigning
monotonically increasing indexes to checkpoints, such
that the checkpoints having the same index at different
processes form a consistent state.
• The indices are piggybacked on application messages
to help receivers decide when they should force a
checkpoint.
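• A minimal sketch of one classic index-based rule (the class and message layout are illustrative): each process piggybacks its current index on outgoing messages, and a receiver forces a checkpoint before delivering a message that carries a larger index.

    class Process:
        def __init__(self):
            self.index = 0                 # index of the latest checkpoint

        def local_checkpoint(self):        # a checkpoint taken independently
            self.index += 1

        def send(self, payload):
            return (payload, self.index)   # piggyback the current index

        def receive(self, msg):
            payload, piggybacked = msg
            if piggybacked > self.index:   # forced checkpoint before delivery
                self.index = piggybacked
            return payload                 # now the message may be processed

    p, q = Process(), Process()
    p.local_checkpoint()                   # p moves to index 1
    q.receive(p.send("m"))                 # q is forced to checkpoint at index 1
    print(p.index, q.index)                # 1 1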
Log-based Recovery: Overview
• It combines checkpointing with logging of nondeterministic
events.
• It relies on the piecewise deterministic (PWD) assumption,
which postulates that all nondeterministic events that a
process executes can be identified and that the information
necessary to replay each event during recovery can be
logged in the event’s determinant.
• By logging and replaying the nondeterministic events in
their exact original order, a process can deterministically
recreate its pre-failure state even if this state has not been
checkpointed.
• Log-based rollback recovery is in general attractive for
applications that frequently interact with the outside world;
this interaction consists of input and output that can be
logged to stable storage.
Log-based recovery schemes
• Schemes differ in the way the determinants are logged into
the stable storage.
• Pessimistic Logging: The application has to block waiting
for the determinant of each nondeterministic event to be
stored on stable storage before the effects of that event can
be seen by other processes or the outside world. It
simplifies recovery but hurts the failure-free performance.
• Optimistic Logging: The application does not block, and the
determinants are spooled to stable storage
asynchronously. It reduces failure-free overhead, but
complicates recovery.
• Causal Logging: Low failure-free overhead and simpler
recovery are combined by striking a balance between
optimistic and pessimistic logging.
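• A minimal sketch of the pessimistic rule (a file stands in for stable storage; the determinant fields are illustrative): the determinant is forced to stable storage synchronously, before the message is delivered and before anything that depends on it can reach other processes or the outside world.

    import json, os

    LOG = "determinants.log"          # stands in for stable storage

    def log_determinant_sync(det):
        with open(LOG, "a") as f:
            f.write(json.dumps(det) + "\n")
            f.flush()
            os.fsync(f.fileno())      # block until the determinant is stable

    def deliver(msg, seq):
        # the determinant carries everything needed to replay this receive in order
        log_determinant_sync({"sender": msg["from"], "seq": seq, "data": msg["data"]})
        return msg["data"]            # only now may the application see the message

    deliver({"from": "P1", "data": "m2"}, seq=7)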