Unit 4 Part 2

Uploaded by

menakababu
CS 3551 DISTRIBUTED COMPUTING
Checkpoint
Example (ATM withdrawal): A/C balance = 20000 → ATM PIN entry → Amount = 10000 → Update balance = 10000 → Cash dispense

Checkpoint in Distributed System
What is Domino Effect?
● To see why rollback propagation occurs, consider the situation
where the sender of a message m rolls back to a state that
precedes the sending of m.
● The receiver of m must also roll back to a state that precedes m’s
receipt; otherwise, the states of the two processes would be
inconsistent because they would show that message m was
received without being sent, which is impossible in any correct
failure-free execution.
● This phenomenon of cascaded rollback is called the domino
effect.
● In some situations, rollback propagation may extend back to the
initial state of the computation, losing all the work performed
before the failure.
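The cascade described above can be traced with a small simulation. This is only a sketch under assumptions that are not in the slides: the function name `recovery_line`, the process names `P` and `Q`, and the use of per-process local event times are all hypothetical. A restored checkpoint at time `c` is taken to contain every local event with time at most `c`, and time 0 is the initial state.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages, failed):
    """Return, per process, the checkpoint time it ends up restoring.

    checkpoints: {process: sorted list of local checkpoint times, starting at 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    failed:      the process that crashes and restarts from its latest checkpoint
    """
    # Processes that did not fail keep their current state (infinity = no rollback yet).
    restored = {p: float("inf") for p in checkpoints}
    restored[failed] = checkpoints[failed][-1]

    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            send_undone = restored[sender] < send_t
            recv_kept = restored[receiver] >= recv_t
            if send_undone and recv_kept:
                # Message received without being sent: the receiver must roll
                # back to its latest checkpoint taken strictly before the receipt.
                cps = checkpoints[receiver]
                restored[receiver] = cps[bisect_left(cps, recv_t) - 1]
                changed = True
    return restored


if __name__ == "__main__":
    checkpoints = {"P": [0, 3, 7], "Q": [0, 5]}
    messages = [
        ("Q", 1, "P", 2),
        ("P", 4, "Q", 4),
        ("Q", 6, "P", 6),
        ("P", 8, "Q", 9),
    ]
    # With uncoordinated checkpoints, P's failure drags both processes to time 0.
    print(recovery_line(checkpoints, messages, failed="P"))
```

In this hand-built example every checkpoint is "crossed" by a message, so each rollback undoes a send that forces another rollback, until both processes reach their initial states.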
Domino effect continued…
● Independent or uncoordinated checkpointing: If each participating process
takes its checkpoints independently, then the system is susceptible to the
domino effect.
How to avoid domino effect?
● Coordinated checkpointing:
○ Processes coordinate their checkpoints to form a system-wide consistent state.
○ In case of a process failure, the system state can be restored to such a consistent set of
checkpoints, preventing the rollback propagation.
● Communication-induced checkpointing:
○ Forces each process to take checkpoints based on information piggybacked on the application
messages it receives from other processes.
○ Checkpoints are taken such that a system-wide consistent state always exists on stable
storage, thereby avoiding the domino effect.
● Log-based rollback recovery:
○ Combines checkpointing with logging of non-deterministic events.
○ Log-based rollback recovery relies on the piecewise deterministic (PWD) assumption, which
postulates that all non-deterministic events that a process executes can be identified and that
the information necessary to replay each event during recovery can be logged in the event’s
determinant.
○ By logging and replaying the non-deterministic events in their exact original order, a process can
deterministically recreate its pre-failure state even if this state has not been checkpointed.
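Log-based recovery under the PWD assumption can be illustrated with a toy process. This is a minimal sketch, not a real protocol: the class `LoggedProcess` and its methods are hypothetical names, the only non-deterministic event modelled is the arrival of a message value, and plain Python attributes stand in for stable storage.

```python
class LoggedProcess:
    """Toy process whose state depends on the order in which values arrive."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0   # stands in for a checkpoint on stable storage
        self.log = []         # determinants logged since the last checkpoint

    def _apply(self, value):
        # Deterministic, order-sensitive state transition.
        self.state = self.state * 2 + value

    def deliver(self, value):
        # Log the determinant of the non-deterministic event, then execute it.
        self.log.append(value)
        self._apply(value)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # events before the checkpoint no longer need replay

    def recover(self):
        # Restart from the checkpoint and replay logged events in original order.
        self.state = self.checkpoint
        for value in self.log:
            self._apply(value)


if __name__ == "__main__":
    p = LoggedProcess()
    p.deliver(3)
    p.take_checkpoint()
    p.deliver(1)
    p.deliver(4)
    before = p.state
    p.state = None            # simulate losing the volatile state in a crash
    p.recover()
    print(before, p.state)
```

The state reached after the checkpoint was never checkpointed itself, yet replaying the log in the same order reproduces it; replaying in a different order would give a different state, which is why the original order matters.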
Key Points
● Rollback recovery treats a distributed system application as a
collection of processes that communicate over a network.
● It achieves fault tolerance by periodically saving the state of a
process during the failure-free execution, enabling it to restart
from a saved state upon a failure to reduce the amount of lost
work.
● The saved state is called a checkpoint, and the procedure of
restarting from a previously checkpointed state is called rollback
recovery.
● A checkpoint can be saved on either the stable storage or the
volatile storage depending on the failure scenarios to be tolerated.
● Challenges for Recovery:
○ Messages exchanged during failure-free operation create inter-process dependencies.
On a failure of one or more processes in a system, these dependencies may force
some of the processes that did not fail to roll back, creating what is commonly
called rollback propagation.
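The idea of periodic checkpoints limiting lost work can be sketched for a single process. Everything here is illustrative: `CheckpointedCounter`, `run`, and the parameters `interval` and `fail_at` are hypothetical, and an in-memory copy stands in for the checkpoint storage.

```python
import copy


class CheckpointedCounter:
    """Single process that accumulates a running total step by step."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self.saved = copy.deepcopy(self.state)   # initial state acts as checkpoint 0

    def work(self):
        self.state["step"] += 1
        self.state["total"] += self.state["step"]

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self.saved)


def run(total_steps, interval, fail_at):
    """Run to total_steps with one failure after step fail_at.

    Returns (final_state, number_of_steps_that_had_to_be_redone).
    """
    p = CheckpointedCounter()
    redone = 0
    failed = False
    while p.state["step"] < total_steps:
        p.work()
        if p.state["step"] % interval == 0:
            p.checkpoint()
        if not failed and p.state["step"] == fail_at:
            failed = True
            redone += p.state["step"] - p.saved["step"]   # work lost by the failure
            p.rollback()
    return p.state, redone


if __name__ == "__main__":
    print(run(10, 4, 7))     # checkpoint every 4 steps
    print(run(10, 100, 7))   # effectively no checkpoints
```

Both runs reach the same final state, but the checkpointed run repeats only the steps since the last checkpoint, while the run without checkpoints repeats everything done before the failure.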
Background and Definitions
1. System Model
2. Local Checkpoint
3. Consistent system states
4. Interactions with the outside world
5. Different types of messages
1. System Model

● A distributed system consists of a fixed number of processes, P1, P2, ..., PN,
which communicate only through messages.
● Processes cooperate to execute a distributed application and interact
with the outside world by receiving and sending input and output
messages, respectively.
● Some protocols assume that the communication subsystem delivers
messages reliably, in first-in-first-out (FIFO) order, while other
protocols assume that the communication subsystem can lose,
duplicate, or reorder messages.
● A system recovers correctly if its internal state is consistent with the
observable behavior of the system before the failure.
2. Local Checkpoint (at each process level)
1. A local checkpoint is a snapshot of the state of the process
at a given instant, and the event of recording the state of a
process is called local checkpointing.
2. The contents of a checkpoint depend upon the application
context and the checkpointing method being used.
3. Depending upon the checkpointing method used, a process may
keep several local checkpoints or just a single checkpoint at any time.
4. We assume that a process stores all local checkpoints on the stable storage so that
they are available even if the process crashes.
5. We also assume that a process is able to roll back to any of its
existing local checkpoints and thus restore to and restart from
the corresponding state
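Storing several local checkpoints so that they survive a crash might look like the following. This is a sketch under assumptions: a local directory stands in for stable storage, `pickle` is used for the snapshot, and the helper names and the `checkpoint_<index>.pkl` file layout are hypothetical. Writing to a temporary file and renaming it with `os.replace` avoids leaving a half-written checkpoint behind if the process dies mid-write.

```python
import os
import pickle
import tempfile


def checkpoint_path(directory, index):
    # Hypothetical naming scheme: one file per local checkpoint.
    return os.path.join(directory, f"checkpoint_{index}.pkl")


def save_checkpoint(state, directory, index):
    """Write the snapshot to a temp file, flush it to disk, then rename atomically."""
    path = checkpoint_path(directory, index)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    return path


def load_checkpoint(directory, index):
    """Restore the state recorded by any existing local checkpoint."""
    with open(checkpoint_path(directory, index), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as storage:
        save_checkpoint({"balance": 20000}, storage, 0)
        save_checkpoint({"balance": 10000}, storage, 1)
        # Roll back to the earlier checkpoint.
        print(load_checkpoint(storage, 0))
```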
3. Consistent vs Inconsistent System States
4. Interactions with Outside World (OWP)
● A printer cannot roll back the effects of printing a character, and an
automatic teller machine cannot recover the money that it
dispensed to a customer.
● A distributed application often interacts with the outside world to
receive input data or deliver the outcome of a computation. If a
failure occurs, the outside world cannot be expected to roll back.
● It is therefore necessary that the outside world see a consistent behavior
of the system despite failures.
● Output commit: before sending output to the OWP, the system must
ensure that the state from which the output is sent will be
recovered despite any future failure.
● Input messages :
○ Received messages from the OWP may not be reproducible during recovery,
because it may not be possible for the outside world to regenerate them.
○ Thus, recovery protocols must arrange to save these input messages so that
they can be retrieved when needed for execution replay after a failure.
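The output commit rule above can be sketched as "save first, release second". The class `OutputCommitProcess` and its fields are hypothetical, and an in-memory attribute stands in for stable storage; input logging is omitted to keep the example short.

```python
class OutputCommitProcess:
    """Holds outputs back until the state that produced them is recoverable."""

    def __init__(self):
        self.state = 0
        self.stable_checkpoint = 0   # stands in for stable storage
        self.pending = []            # outputs produced but not yet released
        self.released = []           # what the outside world has actually seen

    def compute(self, amount):
        self.state += amount
        self.pending.append(f"balance={self.state}")

    def commit_outputs(self):
        # Make the current state recoverable, then release the outputs.
        self.stable_checkpoint = self.state
        self.released.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self):
        # Volatile state and unreleased outputs are lost; released ones are not.
        self.state = self.stable_checkpoint
        self.pending.clear()


if __name__ == "__main__":
    p = OutputCommitProcess()
    p.compute(10)
    p.commit_outputs()
    p.compute(5)             # output for this step is still pending
    p.crash_and_recover()
    print(p.state, p.released)
```

Because nothing reaches the outside world before the commit, the recovered state never contradicts an output that was already seen.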
Types of Messages
Key Points
1. In-transit Messages (m1, m2)
a. Messages that have been sent but not yet received.
b. When in-transit messages are part of a global system state, these messages do
not cause any inconsistency.
c. For reliable communication channels, a consistent state must include in-transit
messages because they will always be delivered to their destinations in any legal
execution of the system.
d. On the other hand, if a system model assumes lossy communication
channels, then in-transit messages can be omitted from the system state.
2. Lost Messages (m1)
a. Messages whose send is not undone but whose receive is undone due to rollback are
called lost messages.
b. This type of message occurs when the receiving process rolls back to a checkpoint
prior to the reception of the message while the sender does not roll back beyond
the send operation of the message.
Key Points (continued)
3. Delayed Messages (m2, m5)
a. Messages whose receive is not recorded, because the receiving process was
either down or the message arrived after the rollback of the receiving process,
are called delayed messages.
4. Orphan Messages
a. Messages with receive recorded but message send not recorded are called
orphan messages.
b. For example, a rollback might have undone the send of such messages, leaving
the receive event intact at the receiving process.
c. Orphan messages do not arise if processes roll back to a consistent global state.
5. Duplicate Messages (m4, m5)
a. Duplicate messages arise due to message logging and replaying during process
recovery.
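Whether a message is in transit or an orphan with respect to a set of restored checkpoints follows directly from which of its two events are recorded. The sketch below uses hypothetical names (`classify`, `is_consistent`, processes `P` and `Q`) and assumes a cut maps each process to a local time such that every event at or before that time is recorded. It does not model delayed or duplicate messages, which depend on when messages arrive during recovery. After a rollback, a message reported here as in-transit whose receive was undone is what the slides call a lost message.

```python
def classify(message, cut):
    """Classify one message relative to a cut {process: restored local time}."""
    sender, send_t, receiver, recv_t = message
    sent = send_t <= cut[sender]          # send event is recorded
    received = recv_t <= cut[receiver]    # receive event is recorded
    if sent and received:
        return "delivered"
    if sent:
        return "in-transit"               # sent but receive not recorded
    if received:
        return "orphan"                   # receive recorded without a send
    return "not-sent"


def is_consistent(messages, cut):
    """A global state is consistent when it contains no orphan messages."""
    return all(classify(m, cut) != "orphan" for m in messages)


if __name__ == "__main__":
    messages = [("P", 1, "Q", 2), ("P", 5, "Q", 6), ("Q", 4, "P", 7)]
    cut = {"P": 3, "Q": 6}
    for m in messages:
        print(m, classify(m, cut))
    print(is_consistent(messages, cut))
```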
Issues in Failure Recovery