Fault Tolerance
Introduction to fault tolerance
Four types of fault tolerance:
Process resilience
Failure detection and reliable multicasting
Failure Detection
General Background
a. Basic concept
Failures can happen for a variety of reasons: hardware faults, software bugs, operator
errors, and network errors or outages.
A characteristic feature of distributed systems (DS) that distinguishes them from
single-machine systems is the notion of a partial failure.
b. Failure Models
c. Failure masking by redundancy
d. Reliable communication
There are two types of reliable communication:
Reliable request-reply communication
It is designed to support the roles and message exchanges in typical client-server
interactions.
There are five classes of failure in request-reply communication:
The client is unable to locate the server
The request message from the client to server is lost
The server crashes after receiving a request
The reply message from the server to client is lost
The client crashes after sending a request
Reliable group communication
Just as we considered reliable request-reply communication, we also need to consider
reliable multicasting services. Reliable group communication covers three
topics:
The basic reliable multicasting scheme
Scalability in reliable multicasting
Atomic multicast
Goal of Fault Tolerance
An overall goal in DS is to construct the system in such a way that it can automatically recover
from partial failures.
Figure page no 8
Fault tolerance is the property that enables the system to continue operating properly in the event
of failure.
Faults, errors, and failures
Figure
A system is said to be fault tolerant if it can provide its services even in the presence of
faults.
Fault tolerance requirements
A robust fault-tolerant system requires:
No single point of failure
Fault isolation
Availability of reversion modes
Figure
Failure Models
Figure
Failure masking by redundancy
The key technique for masking faults is to use redundancy.
With information redundancy, extra bits are added to allow recovery from garbled bits.
Figure
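The idea of adding extra bits can be illustrated with a triple repetition code, one of the simplest forms of information redundancy. This is only a minimal sketch: the function names `encode` and `decode` are hypothetical, and real systems typically use more efficient codes.

```python
def encode(bits, copies=3):
    """Repeat every bit `copies` times (the extra bits are the redundancy)."""
    return [b for b in bits for _ in range(copies)]


def decode(coded, copies=3):
    """Recover each original bit by majority vote over its copies."""
    recovered = []
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        recovered.append(1 if sum(group) * 2 > len(group) else 0)
    return recovered


message = [1, 0, 1, 1]
sent = encode(message)

# Simulate a noisy channel: garble one copy of the first and of the fourth bit.
received = sent.copy()
received[1] ^= 1
received[9] ^= 1

recovered = decode(received)
print(recovered == message)
```

With three copies per bit, a single garbled copy in each group is masked; two garbled copies in the same group would not be.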
Process resilience
The key approach to tolerating a faulty process is to organize several identical processes
into a group.
If one process in a group fails, hopefully some other process can take over.
Caveats
A process can join a group or leave one during system operation.
A process can be a member of several groups at the same time.
Flat vs. hierarchical groups
Flat group                        Hierarchical group
(+) Symmetrical                   (+) Decision making is simple
(+) No single point of failure    (-) Asymmetrical
An important distinction between different groups has to do with their internal structure.
How can we achieve a k-fault-tolerant system?
This would require an agreement protocol applied to a process group.
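As a rough illustration of what k-fault tolerance means for group size, the sketch below uses the standard textbook bounds: k + 1 replicas if processes fail silently (crash), and 2k + 1 if faulty processes may return wrong replies and the client relies on majority voting. The names `replicas_needed` and `vote` are hypothetical, and this is not a full agreement protocol.

```python
from collections import Counter


def replicas_needed(k, failure_model="crash"):
    """Minimum group size to survive k faulty members (textbook bounds)."""
    if failure_model == "crash":
        return k + 1          # one surviving replica is enough
    if failure_model == "arbitrary":
        return 2 * k + 1      # k + 1 correct replies must outvote k wrong ones
    raise ValueError(f"unknown failure model: {failure_model}")


def vote(replies):
    """Return the reply given by a strict majority, or None if there is none."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count * 2 > len(replies) else None


# With k = 2 arbitrary faults we need 5 replicas; 3 correct replies win the vote.
group_size = replicas_needed(2, "arbitrary")
result = vote([42, 42, 42, 7, 13])
print(group_size, result)
```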
Agreement in faulty systems 1
Electing a coordinator
Deciding whether or not to commit a transaction
Dividing tasks among workers
Synchronization
Agreement in faulty systems 2
Goal: have all non-faulty processes reach consensus on some issue, and establish that consensus within
a finite number of steps.
Synchronous vs. asynchronous systems
Communication delay is bounded or not
Message delivery is ordered or not
Message transmission is done through unicasting or multicasting
Agreement in faulty systems 3
Page number 25
Agreement in faulty systems 4
Processes behave asynchronously
Message transmission is unicast
Communication delays are unbounded
Byzantine agreement problem 1
Lamport's assumptions:
Processes are synchronous
Communication delay is bounded
Process Failure Detection
Before we can mask failures, we generally need to detect them.
Timeout mechanism
Failure detection usually involves a timeout mechanism.
Set a timer; if no response arrives before it expires, trigger a timeout.
Page number 37
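The timeout mechanism can be sketched as a heartbeat-based detector. This is a minimal illustration under stated assumptions: the class name `TimeoutFailureDetector`, its methods, and the injectable `clock` parameter are all hypothetical. Note that a timeout only makes a process suspected; a slow process or network looks the same as a crashed one.

```python
import time


class TimeoutFailureDetector:
    """Suspects a process if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the example is deterministic
        self.last_heard = {}        # process id -> time of last heartbeat

    def heartbeat(self, process_id):
        """Record that `process_id` was heard from just now."""
        self.last_heard[process_id] = self.clock()

    def suspected(self):
        """Return the sorted ids of processes whose timer has expired."""
        now = self.clock()
        return sorted(
            p for p, t in self.last_heard.items() if now - t > self.timeout
        )


# Drive the detector with a fake clock instead of real sleeping.
fake_now = [0.0]
detector = TimeoutFailureDetector(timeout=3.0, clock=lambda: fake_now[0])
detector.heartbeat("p1")
detector.heartbeat("p2")

fake_now[0] = 2.0
detector.heartbeat("p1")            # p1 keeps reporting, p2 goes silent

fake_now[0] = 4.0
suspects = detector.suspected()     # p2 has been silent for 4 s > 3 s
print(suspects)
```

Choosing the timeout is a trade-off: a short value detects crashes quickly but raises more false suspicions.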
Failure Detection
Distributed Commit
Atomic multicasting is an example of a more general problem of distributed commit
Distributed commit is often established by means of a coordinator and participants.
There are two kinds of commit protocol:
One-phase commit protocol
In a simple scheme, the coordinator tells all participants whether or not to locally
perform the operation in question.
Two-phase commit protocol
Assuming that no failures occur, the two-phase commit protocol (2PC) consists
of two phases: a voting phase and a decision phase.
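The failure-free flow of 2PC can be sketched as follows. In the voting phase the coordinator asks every participant for a vote; in the decision phase it announces a global commit only if all votes were to commit, and a global abort otherwise. The class and function names are hypothetical, messages are modelled as plain method calls, and no crash handling or logging is included.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit    # whether the local operation can succeed
        self.state = "INIT"

    def on_vote_request(self):
        """Phase 1: answer the coordinator's vote request."""
        if self.can_commit:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"
        return Vote.ABORT

    def on_decision(self, decision):
        """Phase 2: apply the coordinator's global decision."""
        self.state = decision


def two_phase_commit(participants):
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2 (decision): commit only if the vote was unanimous.
    decision = "COMMIT" if all(v is Vote.COMMIT for v in votes) else "ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision


all_ok = [Participant("a"), Participant("b")]
one_refuses = [Participant("a"), Participant("b", can_commit=False)]
outcome_ok = two_phase_commit(all_ok)
outcome_refused = two_phase_commit(one_refuses)
print(outcome_ok, outcome_refused)
```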
Recovery
So far we have mainly concentrated on algorithms that allow us to tolerate faults. Recovery
covers three topics:
Error recovery
Checkpointing
Message logging
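How checkpointing and message logging work together can be sketched with a toy process that keeps a running total. The class `RecoverableCounter` is hypothetical, and the checkpoint and log are held in ordinary attributes here; a real system would keep them in stable storage so that they survive the crash.

```python
import copy


class RecoverableCounter:
    """Toy process whose state can be rebuilt from a checkpoint plus a log."""

    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)   # assumed stable storage
        self.log = []                                 # messages since checkpoint

    def deliver(self, amount):
        self.log.append(amount)       # log the message before applying it
        self.state["total"] += amount

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()              # older messages are now covered

    def crash(self):
        self.state = None             # volatile state is lost

    def recover(self):
        # Roll back to the last checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self.checkpoint)
        for amount in self.log:
            self.state["total"] += amount


proc = RecoverableCounter()
proc.deliver(5)
proc.deliver(7)
proc.take_checkpoint()
proc.deliver(3)
proc.crash()
proc.recover()
print(proc.state["total"])
```

Replay only reproduces the pre-crash state if message handling is deterministic, which holds for this toy example.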
Questions
What is fault tolerance? Briefly explain
process resilience.