
Unit 8

The document discusses fault tolerance in distributed systems, outlining its importance and various types, including process resilience and failure detection. It explains failure models, reliable communication methods, and the necessity for redundancy to mask failures. Additionally, it covers the mechanisms for detecting failures and recovery strategies to ensure system reliability despite faults.

Uploaded by

hike.praji
Copyright
© All Rights Reserved

Fault Tolerance

Introduction to fault tolerance


Fault tolerance topics:
Process resilience
Failure detection and reliable multicasting
Failure Detection

General Background
a. Basic concept
Failures can happen for a variety of reasons: hardware faults, software bugs, operator
errors, and network errors or outages.

A characteristic feature of distributed systems (DS) that distinguishes them from
single-machine systems is the notion of partial failure: one component may fail while
the others keep running.

b. Failure Models
c. Failure masking by redundancy
d. Reliable communication
There are two types of reliable communication:
• Reliable request-reply communication
It is designed to support the roles and message exchanges in typical client-server
interactions.
There are five classes of failure in request-reply communication:
• The client is unable to locate the server
• The request message from the client to the server is lost
• The server crashes after receiving a request
• The reply message from the server to the client is lost
• The client crashes after sending a request

• Reliable group communication


Just as we considered reliable request-reply communication, we also need to consider
reliable multicasting services. Reliable group communication covers three topics:
• The basic reliable multicasting scheme
• Scalability in reliable multicasting
• Atomic multicast
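The lost-request and lost-reply failure classes listed earlier in this section are commonly handled by having the client retransmit after a timeout while the server filters out duplicates. The following is a minimal in-memory sketch of that idea, not a real network implementation. The names `Server`, `LossyChannel`, `call_with_retry` and the request-ID scheme are illustrative assumptions that do not appear in the notes.

```python
import itertools


class Server:
    """Executes each request at most once by caching replies per request ID."""

    def __init__(self):
        self.counter = 0
        self.replies = {}  # request_id -> cached reply

    def handle(self, request_id, amount):
        if request_id in self.replies:      # duplicate caused by a retransmission
            return self.replies[request_id]
        self.counter += amount              # the non-idempotent operation
        self.replies[request_id] = self.counter
        return self.counter


class LossyChannel:
    """Hypothetical channel that drops messages according to a fixed script."""

    def __init__(self, server, drops):
        self.server = server
        self.drops = iter(drops)  # each entry: None, "request", or "reply"

    def send(self, request_id, amount):
        drop = next(self.drops, None)
        if drop == "request":
            return None                     # request lost: server never sees it
        reply = self.server.handle(request_id, amount)
        if drop == "reply":
            return None                     # reply lost: server did execute
        return reply


_ids = itertools.count(1)


def call_with_retry(channel, amount, max_attempts=5):
    """Resend the same request ID until a reply arrives (None models a timeout)."""
    request_id = next(_ids)
    for _ in range(max_attempts):
        reply = channel.send(request_id, amount)
        if reply is not None:
            return reply
    raise TimeoutError("no reply; the server may be down or unreachable")


server = Server()
channel = LossyChannel(server, drops=["request", "reply", None])
result = call_with_retry(channel, amount=10)
print(result, server.counter)
```

Because the client reuses one request ID across retransmissions, the server applies the update only once even though the request reaches it twice.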
Goal of Fault Tolerance
An overall goal in DS is to construct the system in such a way that it can automatically recover
from partial failures.

Figure page no 8
Fault tolerance is the property that enables the system to continue operating properly in the event
of failure.

Faults, errors and failures


Figure
A system is said to be fault tolerant if it can provide its services even in the presence of
faults.

Fault tolerance requirements


A robust fault-tolerant system requires:
• No single point of failure
• Fault isolation
• Availability of reversion modes

Figure

Failure Models
Figure

Failure masking by redundancy


The key technique for masking faults is to use redundancy.
With information redundancy, for example, extra bits are added to allow recovery from garbled bits.
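As a small illustration of adding extra bits to recover from garbled ones, the sketch below uses a triple-repetition code with majority voting. This particular code is an assumption chosen for brevity; the notes do not name a specific scheme, and practical systems usually use more efficient codes.

```python
from collections import Counter


def encode(bits, copies=3):
    """Add redundancy by repeating every bit `copies` times."""
    return [b for b in bits for _ in range(copies)]


def decode(coded, copies=3):
    """Recover each original bit by majority vote over its copies."""
    decoded = []
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        decoded.append(Counter(group).most_common(1)[0][0])
    return decoded


message = [1, 0, 1, 1]
sent = encode(message)
received = sent.copy()
received[1] ^= 1   # garble one bit in the first group
received[9] ^= 1   # garble one bit in the fourth group
recovered = decode(received)
print(recovered == message)
```

With three copies per bit, one garbled copy in a group is masked; two garbled copies in the same group would not be.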
Figure
Process resilience
• The key approach to tolerating a faulty process is to organize several identical processes
into a group.

• If one process in a group fails, hopefully some other process can take over.

Caveats
• A process can join a group or leave one during system operation.
• A process can be a member of several groups at the same time.

Flat vs. Hierarchical

Flat Group                        Hierarchical Group

(+) Symmetrical                   (+) Decision making is simple

(+) No single point of failure    (-) Asymmetrical


An important distinction between different groups has to do with their internal structure.
How can we achieve a k-fault-tolerant system?
This requires an agreement protocol applied to a process group.
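A system is commonly called k-fault tolerant if it can survive faults in k components and still meet its specification. The group sizes below are the usual textbook figures, which the notes themselves do not state: k + 1 members suffice when processes fail silently (crash), and 2k + 1 are needed when faulty members may return wrong answers and the result is taken by majority vote. The function name `min_group_size` is hypothetical.

```python
def min_group_size(k, failure_model):
    """Smallest group that masks k faulty members under the given failure model."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "crash":
        # One surviving member is enough to deliver the service.
        return k + 1
    if failure_model == "byzantine":
        # The k + 1 correct answers must outvote the k faulty ones.
        return 2 * k + 1
    raise ValueError(f"unknown failure model: {failure_model}")


for k in (1, 2, 3):
    print(k, min_group_size(k, "crash"), min_group_size(k, "byzantine"))
```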
Agreement in faulty system 1
• Electing a coordinator
• Deciding whether or not to commit a transaction
• Dividing tasks among workers
• Synchronization
Agreement in faulty system 2
The goal is to have all non-faulty processes reach consensus on some issue and to establish that
consensus within a finite number of steps. The solution depends on the system assumptions:
• Synchronous vs. asynchronous systems
• Communication delay is bounded or not
• Message delivery is ordered or not
• Message transmission is done through unicasting or multicasting

Agreement in faulty system 3


Page number 25

Agreement in faulty system 4


• Processes behave asynchronously
• Message transmission is unicast
• Communication delays are unbounded

Byzantine agreement problem 1


Lamport's assumptions are:
• Processes are synchronous
• Communication delay is bounded
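Under these assumptions, Byzantine agreement is known to be achievable only when more than two thirds of the processes are non-faulty, that is, with at least 3k + 1 processes for k faulty ones. The sketch below simulates the two-round vector exchange often used to illustrate the case k = 1. It is a simplified model, not the full recursive algorithm: the function name `byzantine_agreement`, the use of `None` for "unknown", and the way a faulty process lies (a fresh arbitrary value every time, assumed never to equal a real value) are all assumptions made for this example.

```python
from collections import Counter
from itertools import count


def byzantine_agreement(values, faulty):
    """Return the result vector computed by every non-faulty process."""
    procs = range(len(values))
    lies = count(100)  # every lie is a distinct value; real values must stay below 100

    # Round 1: every process sends its value to all others.
    # heard[p][q] is what p heard from q; a faulty q tells everyone something different.
    heard = {p: [next(lies) if q in faulty else values[q] for q in procs]
             for p in procs}

    results = {}
    for p in procs:
        if p in faulty:
            continue
        # Round 2: every other process forwards the vector it collected;
        # a faulty process forwards arbitrary garbage instead.
        received = [[next(lies) for _ in procs] if q in faulty else heard[q]
                    for q in procs if q != p]
        result = []
        for i in procs:
            value, votes = Counter(vec[i] for vec in received).most_common(1)[0]
            # Accept element i only if a strict majority of vectors agree on it.
            result.append(value if votes > len(received) // 2 else None)
        results[p] = result
    return results


four = byzantine_agreement([1, 2, 3, 4], faulty={2})
three = byzantine_agreement([1, 2, 3], faulty={2})
print(four)
print(three)
```

With four processes and one faulty, every non-faulty process ends up with the same vector and agrees on the values of the other non-faulty processes. With only three processes no element reaches a majority, which matches the 3k + 1 requirement.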

Process Failure Detection


Before we can mask failures, we generally need to detect them.
Timeout mechanism
• Failure detection usually involves a timeout mechanism.
• Set a timer; if no response arrives before it expires, a timeout is triggered.
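A timeout-based detector can be sketched as follows. The heartbeat messages, the class name `HeartbeatDetector` and the injectable clock are assumptions made for this example; the notes only say that a timer is set and a timeout is triggered. Note that a timeout can only make a process suspected: when delays are unbounded, a slow process cannot be told apart from a crashed one.

```python
import time


class HeartbeatDetector:
    """Suspects a process when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # process_id -> time of the latest heartbeat

    def heartbeat(self, process_id):
        """Record that `process_id` has just shown a sign of life."""
        self.last_seen[process_id] = self.clock()

    def suspected(self):
        """Return the processes whose timer has expired."""
        now = self.clock()
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
detector = HeartbeatDetector(timeout=3.0, clock=lambda: fake_now[0])
detector.heartbeat("p1")
detector.heartbeat("p2")
fake_now[0] = 2.0
detector.heartbeat("p1")   # p1 is still alive
fake_now[0] = 4.0
print(detector.suspected())
```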
Page number 37
Failure Detection
Distributed Commit
• Atomic multicasting is an example of the more general problem of distributed commit.
• Distributed commit is often established by means of a coordinator and participants.
• There are two types of commit protocol:
• One-phase commit protocol
In this simple scheme, the coordinator tells all participants whether or not to locally
perform the operation in question.

• Two-phase commit protocol


Assuming that no failures occur, the two-phase commit protocol (2PC) consists
of the following two phases: a voting phase, in which the coordinator asks every
participant whether it is prepared to commit, and a decision phase, in which the
coordinator tells all participants to commit only if every one of them voted to commit,
and to abort otherwise.
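The failure-free case of two-phase commit can be sketched as below: the coordinator first collects a vote from every participant, then broadcasts a global decision that is "commit" only if all votes were to commit. The class and message names (`Participant`, `VOTE_COMMIT`, `GLOBAL_COMMIT` and so on) follow common textbook usage but are assumptions here, and the sketch deliberately ignores crashes, timeouts and logging.

```python
class Participant:
    """A participant that votes and then applies the coordinator's decision."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def vote(self):
        """Phase 1: answer the coordinator's vote request."""
        self.state = "READY" if self.can_commit else "ABORT"
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"

    def decide(self, decision):
        """Phase 2: apply the global decision locally."""
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants):
    """Coordinator logic for the failure-free case."""
    # Phase 1 (voting): ask everyone whether they are prepared to commit.
    votes = [p.vote() for p in participants]
    # Phase 2 (decision): commit only if the vote was unanimous.
    unanimous = all(v == "VOTE_COMMIT" for v in votes)
    decision = "GLOBAL_COMMIT" if unanimous else "GLOBAL_ABORT"
    for p in participants:
        p.decide(decision)
    return decision


group_ok = [Participant("a"), Participant("b"), Participant("c")]
group_bad = [Participant("a"), Participant("b", can_commit=False)]
outcome_ok = two_phase_commit(group_ok)
outcome_bad = two_phase_commit(group_bad)
print(outcome_ok, outcome_bad)
```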
Recovery
So far we have mainly concentrated on algorithms that allow us to tolerate faults. There are
three aspects of recovery:
• Error recovery
• Checkpointing
• Message logging
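Checkpointing and message logging are often combined: after a crash, a process restores its last checkpoint and replays the messages logged since then. The sketch below shows that idea for a toy counter. The class `RecoverableCounter` is hypothetical, and the example assumes that both the checkpoint and the log are kept on stable storage that survives the crash.

```python
class RecoverableCounter:
    """Toy process that recovers from a checkpoint plus a message log."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0   # last state saved to (simulated) stable storage
        self.log = []         # messages delivered since the last checkpoint

    def deliver(self, amount):
        self.log.append(amount)   # log the message before processing it
        self.state += amount

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()          # older messages are no longer needed

    def crash(self):
        self.state = None         # volatile state is lost

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages in order.
        self.state = self.checkpoint
        for amount in self.log:
            self.state += amount


counter = RecoverableCounter()
counter.deliver(5)
counter.take_checkpoint()
counter.deliver(3)
counter.deliver(-1)
before_crash = counter.state
counter.crash()
counter.recover()
print(before_crash, counter.state)
```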

Questions
What is fault tolerance? Briefly explain process resilience.
