Mid Sem Prep Distributed System

The document discusses distributed systems where multiple computers work together as a single entity, highlighting features like horizontal scaling, enhanced reliability, and fault tolerance, while also addressing challenges such as unreliable communication and concurrency control. It explores consensus problems, clock synchronisation algorithms like Cristian's and Berkeley's, and the Network Time Protocol (NTP) for maintaining accurate time across systems. Additionally, it introduces logical clocks for event ordering in distributed systems, emphasising the importance of causality and consistency in managing processes.

Defn: A group of computers working together to appear as a single computer to the
end-user.

Features:
Lots of computers, perform concurrently, fail independently, don't share a global clock.

Advantages:
● Horizontal scaling (scalability)
● Inherently distributed applications. Ex: a banking system across cities
● Data sharing. Ex: libraries, certain files, complete datasets
● Enhanced reliability
● Increased performance/cost ratio (parallelisation)
● Fault tolerance

Challenges
Unreliable communication, lack of global knowledge (no shared memory), lack of
synchronisation (different local clocks), concurrency control (implementing mutual
exclusion / critical sections), failure and recovery, deadlocks, termination detection, and
distributed file systems are other concerns.

Puzzle

Consensus: All non-faulty processes must agree on a common value. If the initial value of
all non-faulty processes is v, then the agreed upon value must be v.

The village without a mirror – consensus

In a village with at least one blue-eyed person, there is no mirror in the village and people
don't discuss eye colour with each other. All blue-eyed people are asked to leave the
village. In how many days will all blue-eyed people realise that they have blue eyes and
leave the village? Each day the villagers gather at a common place, see each other, go
back, and repeat this every day.

Suppose the number of people in the village is n, and the number of people with blue eyes
is k, where 1 <= k <= n.

Answer: by induction on k, all k blue-eyed villagers deduce their eye colour on day k and
leave together that day. (Base case k = 1: the lone blue-eyed person sees no other blue
eyes and, knowing at least one blue-eyed person exists, leaves on day 1.)

War Preparation with messengers

There are two teams who want to battle each other, but the condition is that they want to
start the battle together, i.e., a team starts the battle only when it is sure that the other
team is ready for battle.

● Team 1 sends a messenger to Team 2, but it cannot start because Team 1 is not
sure whether Team 2 has received this message.
● Team 2 receives this messenger, then sends a messenger back to Team 1
(acknowledging receipt), but it cannot start because Team 2 is not sure whether
Team 1 has received this acknowledgement.
● Team 1 receives this messenger, then sends a messenger to Team 2
(acknowledging receipt), but it cannot start because Team 1 is not sure whether
Team 2 has received this acknowledgement.
● …and so on, forever.

So the problem is finding when they will come to an agreement that both are ready for
battle and can start. This is the Two Generals problem: no finite sequence of
acknowledgements makes both sides certain, so agreement is impossible over an
unreliable channel.

Consensus with traitors included - the general too may be malicious (Byzantine
Agreement Problem)

Why is time important


To determine the order of events. To maintain consistency in replicated databases. To
design correct deadlock detection algorithms and avoid phantom and undetected deadlocks.
To track dependent events: in debugging, it helps construct a consistent state
for resuming re-execution; in failure recovery, it helps build a checkpoint. In replicated
databases, it aids in detecting file inconsistencies in case of a network partitioning.

Note: Clocks can easily drift seconds per day, accumulating significant errors over time.

Cristian’s (Time Server) Algorithm

Cristian's Algorithm is a clock synchronisation algorithm used by client processes to
synchronise time with a time server. It works well in low-latency networks where the
Round Trip Time (RTT) is short compared to the required accuracy; it is a poor fit for
redundancy-prone distributed systems/applications, since it depends on a single time server.

Round Trip Time refers to the time duration between the start of a Request and the end of
the corresponding Response.
Algorithm:

1) The process on the client machine sends a request for the clock time (the time at the
server) to the Clock Server, at time T0.

2) The Clock Server listens to the request made by the client process and returns the
response in the form of its clock time.

3) The client process receives the response from the Clock Server at time T1 and
calculates the synchronised client clock time using the formula given below:

T_CLIENT = T_SERVER + (T1 - T0)/2

where T_CLIENT refers to the synchronised clock time,

T_SERVER refers to the clock time returned by the server,

T0 refers to the time at which the request was sent by the client process, and

T1 refers to the time at which the response was received by the client process.

Working/Reliability of the above formula:

T1 - T0 is the combined time taken by the network and the server to transfer the
request to the server, process it, and return the response to the client
process, assuming that the request and response latencies are approximately equal.
The time at the client side then differs from the actual time by at most (T1 - T0)/2
seconds, so the error in synchronisation can be at most (T1 - T0)/2 seconds.

Hence, Error ∈ [-(T1 - T0)/2, (T1 - T0)/2]
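A minimal client-side sketch of the algorithm above, assuming a hypothetical UDP time server that answers any datagram with its current time packed as an 8-byte big-endian double (the server address and wire format are made up for illustration):

```python
import socket
import struct
import time

def cristian_sync(server_host: str, server_port: int) -> float:
    """Return a synchronised clock estimate using Cristian's algorithm.

    Assumes (hypothetically) that the server replies to any datagram with
    its current time packed as an 8-byte big-endian double.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(1.0)
        t0 = time.time()                    # request sent at T0
        sock.sendto(b"time?", (server_host, server_port))
        data, _ = sock.recvfrom(8)
        t1 = time.time()                    # response received at T1
    t_server = struct.unpack("!d", data)[0]
    # T_CLIENT = T_SERVER + (T1 - T0)/2; error is within +/- (T1 - T0)/2
    return t_server + (t1 - t0) / 2
```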

Improvement in Clock Synchronisation:

Using iterative testing over the network, we can measure a minimum transfer time, with
which we can formulate an improved synchronised clock time (with a smaller
synchronisation error). Having measured a minimum transfer time Tmin, we can say with
high confidence that T_SERVER was generated after T0 + Tmin and before T1 - Tmin,

where Tmin is the minimum transfer time: the minimum value of T_REQUEST and
T_RESPONSE observed over several iterative tests. The synchronisation error can then be
formulated as follows:
Error ∈ [-((T1 - T0)/2 - Tmin), (T1 - T0)/2 - Tmin]

Similarly, if T_REQUEST and T_RESPONSE differ by a considerable amount of time, we may
replace Tmin by Tmin1 and Tmin2, where Tmin1 is the minimum observed request
time and Tmin2 is the minimum observed response time over the network.
The synchronised clock time in this case can be calculated as:

T_CLIENT = T_SERVER + (T1 - T0)/2 + (Tmin2 - Tmin1)/2

So, by treating the request and response times as separate latencies, we can improve the
synchronisation of the clock time and hence decrease the overall synchronisation error.
The number of iterative tests to run depends on the overall clock drift observed.

From Slides
Uses a time server to synchronise clocks. The time server keeps the reference time (say UTC).
A client asks the time server for the time, the server responds with its current time T, and
the client uses the received value T to set its clock.
But network round-trip time introduces errors…
Let RTT = response-received-time − request-sent-time (measurable at the client).
If we know
(a) min = the minimum client-server one-way transmission time
and
(b) that the server timestamped the message at the last possible instant before
sending it back,

then the actual time lies in [T + min, T + RTT − min],

where RTT − min is the maximum time the response could have taken to return, which
happens when the request took the minimum time min to arrive.

Berkeley UNIX algorithm

Berkeley’s Algorithm is a clock synchronisation technique used in distributed systems.

Assumption: The algorithm assumes that

1) each machine node in the network either doesn’t have an accurate time source or

2) doesn’t possess a UTC server.

Algorithm

1) An individual node is chosen as the master node from the pool of nodes in the network.
This node is the main node in the network; it acts as the master and the rest of the nodes
act as slaves. The master node is chosen using an election process/leader election algorithm.

2) Master node periodically pings slave nodes and fetches clock time at them using
Cristian’s algorithm.

[Diagrams omitted: the master polls the slave nodes, and the slaves reply with the times given by their system clocks.]
3) Master node calculates the average time difference between all the clock times received
and the clock time given by the master’s system clock itself. This average time difference is
added to the current time at the master’s system clock and broadcasted over the network.

[Diagram omitted: the master broadcasts the adjusted time (the last step of Berkeley's algorithm).]


Scope of Improvement

● Accounting for the inaccuracy of Cristian's algorithm when the master reads the
slaves' clocks.
● Ignoring significant outliers in the calculation of the average time difference.
● In case the master node fails or is corrupted, a secondary leader must be
pre-chosen to take the place of the master node, to reduce the downtime caused by
the master's unavailability.
● Instead of sending the synchronised absolute time, the master broadcasts the relative
time difference (offset) each node should apply; the correction is then not distorted by
the network traversal time or by the time spent computing at the slave node.
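A toy sketch of one round of the averaging step, with clocks as plain floats (a real master would read each slave over the network and correct each reading by RTT/2, Cristian-style; `berkeley_round` is a made-up helper name):

```python
def berkeley_round(master_clock: float, slave_clocks: list[float]) -> list[float]:
    """One Berkeley round: return the relative adjustment for each node
    (index 0 is the master, the rest follow slave_clocks order)."""
    clocks = [master_clock] + slave_clocks
    # Average time difference of all clocks relative to the master.
    avg_offset = sum(c - master_clock for c in clocks) / len(clocks)
    target = master_clock + avg_offset
    # Broadcast relative adjustments, not the absolute time, so that
    # network traversal delay does not distort the correction.
    return [target - c for c in clocks]

# Example: master at 100.0, slaves at 95.0 and 110.0 -> average ~101.67,
# so adjustments are roughly [+1.67, +6.67, -8.33].
print(berkeley_round(100.0, [95.0, 110.0]))
```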

Decentralised Averaging Algorithm


● Each machine has a time daemon, without access to UTC.
● Periodically, at fixed agreed-upon times, each machine broadcasts its local time.
● Each of them calculates the average time by averaging all the received local times,
and adjusts its clock to it.

Network Time Protocol (NTP) is a protocol that synchronises the clock times of computers
in a network. It is an application-layer protocol responsible for the synchronisation of hosts
on a TCP/IP network. NTP was developed by David Mills in 1981 at the University of
Delaware. It is required so that communicating computers maintain a seamless, consistent
notion of time.

Features of NTP :

Some features of NTP are –

● NTP servers have access to highly precise atomic clocks and GPS clocks.
● It uses Coordinated Universal Time (UTC) to synchronise CPU clock time.
● It supports authentication, reducing vulnerabilities in the exchange of time
information.
● Provides consistent timekeeping for file servers.

Working of NTP :

NTP works over the application layer; it uses a hierarchical system of time
resources and provides synchronisation within the stratum servers. At the topmost
level there are highly accurate time sources, e.g., atomic or GPS clocks. These clock
resources are called stratum 0 devices, and they are linked to the NTP servers below
them, called stratum 1, 2, 3, and so on. These servers then provide the accurate date and
time so that communicating hosts are synced to each other.
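For reference, the core per-exchange computation an NTP client performs uses four timestamps; the offset/delay formulas below are the standard ones from RFC 5905, shown as a tiny Python helper:

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Standard NTP formulas, with
    t1 = client send, t2 = server receive,
    t3 = server send, t4 = client receive."""
    offset = ((t2 - t1) + (t3 - t4)) / 2    # estimated clock offset
    delay = (t4 - t1) - (t3 - t2)           # round-trip network delay
    return offset, delay
```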

Architecture of Network Time Protocol:
[Diagram omitted: the stratum hierarchy, from stratum 0 clocks down through stratum 1, 2, 3 servers.]

Applications of NTP :

● Used in a production system where the live sound is recorded.


● Used in the development of Broadcasting infrastructures.
● Used where file system updates needed to be carried out across multiple computers
depending on synchronized clock times.
● Used to implement security mechanism which depend on consistent time keeping
over the network.
● Used in network acceleration systems which rely on timestamp accuracy to calculate
performance.

Advantages of NTP :

● It provides internet-based time synchronisation between devices.
● It provides enhanced security within the premises.
● It is used by authentication systems like Kerberos.
● It supports network acceleration systems, which helps in troubleshooting problems.
● It is useful for file systems where synchronising across the network is otherwise
difficult.

Disadvantages of NTP :

● When the servers are down, sync time is affected across running
communications.
● Servers are prone to error due to various time zones, and conflicts may occur.
● There can be a minimal loss of time accuracy.
● When NTP packets increase, synchronisation conflicts can occur.
● Synchronisation can be manipulated (spoofed) by an attacker.

We don’t even need physical time, logical time is sufficient.

Logical Clocks refer to implementing a protocol on all machines within your distributed
system, so that the machines are able to maintain consistent ordering of events within some
virtual timespan. A logical clock is a mechanism for capturing chronological and causal
relationships in a distributed system. Distributed systems may have no physically
synchronous global clock, so a logical clock allows global ordering on events from different
processes in such systems.

Example :

If we go on an outing, we make a full plan of which place to visit first,
second, and so on. We don't go to the second place first and then the first place. We
always follow the procedure or organisation that was planned before. In a similar way, we
should perform the operations on our PCs one by one, in an organised way.

Suppose we have more than 10 PCs in a distributed system and every PC is doing its own
work; how do we then make them work together? The solution to this is the
LOGICAL CLOCK.

Method-1:

One approach to ordering events across processes is to try to sync the clocks.

This means that if one PC has the time 2:00 pm, then every PC should have the same time,
which is quite impossible: not every clock can be synced at once. So we can't follow this
method.

Method-2:

Another approach is to assign Timestamps to events.


Taking the example into consideration, this means if we assign the first place as 1, the
second place as 2, the third place as 3, and so on, then we always know that the first place
will always come first, and so on. Similarly, if we give each event its own number
(timestamp), work gets organised so that the event with timestamp 1 happens first, then
the second, and so on.

BUT, Timestamps will only work as long as they obey causality.

What is causality ?

Causality is fully based on the HAPPENED-BEFORE relationship.

● Taking a single PC: if two events A and B occur one after another, then TS(A) <
TS(B). If A has a timestamp of 1, then B should have a timestamp greater than 1;
only then does the happened-before relationship hold.
● Taking 2 PCs, with event A in P1 (PC 1) and event B in P2 (PC 2), and a message
sent at A and received at B, the condition is again TS(A) < TS(B). For example,
suppose you send a message to someone at 2:00:00 pm and the other person
receives it at 2:00:02 pm. Then it's obvious that TS(sender) < TS(receiver).

Properties Derived from Happen Before Relationship –

● Transitive Relation –
If TS(A) < TS(B) and TS(B) < TS(C), then TS(A) < TS(C).
● Causally Ordered Relation –
a → b means that a occurs before b, and any change in a can affect (reflect on) b.
● Concurrent Events –
Not every pair of events is ordered one after the other; some events can happen
simultaneously, i.e., A || B.
Two events are logically concurrent if and only if the events do not causally affect
each other. In other words, ei || ej ↔ Not(ei → ej) and Not(ej → ei).

Note that for logical concurrency of two events, the events may not occur at the same time.

Two events are physically concurrent iff the events occur at the same physical time.

Using Above we can determine

To check whether events x and y are concurrent, verify that neither x → y nor
y → x holds.

How to check whether x → y holds: if there is a directed path from x to y (in the
space-time diagram), then x → y holds; otherwise it does not.

Three ways to implement logical time -


● scalar time,
● vector time, and
● matrix time

Logical Clock
Home Work
Do the elements of T form a partially ordered or totally ordered set over the relation < ?

Solution)
Partial order defn: A binary relation is a partial order if and only if the relation is
reflexive (R), antisymmetric (A) and transitive (T).
Transitive (T): holds.
Antisymmetric (A): if (a,b) ∈ R and (b,a) ∈ R, then a = b. Holds vacuously, since a < b
and b < a can never hold together.
Reflexive (R): (a, a) ∈ R ∀ a ∈ X, i.e., I ⊆ R where I is the identity relation on A.
Does it hold? No: a < a never holds.

Therefore it is not a partial order (it is a strict, i.e., irreflexive, partial order).

Consistent: When T and C satisfy the condition: for two events ei and ej ,
(ei → ej) ⇾ (C(ei) < C(ej)).
This property is called the clock consistency condition.

Strongly Consistent: When T and C satisfy the condition: for two events ei and ej ,
(ei → ej) ⇿ (C(ei) < C(ej)).
then the system of clocks is said to be strongly consistent.

i.e. it should not happen that (ei || ej) and (C(ei) < C(ej)).

Scalar Time given by Lamport.


The logical local clock of a process pi and its local view of the global time are combined into
one integer variable Ci.
Scalar time properties:
Monotonicity: For two events ei and ej,
(ei → ej) ⇾ (C(ei) < C(ej)).

Total Ordering: Using logical time to find a total order.

Note that the timestamps alone do not induce a total order: two events at different
processes can have an identical timestamp.

Tie-breaking mechanism: requests are timestamped and served according to the total order
based on these timestamps.

Let t denote the timestamp and i the identity of the process. Order the tuples (t, i) by
(t1, i1) < (t2, i2)
iff either
t1 < t2
or
((t1 == t2) and i1 < i2). This is a total order on events.

This total order is consistent with the relation →.


Note: according to the total order above:
For events e1 and e2, if e1 < e2 (i.e., C(e1) < C(e2)), then either
e1 → e2
or
e1 || e2 (and hence scalar time is not strongly consistent).

Note: in the vector method this case cannot even arise, i.e., C(e1) < C(e2) with e1 || e2,
because if e1 || e2 then the vectors C(e1) and C(e2) are incomparable.

Event Counting: Set the increment d to 1 always. If some event e has a timestamp t, then
t − 1 other events causally precede e. This can be called the height of event e.

No Strong Consistency: Note that scalar time does not provide strong
consistency. [Strong consistency requires that ei → ej ⇿ C(ei) < C(ej).]
A counter-example suffices: refer to the timeline again and look at e33 and e42.

Limitation:
Strong consistency is not achieved because scalar time uses a single integer to represent
both the logical local clock and the logical global clock. This means the causality of events
across processes is lost: by looking at the timestamps alone we cannot tell which two events
are causally related and which two are not.
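A minimal sketch of a scalar (Lamport) clock with increment d = 1, following the rules above: tick before each local/send event, and on a receive first take the max with the piggybacked timestamp, then tick:

```python
class LamportClock:
    """Scalar logical clock for one process (increment d = 1)."""

    def __init__(self) -> None:
        self.c = 0

    def tick(self) -> int:
        """Internal or send event: advance the clock."""
        self.c += 1
        return self.c          # this value is piggybacked on sends

    def receive(self, msg_ts: int) -> int:
        """Receive event: absorb the piggybacked timestamp, then tick."""
        self.c = max(self.c, msg_ts)
        self.c += 1
        return self.c
```

For the total order, events are then compared as (timestamp, process id) tuples lexicographically, exactly as in the tie-breaking rule above.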
Vector Time
● vti[j] represents process pi’s latest knowledge of process pj local time.
● If vti[j]=x, then process pi knows that local time at process pj has progressed till x.
● The entire vector vti constitutes pi ’s view of the global logical time and is used to
timestamp events.

For process i, with d = 1, are the events internal to the process ordered as a sequence
1, 2, 3, 4, …?
That is, for process i: will Vi[i] always increase as 1, 2, 3, 4, 5, … or can values be
skipped?
Answer: Yes, it increases without skips, since only pi ever increments Vi[i]. Prove it in
cases:
1) No communication between processes.
2) Send only.
3) Sends and receives in the most complex order.

Ordering using the vector

Using vector clocks, two vector timestamps vh and vk are compared as follows:
● vh == vk iff for all indices i, vh[i] == vk[i]
● vh <= vk iff for all indices i, vh[i] <= vk[i]
● vh < vk iff vh <= vk and there exists an index i where vh[i] < vk[i]
● vh || vk iff not(vh < vk) and not(vk < vh)
How many events causally precede e in the distributed computation?
(Σj vh[j]) − 1; the −1 is because we want the number of events preceding e itself.
Note: the formula won't work for concurrent processes.
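The comparison rules above translate directly into code; a small sketch with vector timestamps as plain Python lists of equal length:

```python
def vc_eq(vh: list[int], vk: list[int]) -> bool:
    return all(a == b for a, b in zip(vh, vk))

def vc_le(vh: list[int], vk: list[int]) -> bool:
    return all(a <= b for a, b in zip(vh, vk))

def vc_lt(vh: list[int], vk: list[int]) -> bool:
    return vc_le(vh, vk) and any(a < b for a, b in zip(vh, vk))

def vc_concurrent(vh: list[int], vk: list[int]) -> bool:
    return not vc_lt(vh, vk) and not vc_lt(vk, vh)

def causal_predecessors(vh: list[int]) -> int:
    """Number of events that causally precede the event stamped vh."""
    return sum(vh) - 1
```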

Are Vector clocks strongly consistent ?


Yes.
Proof: We must show ei → ej ⟺ vi < vj.
1- Show ei → ej implies vi < vj.
2- Show vi < vj implies ei → ej.

1. If ei → ej, then either both events are in the same process, or ej is reached from ei via
a message send/receive, or there is a chain ei → ez → ej.
a) If in the same process pk, then vi and vj are successive vectors of the same process,
with Vki[k] < Vkj[k] and no entry decreasing, so vi < vj.
b) Let the two vectors be Vim (at sender pi) and Vjn (at receiver pj). For a send/receive
pair, the receiver takes the componentwise max with the piggybacked vector and then
increments its own entry, so Vim[k] <= Vjn[k] for all k and Vim[j] < Vjn[j].
c) By a), b), and transitivity over the chain, vi < vj.

2) Suppose not; then vi < vj but ei ↛ ej.

For two events of the same process, the vectors can only increase, so ei → ej.
For two events of different processes, assume vi < vj but ei ↛ ej. That is,
vi < vj but ej → ei or ei || ej.
If ej → ei: consider Vim[i] and Vjn[i]. Since ej → ei, we have Vjn[i] < Vim[i], so vi is not
< vj. Contradiction.
If ei || ej: consider Vim[i] and Vjn[i]. Since ei || ej, Vjn has received no update originating
from ei, and hence Vim[i] > Vjn[i], so vi is not < vj. Contradiction.

What are the limitations of the vector clock?

Large message sizes, owing to the vector being piggybacked on each message.
The message overhead grows linearly with the number of processes in the system; when
there are thousands of processes, the message size becomes huge
even if only a few events occur in a few processes.

What can we do to reduce the load?

Read the Singhal-Kshemkalyani differential technique.

Examples:
Teaser 1
Causal Message Ordering (Not Causality)
If
Send(M1) → Send(M2) [M1, M2 are messages] (the sources of M1 and M2 may
or may not be the same process)
then
every process that receives both messages M1 and M2 must receive M1
before M2 [irrespective of the number of intermediaries through which each message
reaches the destination].

OR
If two messages causally ordered between them are sent (even if by two different
processes) to a process P; then process P should receive the msg also by the same causal
order.

In short: The message which is sent first should be received first (irrespective of who is at
destination).

Consider:
● A sends msg M2 to C.
● A then sends msg M1 to B.
● B, on receiving M1, sends msg M3 to C.

At C: C receives two messages, one from A (M2) and the other from B (M3).
The msg sent by B to C is causally ordered after the msg sent by A to C, since
Send(M2) → Send(M1) → Receive(M1) → Send(M3).
Hence, the msg sent by B to C (M3) must be received after the msg sent by A to C (M2).

Slides solution
No. Suppose B received a msg from C before it sent a msg to me; then
B updates as
VB[D] = max(VB[D], VC[D]).
Hence we do not know whether VB[D] is an update B got from C or an update B already
had directly from D.

Teaser 2
Same as Teaser 1, but there is Causal Message Ordering.

Yes. Because broadcasts happen at time t, it can never happen that message 4 reaches K
before message 5 reaches K.

Slides solution
Let us assume that B got D's news through C.
That means there were two msgs with recipient B: one is the msg from D, and
another is a msg from C due to which VB[D] = VC[D].

Now, by causal order, if two msgs are intended for the same destination, then the one sent
first should reach first.
Can the msg from C to B with D's update be sent before D's msg to B?
No, as D had broadcast all its msgs. So C can be sending the update on D to B only after D
sent a msg to B.

Hence D's msg to B will reach first, and the update could not have come through anyone else.

Teaser 3
I want to know if D's msg has reached everyone.
Instead of receiving a msg from everyone, if I receive a msg from B such that VB[x] >= t
for all x (every element of vector VB is at least t), can I conclude that all have received
the broadcast from D?
t is the time of the broadcast.

Matrix Time
Continue from Here
In the system of matrix clocks, time is represented by a set of n x n matrices of
non-negative integers.

A process pi maintains a matrix mti[1..n, 1..n] where mti[i, i] denotes the local logical
clock of pi and tracks the progress of the computation at process pi.

mti[i, j] denotes the latest knowledge that process pi has about the local logical clock,
mtj[j, j], of process pj. The row mti[i, *] is the vector clock vti.

mti[j, k] represents the knowledge that process pi has about the latest knowledge that pj
has about the local logical clock, mtk[k, k], of pk.

The entire matrix mti denotes pi's local view of the global logical time.
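A sketch following the description above, with the whole n x n matrix piggybacked on each message; the update rules used are the standard ones (merge the sender's matrix component-wise, absorb the sender's row into our own vector row, then tick):

```python
class MatrixClock:
    """Matrix clock for process i among processes 0..n-1 (a sketch)."""

    def __init__(self, i: int, n: int) -> None:
        self.i, self.n = i, n
        self.mt = [[0] * n for _ in range(n)]

    def tick(self) -> None:
        self.mt[self.i][self.i] += 1          # local logical clock mt[i, i]

    def send(self) -> list[list[int]]:
        self.tick()
        return [row[:] for row in self.mt]    # piggyback a copy of the matrix

    def receive(self, w: list[list[int]], j: int) -> None:
        """Handle a message from process j carrying its matrix w."""
        # Row i (our vector clock) absorbs the sender's vector clock (row j).
        for k in range(self.n):
            self.mt[self.i][k] = max(self.mt[self.i][k], w[j][k])
        # Merge all second-hand knowledge component-wise.
        for k in range(self.n):
            for l in range(self.n):
                self.mt[k][l] = max(self.mt[k][l], w[k][l])
        self.tick()
```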

Teaser 4

Global snapshot

(Doubt in above)
Messages in Transit

For a channel Cij, the following set of messages can be defined as in transit based on the
local states of the processes pi and pj.
Transit: transit(LSi, LSj) = { mij | send(mij) ∊ LSi Ʌ rec(mij) ∉ LSj }

Notation-wise, a global state GS is defined as

– GS = { ∪i (LSi), ∪i,j (SCij) }

Condition for consistent global state


A global state GS is a consistent global state iff it satisfies the following two conditions :
● C1: Law of Conservation of Messages
Every message that is recorded as sent in the local state of some process is either
captured in the state of the channel or is captured in the local state of the receiver.
send(mij) ∊ LSi ⇒ mij ∊ SCij ⊕ rec(mij) ∊ LSj ( ⊕: Exclusive-OR
operator)
● C2: For every effect, its cause must be present.
If a message is not recorded as sent in the local state of a process Pi, then the
message cannot be included in the state of the channel Cij or be captured as
received by Pj.
send(mij) ∉ LSi ⇒ mij ∉ SCij Ʌ rec(mij) ∉ LSj
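C1 and C2 can be checked mechanically for a recorded state; a small sketch for one channel Cij, treating the recorded sends, receives, and channel state as Python sets of message ids:

```python
def channel_consistent(sent: set, received: set, channel_state: set) -> bool:
    """Check C1 and C2 for a single channel C_ij, where
    sent          = messages whose send is recorded in LS_i,
    received      = messages whose receive is recorded in LS_j,
    channel_state = messages recorded in SC_ij."""
    # C1: every recorded send is in the channel state XOR received.
    for m in sent:
        if (m in channel_state) == (m in received):
            return False
    # C2: nothing appears in SC_ij or LS_j without a recorded send.
    return all(m in sent for m in (channel_state | received))
```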

Difficulty of Taking a Snapshot

Issue 1: How to distinguish the messages to be recorded in the snapshot from
those not to be recorded.
● Any message that is sent by a process before recording its snapshot, must be
recorded in the global snapshot (from C1).
● Any message that is sent by a process after recording its snapshot, must not be
recorded in the global snapshot (from C2).
Issue 2: How to determine the instant when a process takes its snapshot.
● A process pj must record its snapshot before processing a message mij that was
sent by process pi after recording its (pi) snapshot.

Snapshots in FIFO Channels


Chandy-Lamport Algorithm (Anyone can initiate)
1. Use a control message called marker to separate messages in the channels.
2. After a site has recorded its snapshot, it sends a marker along all of its outgoing
channels before sending out any more messages.
3. A marker separates the messages in the channel into those to be included in the
snapshot from those not to be recorded.
4. A process must record its snapshot no later than when it receives a marker on any of
its incoming channels.
5. The algorithm terminates after each process has received a marker on all of its
incoming channels.
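A rough per-process sketch of the marker logic above, assuming FIFO channels and a hypothetical messaging layer that provides a `send(channel, msg)` callback and delivers incoming messages to `on_message`; the `local_state`/`apply` hooks stand in for the application's own state handling:

```python
class ChandyLamportProcess:
    def __init__(self, pid, incoming, outgoing, send):
        self.pid = pid
        self.incoming = incoming            # ids of incoming channels
        self.outgoing = outgoing            # ids of outgoing channels
        self.send = send
        self.recorded_state = None          # local snapshot, once taken
        self.channel_state = {}             # channel id -> recorded messages
        self.marker_seen = set()

    def record_snapshot(self):
        self.recorded_state = self.local_state()
        for ch in self.outgoing:            # marker goes out before any
            self.send(ch, "MARKER")         # further message on the channel
        for ch in self.incoming:            # start recording channels on
            if ch not in self.marker_seen:  # which no marker has arrived yet
                self.channel_state[ch] = []

    def on_message(self, ch, msg):
        if msg == "MARKER":
            self.marker_seen.add(ch)
            if self.recorded_state is None:
                self.record_snapshot()      # record no later than 1st marker
            self.channel_state.setdefault(ch, [])  # ch's recording stops here
        else:
            if self.recorded_state is not None and ch not in self.marker_seen:
                self.channel_state[ch].append(msg)  # in-transit message
            self.apply(msg)

    def snapshot_complete(self):
        return (self.recorded_state is not None
                and self.marker_seen == set(self.incoming))

    def local_state(self):
        return {}                           # placeholder

    def apply(self, msg):
        pass                                # placeholder
```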

Example
Correctness
A msg sent before the marker is recorded either in the receiver's local snapshot or in the
channel state (as we saw above).
● Any message that is sent by a process before recording its snapshot, must be
recorded in the global snapshot (from C1).

● Any message that is sent by a process after recording its snapshot, must not be
recorded in the global snapshot (from C2).

Note:
1. The recorded global state may not correspond to any of the global states that
occurred during the computation
2. The recorded global state is a valid state in an equivalent execution.

Variations of the Chandy–Lamport algorithm

1) Spezialetti–Kearns algorithm:
a) Snapshots concurrently initiated by multiple processes into a single snapshot.
b) A process needs to take only one snapshot, irrespective of the number of
concurrent initiators and all processes are not sent the global snapshot.
Idea: A marker carries the identifier of the initiator of the algorithm. Each process has a
variable master to keep track of the initiator of the algorithm. When a process executes the
“marker sending rule” on the receipt of its first marker, it records the initiator’s identifier
carried in the received marker in the master variable.

A process does not take a snapshot or propagate a snapshot request initiated by another
process if it has already taken a snapshot in response to some other snapshot initiation.
Snapshot recording at a process is complete after it has received a marker along each of
its channels. After every process has recorded its snapshot, the system is partitioned into
as many regions as the number of concurrent initiations of the algorithm.

Snapshots in non-FIFO Channels


Lai–Yang algorithm
1. Initially every process is white.
2. White → red (a process takes its snapshot just before turning red). A white (red)
message is a message that was sent before (after) the sender of that message
recorded its local snapshot.
3. Every white process records a history of all white messages sent or received by it
along each channel.
4. When a process turns red, it sends these histories along with its snapshot to the
initiator process that collects the global snapshot.
5. Transit(LSi, LSj) = SCij = { white messages sent by pi on Cij } −
{ white messages received by pj on Cij }

C2 holds because a red message is not included in the snapshot of the recipient process
and a channel state is the difference of two sets of white messages.
C1 holds because a white message mij is included in the snapshot of process pj if pj
receives mij before taking its snapshot. Otherwise, mij is included in the state of
channel Cij.
Does the Lai-Yang algorithm work for non-FIFO channels?
It does. The channel state consists of white msgs that were sent before the snapshot and
had not arrived by the time the other process took its snapshot. They are hence truly in
transit and can arrive in any order.

A white message carries (via the sender's history) the record of msgs sent/received
before it was sent, so it does not matter if messages are reordered in the channel, before
or after any other msg in the channel, as the information each one holds cannot change.
Storage?
Storage is heavy, as each process has to remember all msgs it has sent and received until
the snapshot.

Li et al.’s algorithm
Markers are tagged so as to generalize the red/white colors of the Lai–Yang algorithm to
accommodate repeated invocations of the algorithm and multiple initiators.

Snapshots in Causal Channels


Two global snapshot recording algorithms, namely, Acharya-Badrinath and Alagar-
Venkatesan, exist that assume that the underlying system supports causal message
delivery.
Algo:
In both these algorithms the recording of process state is identical and proceeds as follows :
● An initiator process broadcasts a token, denoted as token, to every process including
itself.
● Let the copy of the token received by process pi be denoted tokeni.
● A process pi records its local snapshot LSi on receiving tokeni and sends the
recorded snapshot to the initiator.
● The algorithm terminates when the initiator receives the snapshot recorded by each
process.

Correctness:
For any two processes pi and pj, the following property is satisfied:
send(mij) ∉ LSi ⇒ rec(mij) ∉ LSj

If a send is not recorded in LSi, then, because of causal ordering, its receive cannot
appear in LSj either.

Let a message mij be such that rec(tokeni) → send(mij)


Then send(tokenj) → send(mij) and the underlying causal ordering property ensures
that rec(tokenj), at which instant process pj records LSj, happens before rec(mij).
Thus, mij, whose send is not recorded in LSi, is not recorded as received in LSj.

Using the above principle we will see both algorithms.

See Kishore sir's slides, slide set 6, pg. 13 onwards.

Acharya-Badrinath
Because of causal ordering we don't have to keep track of which messages were sent and
which were received; just their counts suffice.

1. Each process pi maintains arrays SENTi[1, ...N] and RECDi[1, ..., N].
2. SENTi[j] is the number of messages sent by process pi to process pj.
3. RECDi [j] is the number of messages received by process pi from process pj.
4. SENT and RECD do not contribute to the space complexity, because they are
already used by the underlying causal ordering protocol.
5. When a process pi records its local snapshot LSi on the receipt of token, it includes
arrays RECDi and SENTi in its local state before sending the snapshot to the
initiator.
6. When the algorithm terminates, the initiator determines the state of channels as
follows:
a. The state of each channel from the initiator to each process is empty.
b. The state of the channel from process pi to process pj is the set of messages
whose sequence numbers are given by {RECDj[i] + 1, . . ., SENTi[j]}.
c. That is, if 10 msgs were sent but only 5 received, then msgs 6, 7, 8, 9, 10 are in
transit. Due to causal order they will arrive at the receiving end in that order.
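Step 6b is plain counter arithmetic; a tiny sketch mirroring the 10-messages example above:

```python
def in_transit(sent_i_j: int, recd_j_i: int) -> list[int]:
    """Sequence numbers in the state of channel C_ij:
    {RECD_j[i] + 1, ..., SENT_i[j]}; empty if everything sent was received."""
    return list(range(recd_j_i + 1, sent_i_j + 1))

# SENT_i[j] = 10, RECD_j[i] = 5  ->  messages 6..10 are still in transit.
assert in_transit(10, 5) == [6, 7, 8, 9, 10]
```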

Will the Acharya-Badrinath algorithm work for plain FIFO channels? No: FIFO ordering on
individual channels does not provide the cross-channel causal delivery the algorithm relies on.

Correctness Proof:
Let a message mij be such that rec(tokeni) → send(mij). Clearly, send(tokenj) →
send(mij), and the sequence number of mij is greater than SENTi[j]. Therefore,
mij is not recorded in SCij. Thus, send(mij) ∉ LSi ⇒ mij ∉ SCij. This in
conjunction with property P1 implies that the algorithm satisfies condition C2.
Consider a message mij which is the kth message from process pi to process pj
before pi takes its snapshot. The two possibilities below imply that condition C1
is satisfied:
● Process pj receives mij before taking its snapshot. In this case, mij is recorded in
pj's snapshot.
● Otherwise, RECDj[i] ≤ k ≤ SENTi[j] and the message mij will be
included in the state of channel Cij.

Complexity:
This algorithm requires 2n messages and 2 time units for recording and assembling the
snapshot, where one time unit is required for the delivery of a message. If the contents of
the messages in the channel states are required, the algorithm requires 2n messages and
2 time units additionally.

Alagar–Venkatesan algorithm
A message is referred to as old if the send of the message causally precedes the send of the
token. Otherwise, the message is referred to as new. Whether a message is new or old can
be determined by examining the vector timestamp in the message, which is needed to
enforce causal ordering among messages.
1. When a process receives the token, it takes its snapshot, initializes the state of all
channels to empty, and returns Done message to the initiator. Now onwards, a
process includes a message received on a channel in the channel state only if it is an
old message.
2. After the initiator has received Done message from all processes, it broadcasts a
Terminate message.
3. A process stops the snapshot algorithm after receiving a Terminate message.

Correctness proof:
An interesting observation is that a process receives all the old messages in its
incoming channels before it receives the Terminate message. This is ensured by
the underlying causal message delivery property. The causal ordering property
ensures that no new message is delivered to a process prior to the token, and
only old messages are recorded in the channel states. Thus, send(mij) ∉ LSi ⇒
mij ∉ SCij. This together with property P1 implies that condition C2 is satisfied.
Condition C1 is satisfied because each old message mij is delivered either before
the token is delivered or before the Terminate is delivered to a process, and thus
gets recorded in LSi or SCij, respectively.

Complexity analysis of all:
[Comparison table omitted in these notes; see the slides.]


Exercise:
Exercise 4.1 Consider the following simple method to collect a global snapshot (it
may not always collect a consistent global snapshot): an initiator process takes its
snapshot and broadcasts a request to take a snapshot. When some other process receives
this request, it takes a snapshot. Channels are not FIFO.
Prove that such a collected distributed snapshot will be consistent iff the following
holds (assume there are n processes in the system and Vti denotes the vector timestamp
of the snapshot taken by process pi): [condition not reproduced in these notes]

Don't worry about channel states.

Exercise 4.2 What good is a distributed snapshot when the system was never in
the state represented by the distributed snapshot? Give an application of distributed
snapshots.

Exercise 4.3 Consider a distributed system where every node has its physical clock
and all physical clocks are perfectly synchronized. Give an algorithm to record global
state assuming the communication network is reliable. (Note that your algorithm
should be simpler than the Chandy–Lamport algorithm.)

Exercise 4.4 What modifications should be done to the Chandy–Lamport snapshot


algorithm so that it records a strongly consistent snapshot (i.e., all channel states are
recorded empty).
Mutual Exclusion
Mutual exclusion ensures that concurrent access of processes to a shared
resource or data is serialized.

1. Token-based approach (PRIVILEGE message).
2. Non-token-based approach (two or more successive rounds of messages are
exchanged among the sites to determine which site will enter the CS next; a site
enters the critical section (CS) when an assertion, defined on its local variables,
becomes true).
3. Quorum-based approach (each site requests permission to execute the CS from a
subset of sites, called a quorum).

Important consideration:

A process wishing to enter the CS sends REQUEST messages to all other processes, or to
a subset of them, and waits for appropriate replies before entering the CS.
While waiting, the process is not allowed to make further requests to enter the CS.

A site can be in one of the following three states:


1. requesting the CS,
2. executing the CS,
3. or neither requesting nor executing the CS (i.e., idle).

In the “requesting the CS” state, the site is blocked and cannot make further requests for the
CS.

N :number of processes or sites involved in invoking the critical section.


T :average message delay
E :average critical section execution time.

Safety: only one process in the CS at a time.
Liveness: no starvation or deadlock.
Fairness: requests are served fairly (e.g., in timestamp order).

Performance metrics
Message complexity: # messages per CS execution
Synchronisation delay(SD): T(site A enters CS) - T(site B leaves CS)
Response time: T( CS execution complete) - T( CS request made)
time does not include the time a request waits at a site before its request messages
have been sent out.
System throughput: 1 / (SD + E)

Best performance at low load.


Worst performance at high load.

Low timestamp → High priority.

Non-token based approach


Lamport’s Algorithm (FIFO)
● Requests for CS are executed in the increasing order of timestamps(logical clocks).
● Every site Si keeps a queue, request_queuei, which contains mutual exclusion
requests ordered by their timestamps.

Why do we need both L1 and L2?


Correctness
● Lamport’s algorithm achieves mutual exclusion.
Proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently.
For this to happen, conditions L1 and L2 must hold at both sites concurrently. This
implies that at some instant in time, say t, both Si and Sj have their own requests at the top
of their request_queues and condition L1 holds at them. Without loss of generality, assume
that Si's request has a smaller timestamp than the request of Sj. From condition L1 and the
FIFO property of the communication channels, it is clear that at instant t the request of Si
must be present in request_queuej when Sj was executing its CS. This implies that Sj's
own request is at the top of its own request_queue when a smaller timestamp request, Si's
request, is present in request_queuej – a contradiction! Hence, Lamport's algorithm
achieves mutual exclusion.

● Lamport’s algorithm is fair.


Proof: A distributed mutual exclusion algorithm is fair if the requests for the CS are
executed in the order of their timestamps. The proof is by contradiction. Suppose a site
Si's request has a smaller timestamp than the request of another site Sj, and Sj is able to
execute the CS before Si. For Sj to execute the CS, it has to satisfy conditions L1 and L2.
This implies that at some instant in time Sj has its own request at the top of its queue and
it has also received a message with timestamp larger than the timestamp of its request
from all other sites. But the request_queue at a site is ordered by timestamp, and
according to our assumption Si has the lower timestamp. So Si's request must be placed
ahead of Sj's request in request_queuej. This is a contradiction. Hence Lamport's
algorithm is a fair mutual exclusion algorithm.

Performance
● For each CS execution, Lamport’s algorithm requires (N − 1) REQUEST
messages, (N − 1) REPLY messages, and (N − 1) RELEASE messages. i.e
3(N − 1) messages per CS invocation.
● Synchronisation delay in the algorithm is T .
Optimisation: a site need not send a REPLY if it has already sent its own REQUEST with a
timestamp larger than that of the incoming request; that REQUEST conveys the same
information as the REPLY would.

Performance: 2(N − 1) messages per CS execution
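A sketch of one site's bookkeeping for Lamport's algorithm, assuming reliable FIFO channels and injected `broadcast`/`send` callbacks (requests are (timestamp, site id) pairs; the CS is entered when `can_enter_cs()` becomes true):

```python
import heapq

class LamportMutex:
    def __init__(self, site_id: int, n: int, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.clock = 0
        self.queue = []                 # heap of (timestamp, site_id)
        self.last_seen = {}             # site -> timestamp of last msg from it
        self.my_request = None

    def request_cs(self):
        self.clock += 1
        self.my_request = (self.clock, self.id)
        heapq.heappush(self.queue, self.my_request)
        self.broadcast(("REQUEST", self.my_request))

    def on_message(self, sender: int, kind: str, ts):
        self.clock = max(self.clock, ts[0]) + 1
        self.last_seen[sender] = ts[0]
        if kind == "REQUEST":
            heapq.heappush(self.queue, ts)
            self.clock += 1
            self.send(sender, ("REPLY", (self.clock, self.id)))
        elif kind == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != sender]
            heapq.heapify(self.queue)
        # a REPLY needs no action beyond the clock/last_seen update above

    def can_enter_cs(self) -> bool:
        # L1: own request at the head; L2: heard something newer from all.
        return (self.my_request is not None
                and self.queue and self.queue[0] == self.my_request
                and all(self.last_seen.get(s, -1) > self.my_request[0]
                        for s in range(self.n) if s != self.id))

    def release_cs(self):
        self.queue = [r for r in self.queue if r != self.my_request]
        heapq.heapify(self.queue)
        self.my_request = None
        self.clock += 1
        self.broadcast(("RELEASE", (self.clock, self.id)))
```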

Ricart–Agrawala algorithm (Non-FIFO)


Two types of msg: REQUEST and REPLY.

If pi is waiting for the CS and receives a REQUEST from pj, and pj's REQUEST has lower
priority than pi's own pending request, then pi defers the REPLY to pj and sends the
REPLY message to pj only after executing the CS for its pending request.

● Processes use Lamport-style logical clocks to assign a timestamp to critical section


requests and timestamps are used to decide the priority of requests.
● Each process pi maintains the Request-Deferred array, RDi, whose size is the
same as the number of processes in the system.
● Initially, ∀i ∀j: RDi[j] = 0. Whenever pi defers the request sent by pj, it sets
RDi[j] = 1, and after it has sent a REPLY message to pj, it sets RDi[j] = 0.

Explanation:
Sj, on getting a request from Si, sends a REPLY when:
1. Sj is not in the CS,
2. Sj is not requesting the CS, or
3. Sj is requesting the CS and TS(Sj) > TS(Si).
Otherwise, Sj defers the reply.

Note: Reply from every site.

Notes:
● Site receives a message, it updates its clock using the timestamp in the message.
● Site takes up a request for the CS for processing, it updates its local clock and
assigns a timestamp to the request.

In the Ricart-Agrawala algorithm, for every requesting pair of sites, the site with the
higher priority request will always defer the request of the lower priority site. At any time,
only the highest priority request succeeds in getting all the needed REPLY messages.
Performance
● For each CS execution, Ricart-Agrawala algorithm requires (N − 1)
REQUEST messages and (N − 1) REPLY messages. i.e. 2(N − 1) messages
per CS execution.
● Synchronisation delay in the algorithm is T .(max. message transmission time)
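A sketch of the reply/defer rule with the request-deferred array RD, again assuming injected `broadcast`/`send` callbacks; timestamps are (clock, site id) pairs compared lexicographically:

```python
class RicartAgrawala:
    def __init__(self, site_id: int, n: int, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.clock = 0
        self.requesting = False
        self.my_ts = None
        self.replies = 0
        self.deferred = [False] * n          # the RD array

    def request_cs(self):
        self.clock += 1
        self.my_ts = (self.clock, self.id)
        self.requesting = True
        self.replies = 0                     # enter CS at N - 1 replies
        self.broadcast(("REQUEST", self.my_ts))

    def on_request(self, j: int, ts):
        self.clock = max(self.clock, ts[0]) + 1
        # Defer iff we are requesting (or executing) and have priority.
        if self.requesting and self.my_ts < ts:
            self.deferred[j] = True
        else:
            self.send(j, "REPLY")

    def on_reply(self):
        self.replies += 1                    # CS allowed at self.n - 1

    def release_cs(self):
        self.requesting = False
        for j in range(self.n):              # answer all deferred requests
            if self.deferred[j]:
                self.deferred[j] = False
                self.send(j, "REPLY")
```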

Do I need a heap? No.
Will it work for non-FIFO channels? Yes.
Will it fail in case of re-entry into the CS? No; every CS request is treated as a new request.
What if I am done with my CS and want to re-enter the CS: do I need to release the CS and
restart by sending requests, or can I be smarter?

Roucairol-Carvalho Algorithm (Prof A algo)

Once site i has received a REPLY from site j, it does not need to send a REQUEST to j
again to re-enter the CS, unless it has already sent a REPLY to j after the first CS (in
response to a REQUEST from j).

Message complexity: 0 to 2(n − 1), depending on the request pattern; the worst-case
message complexity is still the same.
No starvation: a repeated CS execution by Pi is concurrent with Pj's request, so either
ordering between them is consistent; no two processes are ever in the CS at the same
time, and there is no starvation.

Quorum-Based Mutual Exclusion Algorithms


Maekawa’s Algorithm
Unlike Ricart-Agrawala, permission is taken from only a subset of sites (the request set Ri).

Conditions (standard formulation):
M1: ∀i ∀j: Ri ∩ Rj ≠ ∅ (any two request sets intersect).
M2: ∀i: Si ∈ Ri.
M3: ∀i: |Ri| = K.
M4: every site Sj is contained in exactly K request sets.

M1 and M2 are necessary for correctness;
M3 and M4 provide desirable features for the algorithm.

Condition M3: states that the size of the requests sets of all sites must be equal
(all sites should have to do an equal amount of work to invoke mutual exclusion).
Condition M4: Exactly the same number of sites should request permission from any site,
(all sites have “equal responsibility” in granting permission to other sites)
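Maekawa's original request sets come from finite projective planes and have size about √N. A simpler illustrative construction, when N is a perfect square, is the grid quorum (the row plus the column of each site), which satisfies M1-M4 with sets of size 2√N − 1:

```python
from math import isqrt

def grid_quorum(i: int, n: int) -> set[int]:
    """Request set R_i: the row and column of site i in a sqrt(n) x sqrt(n)
    grid. Any two such sets intersect (M1), each contains its own site (M2),
    all have equal size 2*sqrt(n) - 1 (M3), and every site appears in the
    same number of sets (M4)."""
    k = isqrt(n)
    assert k * k == n, "this sketch assumes n is a perfect square"
    row, col = divmod(i, k)
    return {row * k + c for c in range(k)} | {r * k + col for r in range(k)}

# Any two quorums share at least one site:
assert grid_quorum(0, 9) & grid_quorum(5, 9)
```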
Message Complexity: 3√N per CS invocation:
√N REQUEST, √N REPLY, and √N RELEASE messages.
Synchronisation delay = 2 × (max message transmission time)

[since a RELEASE must first reach the arbitrating site, and then a REPLY must reach the
next queued requester.]
Major problem: DEADLOCK possible
Need three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock.
Message complexity can be 5*√N

Show deadlock (exercise).
Maekawa 2:
Handling deadlock:

A FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request
because it has currently granted permission to a site with a higher priority request.

An INQUIRE message from Si to Sj indicates that Si would like to find out from Sj
whether it has succeeded in locking all the sites in its request set.

A YIELD message from Si to Sj indicates that Si is returning the permission to Sj (so that
Sj can grant it to a higher priority request).

Solve Problem

Token-Based Algorithms
Suzuki–Kasami’s broadcast algorithm

1) if a site that wants to enter the CS does not have the token, it broadcasts a
REQUEST message for the token to all other sites.
2) A site that possesses the token sends it to the requesting site upon the receipt of its
REQUEST message.
a) If a site receives a REQUEST message when it is executing the CS, it sends
the token only after it has completed the execution of the CS.

Design issues in the above simple algorithm:

1) How to distinguish an outdated REQUEST message from a current
REQUEST message.
If a site cannot determine whether the request corresponding to a token request has been
satisfied, it may dispatch the token to a site that does not need it. This does not violate
correctness, but it seriously degrades performance by wasting messages and increasing
the delay at sites that are genuinely requesting the token.
2) How to determine which site has an outstanding request for the CS.
After a site has finished executing the CS, it must determine which sites have an
outstanding request for the CS so that the token can be dispatched to one of them. The
problem is complicated because when a site Si receives a token request message from a
site Sj, site Sj may have an outstanding request for the CS. However, after the
corresponding request for the CS has been satisfied at Sj, the issue is how to inform site Si
(and all other sites) about it efficiently.

Solution of:
1) Outdated Request: we add a sequence number to the request, i.e., REQUEST(j) is
changed to REQUEST(j, n), where the sequence number n indicates that site Sj is
requesting its nth execution of the CS. To check whether a received request is old, every
site Si maintains an array RNi[1,...,N], where RNi[j] denotes the largest sequence
number received so far in a REQUEST message from site Sj; on receiving
REQUEST(j, n), Si sets RNi[j] = max(RNi[j], n). Thus, when site Si receives a
REQUEST(j, n) message, the request is outdated if RNi[j] > n.
2) Pending (Outstanding) Request: rather than maintaining a queue at every site, we
maintain a queue Q of sites requesting the CS in the token itself, and an array of integers
LN[1,...,N], where LN[j] is the sequence number of the request that site Sj executed
most recently. After executing its CS, a site Si updates LN[i] (present in the
token) := RNi[i] to indicate that its request corresponding to sequence number RNi[i]
has been executed.

The token array LN[1,...,N] permits a site to determine whether a site has an
outstanding request for the CS. Note that at site Si, if RNi[j] = LN[j] + 1, then site Sj is
currently requesting the token.

After executing the CS, a site checks this condition for all j to determine all the
sites that are requesting the token, and places their ids in queue Q if these ids are
not already present in Q. Finally, the site sends the token to the site whose id is at
the head of Q.
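A sketch of one site's Suzuki-Kasami logic, with the token represented as a dict carrying the LN array and the queue Q; `broadcast` and `send` are callbacks from an assumed messaging layer:

```python
from collections import deque

class SuzukiKasami:
    def __init__(self, site_id: int, n: int, broadcast, send):
        self.id, self.n = site_id, n
        self.broadcast, self.send = broadcast, send
        self.rn = [0] * n                # RN_i: highest request # seen per site
        self.token = None                # e.g. {"ln": [0]*n, "q": deque()}
        self.in_cs = False

    def request_cs(self):
        if self.token is not None:
            self.in_cs = True            # already hold the (idle) token
        else:
            self.rn[self.id] += 1
            self.broadcast(("REQUEST", self.id, self.rn[self.id]))

    def on_request(self, j: int, n_seq: int):
        self.rn[j] = max(self.rn[j], n_seq)      # outdated if rn[j] > n_seq
        tok = self.token
        # Idle token here, and j's request is the next unserved one: pass it.
        if tok is not None and not self.in_cs and self.rn[j] == tok["ln"][j] + 1:
            self.token = None
            self.send(j, ("TOKEN", tok))

    def release_cs(self):
        self.in_cs = False
        tok = self.token
        tok["ln"][self.id] = self.rn[self.id]    # our request is now served
        for j in range(self.n):                  # enqueue outstanding requests
            if j != self.id and self.rn[j] == tok["ln"][j] + 1 and j not in tok["q"]:
                tok["q"].append(j)
        if tok["q"]:
            j = tok["q"].popleft()
            self.token = None
            self.send(j, ("TOKEN", tok))
```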
Correctness
Mutual exclusion is guaranteed because there is only one token in the system
and a site holds the token during the CS execution.
A requesting site enters the CS in finite time.
Proof: Token request messages of a site Si reach other sites in finite time. Since
one of these sites will have the token in finite time, site Si's request will be placed in
the token queue in finite time. Since there can be at most N − 1 requests in front
of this request in the token queue, site Si will get the token and execute the CS
in finite time.

Performance
No messages are needed if the site already holds the idle token; otherwise the algorithm
requires N messages per CS invocation (N − 1 REQUEST broadcasts plus the token
transfer), and the synchronisation delay is 0 or T.

Does it work for FIFO channels and non-FIFO channels?


• Pi has the token. Pj has sent a request for the token. Can Pj enter the
CS before Pj's request reaches Pk?
• Say Pj wants to get into the CS again later. Can this request reach
Pk before the earlier one, which has not reached yet?
• How do I know that an outdated request has reached me?
• Why am I broadcasting my request to everyone?
• Is starvation possible?

Pi has the token. When Pi is done, he gets a request from Pj and gives Pj
the token. Pj's request to Pk has not reached Pk yet, but Pj gets the
token and executes. Pj wants to get into the CS again later, and this
request might reach Pk before the earlier one which has not
reached… this is possible!
Every node should know my request, so that whoever has the token can
record my request on the token. That's the only way I can get a
chance.
How to know if an outdated request has reached me? In that case
RNi[j] <= LN[j]. So you just store whatever you get, and when you
get the token you serve a request from j only if RNi[j] = LN[j] + 1, since
that is the next request after the last time the token was given to j.

Show no starvation:
- Process j, who is interested, broadcasts its request. If it reaches
Pk while the token is with Pk, then Pk will add the request to the token's
queue. Otherwise, eventually one of the processes that received the request
will get the token. So eventually process j will make it into the token's queue.

Raymond’s tree-based algorithm


Under light load the algorithm exchanges only O(log N) messages per CS invocation;
under heavy load, approximately four messages, where N is the number
of nodes in the network.
Works for non-FIFO channels also.

Assumptions:
1) The underlying network guarantees message delivery (delays allowed).
2) All nodes are reliable.

The nodes form an (unrooted) tree; logically, the current token holder acts as the root,
and each node maintains a pointer toward it.

A node needs the privilege to enter the CS; the privilege is exchanged in the form of a
PRIVILEGE message.
Mutual exclusion holds because there is a single token (privilege).
Deadlock is impossible.
Starvation is impossible.
Cost and performance analysis: see the slides.
Chapter 9 Exercise:
9.1) Consider the following simple method to enforce mutual exclusion: all
sites are arranged in a logical ring and a unique token circulates around the
ring, hopping from one site to another. When a site needs to execute its CS, it waits
for the token, grabs the token, executes the CS, and then dispatches the token to the
next site on the ring. If a site does not need the token on its arrival, it immediately
dispatches the token to the next site (in zero time).
1) What is the response time when the load is low?

When the load is low, the response time is small, as the token quickly reaches the site
that needs to execute its critical section (CS) and the waiting time for the token is low.
The response time is roughly T + E, where T is the time taken for the token to reach the
site that needs to execute its CS and E is the time to execute the CS.

2) What is the response time when the load is heavy?

When the load is heavy, the response time increases, as the token needs to pass
through multiple sites before reaching the site that wants to execute its CS, and under
heavy load each of those sites executes its own CS before forwarding the token. The
response time is therefore roughly (N − 1) * (T + E): the token passes through up to
N − 1 other sites, each adding a hop delay T and a CS execution E, before the site in
question executes its own CS. This is an approximate calculation, and the actual response
time can vary based on the specific implementation and system characteristics.

9.2) In Lamport’s algorithm, condition L1 can hold concurrently at several sites. Why
do we need this condition for guaranteeing mutual exclusion?

In Lamport's algorithm, condition L1 (a site's own request is at the top of its
request_queue) can hold concurrently at several sites, because each site's queue is
updated only as messages arrive, so two sites may each temporarily see their own request
as the earliest.

L1 is still needed: it ensures a site enters the CS only when, to its knowledge, no request
with a smaller timestamp is outstanding. Combined with L2 (the site has received a
message with a larger timestamp from every other site, so no smaller-timestamped
request can still be in transit to it), the two conditions cannot hold simultaneously at two
sites for conflicting requests; this is what guarantees that no two sites enter the CS at the
same time.

9.3) Show that in Lamport’s algorithm if a site Si is executing the critical section, then
Si ’s request need not be at the top of the request_queue at another site Sj.

Yes: Si can be executing the CS while its request is not at the top of request_queuej at
some other site Sj.

For Si to execute the CS, L1 and L2 must hold at Si: Si's request is at the top of its own
request_queue, and Si has received a message with a larger timestamp from every other
site.

Now consider a third site Sk whose request had a smaller timestamp than Si's and has
already been executed and released. Si has received Sk's RELEASE message, so Sk's
request has been removed from request_queuei, allowing Si's request to reach the top
and L1 to hold at Si. However, Sk's RELEASE message to Sj may still be in transit. Until
it arrives, Sk's old request is still present in request_queuej ahead of Si's request (the
queue is ordered by timestamp).

So at that instant Si is executing the CS while its request is not at the top of
request_queuej: messages in transit can make the queues at different sites temporarily
inconsistent, which is exactly why both L1 and L2 are needed.

9.4) What is the purpose of a REPLY message in Lamport’s algorithm? Note that it is
not necessary that a site must always return a REPLY message in response to a
REQUEST message. State the condition under which a site does not have to return
REPLY message. Also, give the new message complexity per critical section
execution in this case.

A REPLY message in Lamport's algorithm tells the requesting site that the replying site
has no outstanding request with a smaller timestamp; receiving a message with a larger
timestamp from every other site is exactly condition L2, which the requester needs before
it can enter the CS.

A site Sj does not have to return a REPLY to Si's REQUEST if Sj has already sent its own
REQUEST message with a timestamp higher than that of Si's request: that REQUEST
message serves the same purpose as the REPLY.

The message complexity per CS execution is then reduced, since some REPLY messages
are suppressed: it lies between 2(N − 1) (only REQUEST and RELEASE messages) and
3(N − 1) in the general case.
9.5) Show that in the Ricart–Agrawala algorithm the critical section is accessed in
increasing order of timestamp. Does the same hold in Maekawa’s algorithm?

In the Ricart-Agrawala algorithm, the critical section is accessed in increasing order of
timestamp because each site assigns a timestamp to its REQUEST based on its logical
clock, and a site with a pending request defers its REPLY to every request with a larger
timestamp. A request can therefore gather all N − 1 REPLYs only when it is the smallest
outstanding request.

This ensures that the critical section is accessed in the order of increasing timestamps:
the request with the lowest timestamp is granted access first, followed by the request
with the next lowest timestamp, and so on.

In Maekawa's algorithm, the critical section is not necessarily accessed in increasing
order of timestamp. A site grants permission (sends REPLY) to the first request it
receives and queues the others, so a request with a larger timestamp may be granted by
its whole quorum before a smaller-timestamped request even arrives at those sites.

Therefore, in Maekawa's algorithm the critical section may be accessed in a different
order than increasing timestamp order; this is also why deadlocks can arise and why the
FAILED/INQUIRE/YIELD messages are needed in the deadlock-free version.

9.6) Mutual exclusion can be achieved using the following simple method in a distributed
system (called the "centralized" mutual exclusion algorithm): to access the shared
resource, a site sends the request to the site that contains the resource. This site executes
the requests using any classical method for mutual exclusion (like semaphores). Discuss
what prompted Lamport's mutual exclusion algorithm even though it requires many more
messages (3(N − 1) as compared to only 3).

The centralized mutual exclusion algorithm is a simple solution to achieving mutual exclusion
in a distributed system, but it has some limitations. In this algorithm, a site sends a request
to the site that contains the shared resource, which is responsible for executing the requests
and ensuring mutual exclusion.

However, this approach has several drawbacks. Firstly, the site that contains the shared
resource becomes a single point of failure, as all other sites depend on it to access the
shared resource. This means that if this site fails, the entire system fails. Secondly, this
approach does not scale well, as the site that contains the shared resource becomes a
bottleneck as the number of sites increases.
Lamport's mutual exclusion algorithm was proposed as a solution to these limitations. It does
not rely on a centralized site, but instead uses a distributed algorithm to achieve mutual
exclusion. The algorithm uses a combination of REQUEST, REPLY, and RELEASE
messages to coordinate access to the shared resource.

Although Lamport's algorithm requires 3(N − 1) messages per CS execution versus 3 for
the centralized scheme, it has several advantages. Firstly, it is decentralized: there is no
single point of failure. Secondly, the work is spread evenly over all sites instead of being
concentrated at one coordinator, so no site becomes a performance bottleneck. Finally, it
is fair: requests are served in increasing timestamp order, essentially the order in which
they were made, whereas a central coordinator serves requests in whatever order they
happen to arrive, which can be skewed by unequal network delays.

In summary, Lamport's mutual exclusion algorithm was proposed to overcome the
limitations of the centralized mutual exclusion algorithm, trading extra messages for a
decentralized, fault-isolated, and fair solution to mutual exclusion in a distributed system.

9.7) Show that in Lamport’s algorithm the critical section is accessed in increasing
order of timestamp.

In Lamport's algorithm every request carries a timestamp (tsi, i) taken from the requesting
site's logical clock, and these pairs are totally ordered. Suppose, for contradiction, that site
Si is executing the CS while a request from Sj with a smaller timestamp is outstanding.

To enter the CS, Si must have satisfied condition L1: it received from Sj some message with
a timestamp larger than Si's own request. Since a site's clock never decreases, Sj sent its
smaller-timestamped request before it sent that message; channels are FIFO, so Sj's
request reached Si first and was placed in Si's request_queue. But then Si's request,
having a larger timestamp, was not at the top of Si's queue, violating condition L2, a
contradiction. Hence when a site enters the CS, no request with a smaller timestamp exists
anywhere in the system, and the CS is accessed in increasing order of timestamp.

In this way, Lamport's algorithm ensures that the critical section is accessed in a mutually
exclusive manner and in increasing order of timestamp, avoiding any conflicts or race
conditions in accessing the shared resource.

9.8) Show by examples that the staircase configuration among sites is preserved in
Singhal's dynamic mutual exclusion algorithm when two or more sites request the CS
concurrently and have executed the CSs.

The staircase configuration among sites is a key property of Singhal's dynamic mutual
exclusion algorithm: arranged suitably, the sites' request sets grow in size step by step, like
a staircase, and this pattern (with the sites reordered) is re-established after every CS
execution. The configuration governs the order in which sites access the critical section
(CS) and is maintained even when two or more sites request the CS concurrently and have
executed their respective CSs.
Here's an example to illustrate this property:

Suppose there are three sites, A, B, and C, and their current timestamps are 10, 20, and 30
respectively.

1. Site A requests the CS, sending a REQUEST message with timestamp 10.
2. Site B requests the CS, sending a REQUEST message with timestamp 20.
3. Site C requests the CS, sending a REQUEST message with timestamp 30.

In Singhal's algorithm, the site with the lowest timestamp value will be granted access to the
CS first. In this case, site A with timestamp 10 will be granted access to the CS first, followed
by site B with timestamp 20, and finally site C with timestamp 30.

This is an example of the staircase configuration among sites being preserved even when
multiple sites request the CS concurrently. In this example, the order of accessing the CS is
A, B, and C, preserving the staircase configuration.

Another example to illustrate this property:

Suppose there are four sites, D, E, F, and G, and their current timestamps are 5, 10, 15, and
20 respectively.

1. Sites D and E request the CS concurrently, sending REQUEST messages with
timestamps 5 and 10 respectively.
2. Site F requests the CS, sending a REQUEST message with timestamp 15.
3. Site G requests the CS, sending a REQUEST message with timestamp 20.

In this case, the sites will be granted access to the CS in the order of D, E, F, and G,
preserving the staircase configuration.

These examples demonstrate that in Singhal's dynamic mutual exclusion algorithm, the
staircase configuration among sites is preserved, even when multiple sites request the CS
concurrently and have executed their respective CSs.

Correctness (proof of correctness of chandy lamport algo for global snapshot)

To prove the correctness of the algorithm, we show that a recorded snapshot satisfies conditions C1
and C2. Since a process records its snapshot when it receives the first marker on any incoming
channel, no messages that follow markers on the channels incoming to it are recorded in the process’s
snapshot.

Once a process has received the first marker on any channel, it does not include messages
that arrive after that in its recorded snapshot. This is exactly what is needed: any message
sent after the marker was sent must not be included in the snapshot, and because channels
are FIFO, such messages can only arrive after the marker.

Thus, condition C2 is satisfied.

When a process pj receives a message mij that precedes the marker on channel Cij, it acts
as follows: if pj has not taken its snapshot yet, it includes mij in its recorded snapshot;
otherwise, it records mij in the state of the channel Cij. Thus, condition C1 is satisfied.
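The marker rules that this proof refers to can be sketched in Python as follows; the SnapshotProcess class and the injected send callback are illustrative assumptions, not a fixed API:

class SnapshotProcess:
    # Sketch of the Chandy-Lamport marker rules; names are illustrative.
    def __init__(self, pid, incoming, outgoing, send):
        self.pid = pid
        self.incoming = incoming          # ids of incoming channels
        self.outgoing = outgoing          # ids of outgoing channels
        self.send = send                  # send(channel_id, msg)
        self.state = None                 # local application state
        self.recorded = False
        self.channel_state = {}           # channel id -> messages recorded for it
        self.recording = set()            # channels currently being recorded

    def record_snapshot(self):
        self.recorded = True
        self.snapshot = self.state        # record own state
        for c in self.outgoing:
            self.send(c, "MARKER")        # marker on every outgoing channel

    def on_marker(self, channel):
        if not self.recorded:             # first marker seen on any channel:
            self.record_snapshot()        # record state; this channel is empty
            self.channel_state[channel] = []
            self.recording = set(self.incoming) - {channel}
            for c in self.recording:      # start buffering on all other channels
                self.channel_state[c] = []
        else:                             # later marker: channel state is whatever
            self.recording.discard(channel)   # was buffered before this marker

    def on_basic_message(self, channel, msg):
        if channel in self.recording:     # a message that precedes the marker
            self.channel_state[channel].append(msg)
        # ... then process msg as part of the underlying computation

Messages buffered in on_basic_message are exactly the mij of the proof: sent before the sender's marker but received after the receiver recorded its own state.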

Complexity

The recording part of a single instance of the algorithm requires O(e) messages and O(d) time, where
e is the number of edges in the network and d is the diameter of the network.

Deadlock detection in dist system


Progress (no undetected deadlocks): The algorithm must detect all existing deadlocks in a finite
time. Once a deadlock has occurred, the deadlock detection activity should continuously progress
until the deadlock is detected. In other words, after all wait-for dependencies for a deadlock have
formed, the algorithm should not wait for any more events to occur to detect the deadlock.

Safety (no false deadlocks): The algorithm should not report deadlocks that do not exist (called
phantom or false deadlocks).

If a set S of processes is deadlocked, the following conditions hold true:

1. Each of the processes in the set S is blocked.

2. The dependent set of each process in S is a subset of S.

3. No grant message is in transit between any two processes in the set S.


Deadlock detection:

Path-pushing algorithms: the WFG is built at each site by circulating path information. Ho
and Ramamoorthy.

Edge-chasing algorithms: a probe message is used to detect deadlock: if a process sees its
own probe again, there is a deadlock. Only blocked processes forward probes, and probe
messages are short and of fixed, small size. Mitchell–Merritt.

Diffusing-computation-based algorithms: use echoes to detect deadlock. The initiator
detects the deadlock and no WFG is built; a process cannot send a reply until it has
received replies to the queries it sent out. Chandy–Misra–Haas algorithm for the OR model.

Global state detection-based algorithms:

(i) the detection algorithm should not cause the underlying computation to freeze;

(ii) a consistent global state need not coincide with any state the system actually passed
through.

Take a global snapshot and check it for the stable deadlock condition.

Mitchell–Merritt for the single-resource model (no phantom deadlocks, all deadlocks
detected).

Only one process in a cycle detects the deadlock.

The private label of a process is unique to it but may change (it only increases).

Detection of a cycle is automatic: no process has to keep polling for it.


Transmit propagates larger labels in the opposite direction of the wait-for edges by sending
a probe message.

Whenever a process receives a probe that is less than its public label, it simply ignores that
probe.

Detect means that the probe carrying the private label of some process has returned to that
process, indicating a deadlock.
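The Block/Transmit/Detect rules can be sketched in Python on (public, private) label pairs. fresh_label below is a simplified stand-in for the paper's globally unique, ever-increasing labels:

class MMProcess:
    # Sketch of Mitchell-Merritt labels in the single-resource model.
    def __init__(self):
        self.public = 0
        self.private = 0
        self.waiting_on = None        # at most one outgoing wait-for edge

def fresh_label(a, b):
    # stand-in for a globally unique value strictly larger than both inputs
    return max(a, b) + 1

def block(p, q):
    # Block: p starts waiting on q; both of p's labels jump above both publics
    p.waiting_on = q
    p.public = p.private = fresh_label(p.public, q.public)

def transmit(p):
    # Transmit: larger public labels flow opposite to the wait-for edges
    q = p.waiting_on
    if q is not None and p.public < q.public:
        p.public = q.public           # private label is left unchanged

def detect(p):
    # Detect: p's own (public == private) label has travelled around a
    # cycle and come back, so p is the process that closed the cycle
    q = p.waiting_on
    return q is not None and q.public == p.public == p.private

Repeatedly applying transmit around a cycle drives the largest label around it until detect fires at the process that blocked last.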
Message Complexity:

The worst-case complexity of the algorithm is s(s - 1)/2 Transmit steps, where s is the number of
processes in the cycle.

Note: for all processes u/v: v <= u

Lemma 10.1: For any process u/v, if u > v, then u was set by a Transmit step.

1. Can a node be part of multiple cycles? No: in the single-resource model each process
waits on at most one other process, so every node has at most one outgoing edge and can
lie on at most one cycle.

2. Suppose a node has added the last edge of a cycle and hence started a transmit. Can its
private and public labels change by the time the transmit reaches it again?
No: the node that added the last edge has the highest public and private values in the
cycle, so no incoming probe can change them.

3. Does the start of a transmit imply the presence of a cycle? No.

4. Can another node also carry your public value even though it is not in a cycle?

Yes: a Transmit copies public labels backwards along wait-for edges, so in
P1 → P2 → P3 the node P1 can adopt P2's public label even though P1 lies on no cycle.
5. In case of a cycle, the process that detects the deadlock will have the highest label in the
cycle. T/F?

True: the node that blocked last has the highest (public = private) label in the cycle, and it
is the one that detects the cycle.
6. Can a process ignore a Transmit probe if the value is less than its own?

Yes: whenever a process receives a probe that is less than its public label, it simply
ignores that probe.

H.W Can there be phantom deadlocks?

H.W Is the last process to block same as the one who detects the deadlock?

Proof of correctness:
Using priority:

Working:

If u > v: no transmission.

If u < v: transmit. Let the new public tuple of the blocked node be (a, b). Then a = v,
because the transmission must go through, and b should be the lowest priority in the
transmitted chain, so that when the label travels around a cycle, the lowest-priority process
finds its own public label and priority number and aborts.

If u == v: then, if p > q, transmission should continue so that the lower-priority process
aborts. Because p > e > d > c > b > a.
Chandy–Misra–Haas algorithm for the AND model
Performance analysis
In the algorithm, one probe message (per deadlock detection initiation) is sent on every edge of
the WFG which connects processes on two sites. Thus, the algorithm exchanges at most m(n −
1)/2 messages to detect a deadlock that involves m processes and spans over n sites. The size of
messages is fixed and is very small (only three integer words). The delay in detecting a deadlock
is O(n).
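A compact Python sketch of the AND-model probe rule (the helper names send_probe and declare_deadlock are illustrative, and the full algorithm's "has not replied to Pj" check is elided for brevity):

def initiate(procs, i, send_probe):
    # A blocked process Pi starts detection by probing everyone it waits for.
    for j in procs[i].wait_for:
        send_probe((i, i, j))                # probe(initiator, sender, destination)

def on_probe(procs, probe, send_probe, declare_deadlock):
    init, sender, dest = probe
    pk = procs[dest]
    if not pk.blocked:
        return                               # active processes discard probes
    if dest == init:
        declare_deadlock(init)               # the probe came back: Pi is deadlocked
        return
    if not pk.dependent.get(init):
        pk.dependent[init] = True            # Pk now known to depend on the initiator
        for m in pk.wait_for:                # forward the probe along every wait-for edge
            send_probe((init, dest, m))

Here procs maps process ids to objects carrying blocked, dependent, and wait_for fields; the dependent flags are what keep each edge from carrying more than one probe per detection.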
Chandy–Misra–Haas algorithm for the OR model
Performance analysis
For every deadlock detection, the algorithm exchanges e query messages and e reply messages,
where e = n(n − 1) is the number of edges.
( c ) here means condition of (f)

Class questions:
Before Chandy–Misra–Haas
● What happens if there is no deadlock?
● How will Pi conclude that there is no deadlock?
● Something needs to be done to reset the dependency vector values for a future probe
i. What can be done?
● If a process is deadlocked because it is waiting on a cycle, then it won't get its probe
back though deadlocked - true/false?
● After the probe passes, is it not possible that the edges are removed and hence there
is truly no cycle though the probe msg returns? That is, is a phantom deadlock
detected?
● Consider the reverse graph (all edges reversed). If we run the algorithm now, is it
possible that we detect a false deadlock, as edges may change?
● Number of messages? m processes and n sites.

Exercise question:
Exercise 10.1 Consider the following simple approach to handle deadlocks in distributed
systems by using “time-outs”: a process that has waited for a specified period for a resource
declares that it is deadlocked and aborts to resolve the deadlock. What are the shortcomings of
using this method?
Exercise 10.2 Suppose all the processes in the system are assigned priorities which can be used
to totally order the processes. Modify Chandy et al.’s algorithm for the AND model so that when
a process detects a deadlock, it also knows the lowest priority deadlocked process.
Exercise 10.3 Show that, in the AND model, false deadlocks can occur due to deadlock
resolution in distributed systems [43]. Can something be done about it or they are bound to
happen?

Detecting a phantom deadlock means reporting a deadlock that does not actually exist.
Do a dry run of both Chandy algorithms.

Termination Detection
Conditions:
1. Execution of a TD algorithm cannot indefinitely delay the underlying computation;
that is, execution of the termination detection algorithm must not freeze the underlying
computation.
2. The termination detection algorithm must not require the addition of new
communication channels between processes.

Termination Detection by Weight Throwing

A distributed computation is said to be terminated at time instant t0 iff (i) all processes are
idle at t0, and (ii) there are no basic messages in transit at t0.

Correctness of Algorithm
Write the delay and all other info.

Exercise:
Exercise 7.1 Huang’s termination detection algorithm could be redesigned using a counter to
avoid the need of splitting weights. Present an algorithm for termination detection that uses
counters instead of weights.
Exercise 7.2 Design a termination detection algorithm that is based on the concept of weight
throwing and is tolerant to message losses. Assume that processes do not crash.
Exercise 7.3 Termination detection algorithms assume that an idle process can only be
activated on the reception of a message. Consider a system where an idle process can
become active spontaneously without receiving a message. Do you think a termination
detection algorithm can be designed for such a system? Give reasons for your answer.
Exercise 7.4 Design an efficient termination detection algorithm for a system where the
communication delay is zero.
Exercise 7.5 Design an efficient termination detection algorithm for a system where the
computation at a process is instantaneous (that is, all processes are always in the idle state).

Exercise 3.1 Why is it difficult to keep a synchronized system of physical clocks in
distributed systems?
Exercise 3.2 If events corresponding to vector timestamps Vt1, Vt2, ..., Vtn are
mutually concurrent, then prove that
(Vt1[1], Vt2[2], ..., Vtn[n]) = max(Vt1, Vt2, ..., Vtn).
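A quick numeric check of this identity in Python; the three vectors below are made up for illustration and are pairwise incomparable, i.e. the events are mutually concurrent:

# Toy check of the identity in Exercise 3.2 (example vectors are made up).
vts = [
    [3, 1, 0],   # Vt1: an event at p1
    [1, 4, 2],   # Vt2: an event at p2
    [2, 0, 5],   # Vt3: an event at p3
]
diagonal = [vts[i][i] for i in range(len(vts))]       # (Vt1[1], Vt2[2], Vt3[3])
componentwise_max = [max(col) for col in zip(*vts)]   # max(Vt1, Vt2, Vt3)
assert diagonal == componentwise_max == [3, 4, 5]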
Exercise 3.3 If events ei and ej respectively occurred at processes pi and pj and are
assigned vector timestamps VTei and VTej, respectively, then show that
ei → ej ⇔ VTei[i] < VTej[i].
Exercise 3.4 The size of matrix clocks is quadratic with respect to the system size.
Hence the message overhead is likely to be substantial. Propose a technique for matrix
clocks similar to that of Singhal–Kshemkalyani to decrease the volume of information
transmitted in messages and stored at processes.
An Efficient Causal Order Algorithm for Message Delivery in Distributed System, Jangt,
Park, Cho, and Yoon explain this

"An Efficient Causal Order Algorithm for Message Delivery in Distributed System" is a
research paper that proposes a new algorithm for achieving causal message ordering in
distributed systems. The authors of the paper are Jangt, Park, Cho, and Yoon.

In a distributed system, messages are sent between nodes, and the order in which these
messages are received can impact the correctness of the system. Causal ordering is a type
of ordering that preserves causality between events. Specifically, if event A causes event B,
then any message that carries information about event A should be received before any
message that carries information about event B.

The algorithm proposed in the paper is based on vector clocks, which are used to track the
causal relationships between events in a distributed system. Each node maintains a vector
clock, which is a vector of integers that represents the node's current knowledge of the state
of the system. When a node sends a message, it attaches its current vector clock to the
message.

The receiving node uses the vector clock to determine whether the message should be
delivered immediately or held until other messages are received. If the vector clock indicates
that the message depends on other messages that have not yet been received, the
message is held until those messages arrive.
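The delivery test described here is the standard vector-clock causal-delivery condition; a minimal Python sketch of it follows (this is the generic condition, not the paper's optimized variant):

def can_deliver(msg_vc, sender, local_vc):
    # msg_vc:   vector timestamp attached to the incoming message
    # sender:   index of the process that sent the message
    # local_vc: receiver's count of messages delivered from each process
    # The message must be the next one expected from its sender ...
    if msg_vc[sender] != local_vc[sender] + 1:
        return False
    # ... and must not causally depend on anything not yet delivered here.
    return all(msg_vc[k] <= local_vc[k]
               for k in range(len(local_vc)) if k != sender)

A message that fails this test is buffered and retried as other messages are delivered; delivering a message then increments local_vc[sender].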

The authors demonstrate that their algorithm is more efficient than previous causal
ordering algorithms: it reduces the number of messages that must be held back before
delivery, which lowers message latency and improves overall system performance while
still preserving the causal order of delivery.

_________________________________________________________________________
__
Shreyash
In regard to the Chandy–Misra–Haas algorithm, answer the following questions with proper reasons:
1. What happens if there is no deadlock?
2. How will Pi conclude that there is no deadlock?
3. Something needs to be done to reset the dependency vector values for a future probe i.
What can be done?
4. If a process is deadlocked because it is waiting on a cycle, then it won't get its probe back
though deadlocked - true/false?
5. After the probe passes, is it not possible that the edges are removed and hence there is
truly no cycle though the probe message returns? That is, is a phantom deadlock detected?
1. If there is no deadlock, the probes simply die out: an active (non-blocked)
process discards any probe it receives, so no probe ever returns to its
initiator, no deadlock is declared, and the system's normal operation is
unaffected.
2. Pi declares a deadlock only if its own probe returns, i.e., it receives a
probe(i, j, i). The algorithm sends no explicit "no deadlock" message: if the
probe never comes back, Pi simply does not declare a deadlock, and in
practice it can use a timeout and re-initiate detection later.
3. To reset the dependency information (the dependentk(i) flags) for a future
probe round, probes can carry a round or sequence number so that
information recorded in an earlier round is not confused with the current
one; alternatively, the flags can be cleared once a detection round ends.
4. True. If Pi is deadlocked because it waits, possibly transitively, on a cycle
that it is not itself part of, its probe enters the cycle and keeps circulating
there but never returns to Pi. Pi is deadlocked, yet it does not detect the
deadlock itself; only an initiator that lies on the cycle gets its probe back.
5. In the AND model this cannot happen spontaneously: a blocked process
waits for all of its requests, so once every edge of a cycle has formed, none
of those edges can disappear unless a process aborts. Hence if the probe
returns, the cycle it traversed still exists, and no phantom deadlock is
reported. False deadlocks can arise only when processes abort or are killed
while detection is in progress, for example during deadlock resolution.
