Mid Sem Prep Distributed System
Features:
Lots of computers, Perform Concurrently, Fail independently, Don't share a global clock.
Challenges
Unreliable communication; lack of global knowledge (no shared memory); lack of
synchronisation (different local clocks); concurrency control (implementing mutual exclusion /
critical sections); failure and recovery; deadlocks; termination detection; distributed file
systems are other concerns.
Puzzle
Consensus: All non-faulty processes must agree on a common value. If the initial value of
all non-faulty processes is v, then the agreed upon value must be v.
In a village with at least one blue-eyed person, there is no mirror in the village and people
don't talk to each other about eye colour. All blue-eyed people are asked to leave the town.
Each day the villagers come to a common place, see each other, and go back, and this
repeats every day. In how many days will all blue-eyed people realise that they have blue
eyes and leave the village?
There are two teams that want to battle each other, but they want to start the battle together,
i.e., a team starts only when it is sure that the other team is ready for battle (the Two
Generals problem).
● Team 1 sends a messenger to Team 2 but he cannot start because Team 1 is not
sure if Team 2 has received this message or not.
● Team 2 receives this messenger, then Team 2 sends a messenger to Team 1 (telling
that he has received the messenger) but he cannot start because Team 2 is not sure
if Team 1 has received this message or not.
● Team 1 receives this messenger, then Team 1 sends a messenger to Team 2 (telling
that he has received the messenger) but he cannot start because Team 1 is not sure
if Team 2 has received this message or not.
● …and so on: each acknowledgement itself needs an acknowledgement, so the
exchange repeats forever.
So the problem is finding when they will come to an agreement that both are ready for battle
and can start the battle.
Consensus with traitors included - the general too may be malicious (Byzantine
Agreement Problem)
Note: Clocks can easily drift seconds per day, accumulating significant errors over time.
Round Trip Time refers to the time duration between the start of a Request and the end of
the corresponding Response.
___________________
Cristian's Algorithm:
1) The process on the client machine sends the request for fetching clock time(time at the
server) to the Clock Server at time T0.
2) The Clock Server listens to the request made by the client process and returns the
response in the form of clock server time.
3) The client process receives the response from the Clock Server at time T1 and calculates
the synchronised client clock time using the formula below:
TCLIENT = TSERVER + (T1 - T0)/2
where T0 is the time at which the request was sent by the client process and T1 is the time at
which the response was received. T1 - T0 is the combined time taken by the network and the
server to transfer the request, process it, and return the response, assuming that the request
and response latencies are approximately equal.
The time at the client side therefore differs from the actual time by at most (T1 - T0)/2
seconds, i.e., the synchronisation error is at most (T1 - T0)/2 seconds.
Using iterative testing over the network, we can determine a minimum transfer time Tmin
(the minimum value of TREQUEST and TRESPONSE observed over several iterations) and
use it to formulate an improved synchronised clock time with less synchronisation error.
Having defined a minimum transfer time, we can say with high confidence that the server
time was generated after T0 + Tmin and before T1 - Tmin. The synchronisation error can
then be formulated as follows:
Error ∈ [-((T1 - T0)/2 - Tmin), (T1 - T0)/2 - Tmin]
We can refine this further by substituting Tmin with Tmin1 and Tmin2, where Tmin1 is the
minimum observed request time and Tmin2 is the minimum observed response time over the
network.
The synchronized clock time in this case can be calculated as:
TCLIENT = TSERVER + (T1 -T0)/2 + (Tmin2 -Tmin1)/2
So, by treating request and response latencies as separate quantities, we can improve the
synchronised clock time and hence decrease the overall synchronisation error. The number
of iterative tests to run depends on the overall clock drift observed.
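As a rough illustration, here is a minimal Python sketch of the basic Cristian exchange.
fetch_server_time is a hypothetical stand-in for the real network call, and the 5-second
server offset and the sleeps are made-up values:

```python
import time

def fetch_server_time():
    # Hypothetical stand-in for a network call to the clock server;
    # here we fake a server whose clock is 5 s ahead of ours.
    time.sleep(0.01)                  # simulated request latency
    server_time = time.time() + 5.0   # server timestamps the reply
    time.sleep(0.01)                  # simulated response latency
    return server_time

def cristian_sync():
    t0 = time.time()                  # request sent
    t_server = fetch_server_time()
    t1 = time.time()                  # response received
    rtt = t1 - t0
    # client sets its clock to TSERVER + RTT/2; error is at most RTT/2
    return t_server + rtt / 2.0, rtt / 2.0

synced, err_bound = cristian_sync()
print(f"synced clock = {synced:.3f}, error bound = +/-{err_bound:.3f} s")
```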
From Slides
Uses a time server to synchronise clocks. Time server keeps the reference time (say UTC)
A client asks the time server for time, the server responds with its current time, and the client
uses the received value T to set its clock
But network round-trip time introduces errors…
Let RTT = response-received-time – request-sent-time (measurable at the client).
If we know
(a) min = the minimum client-server one-way transmission time,
and
(b) that the server timestamped the message at the last possible instant before
sending it back,
then the actual time when the client receives the server's value T lies in
[T + min, T + RTT - min], so setting the clock to T + RTT/2 bounds the error by
±(RTT/2 - min).
Berkeley Algorithm — assumption: each machine/node in the network either doesn't have an
accurate time source or doesn't have access to an external time reference, so the nodes
agree on an averaged network time instead.
Algorithm
1) An individual node is chosen as the master node from the pool of nodes in the network.
This node acts as the master and the rest of the nodes act as slaves. The master node is
chosen using an election process/leader-election algorithm.
2) Master node periodically pings slave nodes and fetches clock time at them using
Cristian’s algorithm.
The diagram below illustrates how the master sends requests to slave nodes.
The diagram below illustrates how slave nodes send back time given by their system clock.
3) The master node calculates the average time difference between all the clock times
received and the clock time of its own system clock. This average time difference is added
to the current time at the master's system clock and broadcast over the network.
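A minimal sketch of the averaging step, assuming the RTT compensation from Cristian's
algorithm has already been applied to the fetched slave times (function and variable names
are illustrative):

```python
def berkeley_adjustments(master_time, slave_times):
    """Berkeley-style averaging (minimal sketch, RTT compensation omitted).

    master_time: master's own clock reading
    slave_times: clock readings fetched from the slaves
    Returns the adjustment each node should add to its clock.
    """
    all_times = [master_time] + slave_times
    avg = sum(all_times) / len(all_times)
    # each node (master included) shifts its clock by (avg - own_time)
    return [avg - t for t in all_times]

# Example: master at +0 s, slaves at +300 s and -300 s relative offsets.
print(berkeley_adjustments(0, [300, -300]))  # -> [0.0, -300.0, 300.0]
```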
Network Time Protocol (NTP) is a protocol that synchronises computer clock times across a
network. It is an application-layer protocol responsible for the synchronisation of hosts on a
TCP/IP network. NTP was developed by David Mills in 1981 at the University of Delaware.
It is required so that communicating computers share a seamless, consistent notion of time.
Features of NTP :
● NTP servers have access to highly precise atomic clocks and GPS clocks
● It uses Coordinated Universal Time (UTC) to synchronise CPU clock time.
● Minimises vulnerabilities in information-exchange communication.
● Provides consistent timekeeping for file servers
Working of NTP :
NTP works over the application layer; it uses a hierarchical system of time sources and
provides synchronisation between stratum servers. At the topmost level are highly accurate
time sources, e.g., atomic or GPS clocks. These clock sources are called stratum 0, and
they feed the NTP servers below them, called stratum 1, which in turn feed stratum 2, 3, and
so on. These servers then provide the accurate date and time so that communicating hosts
are synced to each other.
Applications of NTP :
Advantages of NTP :
Disadvantages of NTP :
● When the servers are down, time synchronisation is affected across running
communications.
● Servers spanning different time zones are prone to error, and conflicts may occur.
● There can be a minimal loss of time accuracy.
● A surge of NTP packets can disrupt synchronisation.
● Synchronisation can be manipulated (e.g., spoofed) by an attacker.
Logical Clocks refer to implementing a protocol on all machines within your distributed
system, so that the machines are able to maintain consistent ordering of events within some
virtual timespan. A logical clock is a mechanism for capturing chronological and causal
relationships in a distributed system. Distributed systems may have no physically
synchronous global clock, so a logical clock allows global ordering on events from different
processes in such systems.
Example :
When we go out, we plan in advance which place to visit first, which second, and so on; we
don't visit the second place before the first. We always follow the order planned beforehand.
In a similar way, the operations on our PCs should be done one by one, in an organised way.
Suppose we have more than 10 PCs in a distributed system and every PC is doing its own
work; how do we make them work together? The solution is the LOGICAL CLOCK.
Method-1: Physically synchronise all clocks, i.e., if one PC shows 2:00 pm then every PC
should show exactly the same time. This is practically impossible: not every clock can be
kept in sync, so we can't follow this method.
Method-2: Use a logical clock that captures the causality between events.
What is causality ?
● Taking a single PC: if two events A and B occur one after the other, then TS(A) <
TS(B). If A has a timestamp of 1, then B should have a timestamp greater than 1; only
then does the happened-before relationship hold.
● Taking two PCs, with event A in P1 (PC 1) and event B in P2 (PC 2), the condition
TS(A) < TS(B) must still hold when A causally precedes B. For example, suppose you
send a message at 2:00:00 pm and the other person receives it at 2:00:02 pm. Then
obviously TS(sender) < TS(receiver).
● Transitive Relation –
If TS(A) < TS(B) and TS(B) < TS(C), then TS(A) < TS(C).
● Causally Ordered Relation –
a→b means that a occurs before b, and any change in a can affect b.
● Concurrent Events –
Not all events occur one after another; some events happen independently of each
other, i.e., A || B.
Two events are logically concurrent if and only if the events do not causally affect
each other. In other words, ei || ej ↔ Not(ei → ej) and Not(ej → ei).
Note that logically concurrent events need not occur at the same physical time; two events
are physically concurrent iff they occur at the same physical time.
To check whether events x and y are concurrent, verify that neither (x→y) nor (y→x) holds.
How to check whether x→y holds? If there is a directed path from x to y (in the space-time
diagram) then x→y holds; otherwise it does not.
Logical Clock
Home Work
Elements of T form a partially ordered or totally ordered set over a relation < ?
Solution)
Partial order defn: a binary relation is a partial order if and only if the relation is
reflexive (R), antisymmetric (A), and transitive (T).
Transitive (T): holds.
Antisymmetric (A): if (a,b) ∈ R and (b,a) ∈ R, then a = b — does it hold?
Reflexive (R): (a,a) ∈ R ∀ a ∈ X, i.e., I ⊆ R where I is the identity relation on A —
does it hold?
Consistent: when T and C satisfy the condition that, for two events ei and ej,
(ei → ej) ⇒ (C(ei) < C(ej)).
This property is called the clock consistency condition.
Strongly consistent: when T and C satisfy the condition that, for two events ei and ej,
(ei → ej) ⇔ (C(ei) < C(ej)),
then the system of clocks is said to be strongly consistent.
i.e. it should not happen that (ei || ej) and (C(ei) < C(ej)).
Note: timestamps alone do not induce a total order; two events at different
processors can have an identical timestamp.
Tie-breaking mechanism: requests are timestamped and served according to the total order
based on these timestamps.
Let t denote the timestamp and i the identity of the process. The tuples (t, i) form a total
order, where (t1, i1) < (t2, i2)
if either
t1 < t2
or
( (t1 == t2) and i1 < i2 ).
Note: In the vector method, the case C(e1) < C(e2) with e1 || e2 cannot even arise, because
if e1 || e2 then the vectors C(e1) and C(e2) are incomparable.
Event Counting: Set the increment d to 1 always. If some event e has a timestamp t, then e
is dependent on t – 1 other events to occur. This can be called the height of event e.
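A minimal sketch of a scalar (Lamport) clock with d = 1, including the (t, i) tie-break defined
above; class and variable names are illustrative:

```python
class LamportClock:
    def __init__(self, pid):
        self.pid = pid   # process identity, used only for tie-breaking
        self.t = 0       # scalar logical time

    def internal_event(self):
        self.t += 1      # d = 1 on every local event
        return (self.t, self.pid)

    def send_event(self):
        self.t += 1
        return self.t    # timestamp piggybacked on the message

    def receive_event(self, msg_t):
        # max rule: jump past the sender's clock, then tick
        self.t = max(self.t, msg_t) + 1
        return (self.t, self.pid)

# (t, i) tuples compare lexicographically -> a total order:
p1, p2 = LamportClock(1), LamportClock(2)
m = p1.send_event()            # p1 sends at t = 1
e = p2.receive_event(m)        # p2 receives: t = max(0, 1) + 1 = 2
print(e, (p1.t, p1.pid) < e)   # (2, 2) True
```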
No Strong Consistency: Note that scalar time does not provide strong
consistency. [Strong consistency requires that ei → ej ⇔ C(ei) < C(ej).]
A counter-example suffices: refer to the timeline again and look at e33 and e42.
Limitation:
Strong consistency is not achieved because scalar time uses a single number to represent
both the logical local clock and the logical global clock. This means that the causality of
events across processes is lost: by looking at the timestamps alone, we cannot tell which
two events are causally related and which are not.
Vector Time
● vti[j] represents process pi's latest knowledge of process pj's local time.
● If vti[j]=x, then process pi knows that local time at process pj has progressed till x.
● The entire vector vti constitutes pi ’s view of the global logical time and is used to
timestamp events.
For process i, with d = 1, are the events internal to the process ordered as the sequence
1, 2, 3, 4, ...?
That is, for process i, will Vi[i] always increase as 1, 2, 3, 4, 5, ... or can values be skipped?
Answer: yes, the sequence has no gaps, since Vi[i] is incremented by exactly 1 on each
local event. Prove it in cases:
1) No communication between processes.
2) Sends only.
3) Sends and receives in the most complex order.
Using vector clocks, two vector timestamps vh and vk are compared as follows.
● vh == vk iff for all indices i, vh[i] == vk[i]
● vh <= vk iff for all indices i, vh[i] <= vk[i]
● vh < vk iff vh <= vk and there exists an index i where vh[i] < vk[i]
● vh || vk iff not(vh < vk) and not(vk < vh)
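The comparison rules above translate directly into code; a minimal sketch with vectors as
plain Python lists of equal length:

```python
def vc_leq(vh, vk):
    return all(a <= b for a, b in zip(vh, vk))

def vc_less(vh, vk):
    return vc_leq(vh, vk) and any(a < b for a, b in zip(vh, vk))

def vc_concurrent(vh, vk):
    return not vc_less(vh, vk) and not vc_less(vk, vh)

print(vc_less([1, 2, 0], [2, 2, 1]))        # True: causally precedes
print(vc_concurrent([1, 0, 0], [0, 1, 0]))  # True: incomparable -> concurrent
```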
How many events causally precede e in the distributed computation?
(Σj vh[j]) − 1; the −1 is because we count only the events preceding e, not e itself.
Note: the formula counts only causally preceding events; concurrent events are not counted.
1. Claim: if ei → ej then vi < vj. The relation ei → ej holds either within the same process, or
via a message send/receive, or because of some ez such that ei → ez → ej.
a) If both events are in the same process pk, then both vectors belong to pk and the local
component strictly increases, vi[k] < vj[k], while no component decreases; so vi < vj.
b) If ei is the send of a message at pi and ej its receive at pj, then the receive takes the
componentwise max and increments its own entry: vi[k] <= vj[k] for all k, and vi[j] < vj[j];
so vi < vj.
c) For the transitive case, a) and b) give vi < vz < vj, hence vi < vj.
Large message sizes owing to the vector being piggybacked on each message.
The message overhead grows linearly with the number of processors in the system and
when there are thousands of processors in the system, the message size becomes huge
even if there are only a few events occurring in a few processors.
Examples:
Teaser 1
Causal Message Ordering (Not Causality)
If
Send(M1) → Send(M2) [M1, M2 are messages] (the sources of M1 and M2 may
or may not be the same process)
then
every process that receives both M1 and M2 must receive M1 before M2 [irrespective of the
number of intermediaries through which the messages reach the destination].
OR
If two messages whose sends are causally ordered are sent (even by two different
processes) to a process P, then process P must receive the msgs in that same causal
order.
In short: the message which is sent first should be received first (irrespective of who is at the
destination).
Consider:
● A sends msg M1 to B.
● A sends msg M1 to C.
● B, after receiving M1, sends msg M2 to C.
At C: C receives two messages, one from A (M1) and the other from B (M2).
However, the msg sent by B to C is causally ordered after the msg sent by A to C, since
send(M1) → send(M2).
Hence, the msg sent by B to C should be received after the msg sent by A to C.
Slides solution
No. Suppose B received a msg from C before it sent a msg to me. Then B updates as
VB[D] = max(VB[D], VC[D]).
Hence we do not know whether VB[D] reflects an update from C or an update B already had
directly from D.
Teaser 2
Same as Teaser 1, but with causal message ordering.
Yes. Because if the broadcasts happen at time t, it will never happen that 4 reaches K before
5 reaches K.
Slides solution
Let us assume that B got D's news through C.
That means there were two msgs with recipient B: one is the msg from D, and another is a
msg from C due to which VB[D] = VC[D].
Now by causal order, if two msgs are intended for the same destination, then the one sent
first should reach first.
Can the msg from C to B with D’s update be sent before D’s msg to B?
No, as D had broadcast all its msgs. So C can send the update on D to B only after D has
sent a msg to B. Hence D's msg to B will reach first, and the update could not have come
through anyone else.
Teaser 3
I want to know if D’s msg has reached everyone.
Instead of receiving a msg from everyone, if I receive a msg from B such that VB[x] >= t for
all x (every element of vector VB is at least t), can I conclude that all have received the
broadcast from D?
Here t is the time of the broadcast.
Matrix Time
Continue from Here
In the system of matrix clocks, time is represented by a set of n x n matrices of
non-negative integers.
A process pi maintains a matrix mti[1..n, 1..n] where:
● mti[i, i] denotes the local logical clock of pi and tracks the progress of the computation
at process pi.
● mti[i, j] denotes the latest knowledge that process pi has about the local logical clock,
mtj[j, j], of process pj. The row mti[i, *] is the vector clock vti.
● mti[j, k] represents the knowledge that process pi has about the latest knowledge that
pj has about the local logical clock, mtk[k, k], of pk.
The entire matrix mti denotes pi's local view of the global logical time.
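A minimal sketch of matrix-clock maintenance under these rules; names are illustrative, and
the receive rule merges the sender's matrix componentwise while updating the receiver's
own row like a vector clock:

```python
import copy

class MatrixClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.m = [[0] * n for _ in range(n)]  # m[i][j]: row i's view of pj

    def tick(self):
        self.m[self.pid][self.pid] += 1       # local event: mti[i, i] += 1

    def send(self):
        self.tick()
        return copy.deepcopy(self.m)          # piggyback the whole matrix

    def receive(self, m_msg, sender):
        self.tick()
        n = len(self.m)
        for j in range(n):                    # my row: vector-clock merge
            self.m[self.pid][j] = max(self.m[self.pid][j], m_msg[sender][j])
        for i in range(n):                    # learn what others knew
            for j in range(n):
                self.m[i][j] = max(self.m[i][j], m_msg[i][j])

# min(m[k][j] for all k) >= x tells pi that every process has learnt of
# pj's events up to x (useful e.g. for discarding obsolete information).
```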
Teaser 4
Global snapshot
(Doubt in above)
Messages in Transit
For a channel Cij , the following set of messages can be defined as in transit based on the
local states of the processes pi and pj.
Transit: transit(LSi, LSj) = { mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj }
Issue 1: How to distinguish the messages to be recorded in the snapshot from those not to
be recorded.
● Any message that is sent by a process before recording its snapshot, must be
recorded in the global snapshot (from C1).
● Any message that is sent by a process after recording its snapshot, must not be
recorded in the global snapshot (from C2).
Issue 2: How to determine the instant when a process takes its snapshot.
● A process pj must record its snapshot before processing a message mij that was
sent by process pi after recording its (pi) snapshot.
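These two rules are resolved by the marker rules of the Chandy-Lamport algorithm. A
minimal runnable sketch under the FIFO-channel assumption (process and channel names
are illustrative):

```python
from collections import deque

class Proc:
    def __init__(self, pid, in_chans, out_chans, net):
        self.pid, self.net = pid, net
        self.in_chans, self.out_chans = in_chans, out_chans
        self.recorded, self.local_state = False, None
        self.chan_state = {c: [] for c in in_chans}   # in-transit msgs
        self.open = set(in_chans)                     # still recording

    def record(self):
        # Marker sending rule: record local state, then send a marker
        # on every outgoing channel before any further application sends.
        self.recorded = True
        self.local_state = f"LS({self.pid})"
        for c in self.out_chans:
            self.net[c].append("MARKER")

    def receive(self, chan):
        msg = self.net[chan].popleft()
        if msg == "MARKER":
            if not self.recorded:
                self.record()            # first marker: snapshot now
            self.open.discard(chan)      # FIFO => channel state is final
        elif self.recorded and chan in self.open:
            self.chan_state[chan].append(msg)  # white message in transit

# two processes with channels c_pq (p -> q) and c_qp (q -> p)
net = {"c_pq": deque(), "c_qp": deque()}
p = Proc("p", ["c_qp"], ["c_pq"], net)
q = Proc("q", ["c_pq"], ["c_qp"], net)
net["c_pq"].append("m1")   # p sent m1 before initiating the snapshot
p.record()                 # p initiates: marker follows m1 on c_pq
q.receive("c_pq")          # q gets m1 (pre-marker, delivered normally)
q.receive("c_pq")          # q gets the marker -> records its snapshot
print(p.local_state, q.local_state, q.chan_state)
```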
Example
Correctness
A msg sent before the marker msg can be recorded in the receiver's local snapshot or in the
channel state (as we saw above).
● Any message that is sent by a process before recording its snapshot, must be
recorded in the global snapshot (from C1).
● Any message that is sent by a process after recording its snapshot, must not be
recorded in the global snapshot (from C2).
Note:
1. The recorded global state may not correspond to any of the global states that
occurred during the computation
2. The recorded global state is a valid state in an equivalent execution.
1) Spezialetti–Kearns algorithm:
a) Combines snapshots concurrently initiated by multiple processes into a single snapshot.
b) A process needs to take only one snapshot, irrespective of the number of
concurrent initiators, and the global snapshot is not sent to all processes.
Idea: A marker carries the identifier of the initiator of the algorithm. Each process has a
variable master to keep track of the initiator of the algorithm. When a process executes the
“marker sending rule” on the receipt of its first marker, it records the initiator’s identifier
carried in the received marker in the master variable.
A process does not take a snapshot or propagate a snapshot request initiated by another
process if it has already taken a snapshot in response to some other snapshot initiation.
Snapshot recording at a process is complete after it has received a marker along each of its
channels. After every process has recorded its snapshot, the system is partitioned into as
many regions as the number of concurrent initiations of the algorithm.
C2 holds because a red message is not included in the snapshot of the recipient process
and a channel state is the difference of two sets of white messages.
C1 holds because a white message mij is included in the snapshot of process pj if pj
receives mij before taking its snapshot. Otherwise, mij is included in the state of channel Cij.
Does the Lai-Yang algorithm work in the case of non-FIFO messages?
It does. The in-transit messages are white msgs that were sent before the snapshot and
have not arrived by the time the other process has taken its snapshot. They are hence truly
in transit and can arrive in any order.
A white message carries the history of msgs sent/received before it was sent, so it does not
matter if it is reordered in the channel, before or after any other msg in the channel, as the
information it holds cannot change.
Storage?
Storage is heavy, as each process has to remember all msgs it has sent and received until
the snapshot.
Li et al.’s algorithm
Markers are tagged so as to generalize the red/white colors of the Lai–Yang algorithm to
accommodate repeated invocations of the algorithm and multiple initiators.
Correctness:
For any two processes pi and pj, the following property is satisfied:
send(mij) ∉ LSi ⇒ rec(mij) ∉ LSj
If the send of mij is not recorded in LSi then, because of causal ordering, its receive cannot
appear in LSj.
Acharya-Badrinath
Because of causal ordering, we don't have to keep track of which messages were sent and
which were received; just the counts are sufficient.
1. Each process pi maintains arrays SENTi[1, ...N] and RECDi[1, ..., N].
2. SENTi[j] is the number of messages sent by process pi to process pj.
3. RECDi [j] is the number of messages received by process pi from process pj.
4. Sent and received do not contribute to space complexity because they are used for
underlying causal ordering protocol.
5. When a process pi records its local snapshot LSi on receipt of the token, it includes
arrays RECDi and SENTi in its local state before sending the snapshot to the
initiator.
6. When the algorithm terminates, the initiator determines the state of channels as
follows:
a. The state of each channel from the initiator to each process is empty.
b. The state of the channel from process pi to process pj is the set of messages
whose sequence numbers are given by {RECDj[i] + 1, . . ., SENTi[j]}.
c. That is, if pi sent 10 msgs but pj received only 5, then msgs 6, 7, 8, 9, 10 are
still in the channel. Due to causal order, they will arrive at the receiving end
in that order (see the sketch below).
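A minimal sketch of step 6 at the initiator, computing each channel state from the collected
SENT/RECD counters; the array shapes and example numbers are illustrative:

```python
def channel_states(SENT, RECD, initiator):
    """Initiator's step 6: recover channel states from counters (sketch).

    SENT[i][j]: msgs sent by pi to pj; RECD[j][i]: msgs received by pj
    from pi, as recorded in the local snapshots. Returns, per channel
    (i -> j), the sequence numbers of messages still in transit.
    """
    n = len(SENT)
    state = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if i == initiator:
                state[(i, j)] = []  # channels out of the initiator are empty
            else:
                # msgs RECD[j][i]+1 .. SENT[i][j] were sent but not received
                state[(i, j)] = list(range(RECD[j][i] + 1, SENT[i][j] + 1))
    return state

# p0 sent 10 msgs to p1, p1 received only 5 -> 6..10 are in transit
SENT = [[0, 10, 0], [3, 0, 0], [0, 0, 0]]
RECD = [[0, 3, 0], [5, 0, 0], [0, 0, 0]]
print(channel_states(SENT, RECD, initiator=2)[(0, 1)])  # [6, 7, 8, 9, 10]
```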
Correctness Proof:
Let a message mij be such that rec(tokeni) → send(mij). Clearly, send(tokenj) →
send(mij) and the sequence number of mij is greater than SENTi[j]. Therefore,
mij is not recorded in SCij. Thus, send(mij) ∉ LSi ⇒ mij ∉ SCij. This in
conjunction with property P1 implies that the algorithm satisfies condition C2.
Consider a message mij which is the kth message from process pi to process pj
before pi takes its snapshot. The two possibilities below imply that condition C1
is satisfied:
● Process pj receives mij before taking its snapshot. In this case, mij is recorded in pj's
snapshot.
● Otherwise, RECDj[i] ≤ k ≤ SENTi[j] and the message mij will be included in the state
of channel Cij.
Complexity:
This algorithm requires 2n messages and 2 time units for recording and assembling the
snapshot, where one time unit is the time required for the delivery of a message. If the
contents of the messages in the channel states are required, the algorithm needs an
additional 2n messages and 2 time units.
Alagar–Venkatesan algorithm
A message is referred to as old if the send of the message causally precedes the send of the
token. Otherwise, the message is referred to as new. Whether a message is new or old can
be determined by examining the vector timestamp in the message, which is needed to
enforce causal ordering among messages.
1. When a process receives the token, it takes its snapshot, initializes the state of all
channels to empty, and returns a Done message to the initiator. From then on, the
process includes a message received on a channel in the channel state only if it is
an old message.
2. After the initiator has received Done message from all processes, it broadcasts a
Terminate message.
3. A process stops the snapshot algorithm after receiving a Terminate message.
Correctness proof:
An interesting observation is that a process receives all the old messages in its
incoming channels before it receives the Terminate message. This is ensured by
the underlying causal message delivery property. The causal ordering property
ensures that no new message is delivered to a process prior to the token and
only old messages are recorded in the channel states. Thus, send(mij) ∉ LSi ⇒
mij ∉ SCij. This together with property P1 implies that condition C2 is satisfied.
Condition C1 is satisfied because each old message mij is delivered either before
the token is delivered or before the Terminate is delivered to a process and thus
gets recorded in LSi or SCij, respectively.
Exercise 4.2 What good is a distributed snapshot when the system was never in
the state represented by the distributed snapshot? Give an application of distributed
snapshots.
Exercise 4.3 Consider a distributed system where every node has its physical clock
and all physical clocks are perfectly synchronized. Give an algorithm to record global
state assuming the communication network is reliable. (Note that your algorithm
should be simpler than the Chandy–Lamport algorithm.)
Important consideration:
A process wishing to enter the CS requests all other processes, or a subset of them, by
sending REQUEST messages, and waits for appropriate replies before entering the CS.
While waiting, the process is not allowed to make further requests to enter the CS: in the
"requesting the CS" state, the site is blocked and cannot make further requests for the CS.
Performance metrics
Message complexity: # messages per CS execution.
Synchronisation delay (SD): T(site A enters CS) − T(site B leaves CS).
Response time: T(CS execution complete) − T(CS request made); this time does not include
the time a request waits at a site before its request messages have been sent out.
System throughput: 1 / (SD + E), where E is the average CS execution time.
Performance
● For each CS execution, Lamport’s algorithm requires (N − 1) REQUEST
messages, (N − 1) REPLY messages, and (N − 1) RELEASE messages. i.e
3(N − 1) messages per CS invocation.
● Synchronisation delay in the algorithm is T .
Optimisation:
Explanation:
Sj gets a REQUEST from Si and sends a REPLY when:
1. Sj is not in the CS, and
2. Sj is not requesting the CS, or
3. Sj is requesting but TS(Sj) > TS(Si).
Else it defers the reply (sketched below).
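The deferral rule translates into a small predicate; a minimal sketch where the timestamps
are the (clock, site_id) tuples from the total order defined earlier:

```python
def should_reply(sj, si_request_ts):
    """Ricart-Agrawala reply rule at site Sj (minimal sketch).

    sj: dict with keys 'in_cs' (bool), 'requesting' (bool),
        'req_ts' ((clock, site_id) tuple, valid when requesting).
    si_request_ts: (clock, site_id) of the incoming REQUEST.
    Returns True to send REPLY now, False to defer it.
    """
    if sj["in_cs"]:
        return False                      # executing CS: defer
    if not sj["requesting"]:
        return True                       # idle: reply immediately
    # both requesting: lower (clock, id) tuple has priority
    return sj["req_ts"] > si_request_ts   # Si has priority -> reply

sj = {"in_cs": False, "requesting": True, "req_ts": (7, 2)}
print(should_reply(sj, (5, 1)))   # True: (5, 1) < (7, 2), Si goes first
print(should_reply(sj, (9, 1)))   # False: Sj's own request has priority
```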
Notes:
● When a site receives a message, it updates its clock using the timestamp in the
message.
● When a site takes up a request for the CS for processing, it updates its local clock
and assigns a timestamp to the request.
In the Ricart–Agrawala algorithm, for every requesting pair of sites, the site with the higher
priority request will always defer the request of the lower priority site. At any time, only the
highest priority request succeeds in getting all the needed REPLY messages.
Performance
● For each CS execution, Ricart-Agrawala algorithm requires (N − 1)
REQUEST messages and (N − 1) REPLY messages. i.e. 2(N − 1) messages
per CS execution.
● Synchronisation delay in the algorithm is T .(max. message transmission time)
Do I need a heap? No.
Will it work for non-FIFO channels? Yes.
Will it fail in the case of re-entering the CS? No; every CS request is treated as a new
request.
What if I am done with my CS and want to re-enter CS... Do I need to release CS and restart
with sending requests or can I be smarter ?
Once Si has received a REPLY from Sj, it does not need to send a REQUEST to Sj again to
re-enter the CS, unless it has already sent a REPLY to Sj after the first CS (in response to a
REQUEST from Sj).
Message complexity is 0 to 2(n – 1) depending on the request pattern; the worst-case
message complexity stays the same.
No starvation: the second CS entry is concurrent with the request of Pj, so the ordering
between them can go either way; still, no two processes are in the CS at the same time and
there is no starvation.
OR
Condition M3: the request sets of all sites must be of equal size
(all sites have to do an equal amount of work to invoke mutual exclusion).
Condition M4: each site is contained in exactly the same number of request sets
(all sites have "equal responsibility" in granting permission to other sites).
Message Complexity: 3√N, i.e.,
√N REQUEST, √N REPLY, and √N RELEASE messages.
Synchronisation delay = 2 × (max message transmission time)
[since a RELEASE must first reach the arbiter site, and only then can a REPLY go out for
the next queued request].
Major problem: DEADLOCK possible
Need three more types of messages (FAILED, INQUIRE, YIELD) to handle deadlock.
Message complexity can be 5*√N
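One simple way to construct request sets with the pairwise-intersection property Maekawa
needs is a grid quorum; a sketch (Maekawa's original construction uses finite projective
planes to get closer to the √N bound):

```python
import math

def grid_quorum(site, n):
    """Request set of `site` = its row plus its column in a √n x √n grid.

    Any two such sets intersect, which is what mutual exclusion needs.
    Assumes n is a perfect square; the set size is 2*sqrt(n) - 1.
    """
    k = int(math.isqrt(n))
    r, c = divmod(site, k)
    row = {r * k + j for j in range(k)}
    col = {i * k + c for i in range(k)}
    return row | col

q3, q10 = grid_quorum(3, 16), grid_quorum(10, 16)
print(sorted(q3), sorted(q10), q3 & q10)  # intersection is non-empty
```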
Show deadlock
Maekawa 2
Handling deadlock:
A FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request
because it has currently granted permission to a site with a higher priority request.
_________________________________________________________________________
OR
_________________________________________________________________________
Solve Problem
Token-Based Algorithms
Suzuki–Kasami’s broadcast algorithm
1) if a site that wants to enter the CS does not have the token, it broadcasts a
REQUEST message for the token to all other sites.
2) A site that possesses the token sends it to the requesting site upon the receipt of its
REQUEST message.
a) If a site receives a REQUEST message when it is executing the CS, it sends
the token only after it has completed the execution of the CS.
Solutions:
1) Outdated requests: we add a sequence number to the request, i.e., REQUEST(j) is
changed to REQUEST(j, n), where n is the sequence number of site Sj's current request;
any REQUEST(j, p) with p ≤ the latest known sequence number is stale. To check whether
a request is old, each site Si maintains RNi[1, ..., N], where RNi[j] denotes the largest
sequence number received so far in a REQUEST message from site Sj. When site Si
receives a REQUEST(j, n) message, it sets RNi[j] = max(RNi[j], n); the request is outdated
if RNi[j] > n.
2) Pending (outstanding) requests: rather than maintaining a queue at every site, we
maintain a queue Q of sites requesting the CS in the token itself, together with an array of
integers LN[1, ..., N], where LN[j] is the sequence number of the request that site Sj
executed most recently. After executing its CS, a site Si updates LN[i] (present in the
token) := RNi[i] to indicate that its request corresponding to sequence number RNi[i] has
been executed.
The token array LN[1, ..., N] permits a site to determine whether some site has an
outstanding request for the CS. Note that at site Si, if RNi[j] = LN[j] + 1, then site Sj is
currently requesting the token.
After executing the CS, a site checks this condition for all j to determine all the sites that are
requesting the token, and places their ids in queue Q if they are not already present in Q.
Finally, the site sends the token to the site whose id is at the head of Q.
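A minimal sketch of these rules at a site; broadcast and send_token are assumed transport
callbacks, and the structure (not the exact bookkeeping of the original paper) is what
matters:

```python
class SuzukiKasamiSite:
    def __init__(self, sid, n, has_token=False):
        self.sid, self.n = sid, n
        self.RN = [0] * n                          # highest request seq seen
        self.token = {"Q": [], "LN": [0] * n} if has_token else None
        self.in_cs = False

    def request_cs(self, broadcast):
        if self.token:                             # already hold the token
            self.in_cs = True
            return
        self.RN[self.sid] += 1
        broadcast(("REQUEST", self.sid, self.RN[self.sid]))

    def on_request(self, j, seq, send_token):
        self.RN[j] = max(self.RN[j], seq)          # stale if RN[j] > seq
        # idle token holder passes the token if Sj's request is outstanding
        if self.token and not self.in_cs and self.RN[j] == self.token["LN"][j] + 1:
            tok, self.token = self.token, None
            send_token(j, tok)

    def release_cs(self, send_token):
        self.in_cs = False
        tok = self.token
        tok["LN"][self.sid] = self.RN[self.sid]    # my request is served
        for j in range(self.n):                    # enqueue outstanding sites
            if self.RN[j] == tok["LN"][j] + 1 and j not in tok["Q"]:
                tok["Q"].append(j)
        if tok["Q"]:
            nxt = tok["Q"].pop(0)
            self.token = None
            send_token(nxt, tok)

msgs = []
s0 = SuzukiKasamiSite(0, 2, has_token=True)
s1 = SuzukiKasamiSite(1, 2)
s1.request_cs(lambda m: msgs.append(m))            # S1 broadcasts REQUEST(1, 1)
s0.on_request(1, 1, lambda j, t: print("token ->", j, t))
```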
Correctness
Mutual exclusion is guaranteed because there is only one token in the system
and a site holds the token during the CS execution.
A requesting site enters the CS in finite time.
Proof: token request messages of a site Si reach other sites in finite time. Since
one of these sites will have the token in finite time, site Si's request will be placed in
the token queue in finite time. Since there can be at most N − 1 requests in front
of this request in the token queue, site Si will get the token and execute the CS
in finite time.
Performance
Show no starvation
- A process j that is interested broadcasts its request. If the request reaches Pk
while the token is with Pk, then the token's queue is updated with the request.
Otherwise, eventually one of the processes that has received j's request will get
the token. So eventually process j will make it into the token's queue, and with
at most N − 1 requests ahead of it, it will enter the CS.
Assumption:
1) The underlying network guarantees message delivery. (delays allowed)
2) All nodes are reliable.
Here a node needs the privilege to enter the CS; the privilege is exchanged in the form of a
privilege message.
ME holds due to the single token (privilege).
Deadlock is impossible.
Starvation is impossible.
Cost and Performance Analysis
Chapter 9 Exercise:
9.1) Consider the following simple method to enforce mutual exclusion: all
sites are arranged in a logical ring fashion and a unique token circulates around the
ring, hopping from one site to another. When a site needs to execute its CS, it waits
for the token, grabs the token, executes the CS, and then dispatches the token to the
next site on the ring. If a site does not need the token on its arrival, it immediately
dispatches the token to the next site (in zero time).
1) What is the response time when the load is low?
When the load is low, the response time will be small as the time taken for the token to reach
a site that needs to execute its critical section (CS) will be minimal and the waiting time for
the token to be available will also be low. The response time will be T + E, as the time
taken for the token to reach the site that needs to execute its critical section (CS) will be T,
and the time for executing the CS will be E.
When the load is heavy, the response time will increase as the token will need to be passed
through multiple sites before reaching the site that needs to execute its CS. This will result in
increased waiting time and also the time taken for the token to circulate around the ring will
be longer. The response time will be roughly (N-1) * T + E, as the token will need to pass
through all the N-1 sites before reaching the site that needs to execute its CS, resulting in a
total delay of (N-1) * T, and the time for executing the CS will still be E. However, this is an
approximate calculation and the actual response time can vary based on the specific
implementation and system characteristics.
9.2) In Lamport’s algorithm, condition L1 can hold concurrently at several sites. Why
do we need this condition for guaranteeing mutual exclusion?
In Lamport's algorithm, condition L1 (site Si has received a message with a timestamp
larger than its own request's timestamp from every other site) can indeed hold at several
sites concurrently. The condition is still needed: because channels are FIFO and clocks
respect causality, receiving a larger-timestamped message from every other site guarantees
that Si has already learned about every request with a smaller timestamp than its own, i.e.,
no higher-priority request is still in transit.
Mutual exclusion then follows from L1 together with L2 (Si's own request is at the top of its
request_queue): even if L1 holds at several sites, only one site can simultaneously have the
smallest-timestamped request at the top of its queue, so no two sites can enter the CS at
the same time.
9.3) Show that in Lamport’s algorithm if a site Si is executing the critical section, then
Si ’s request need not be at the top of the request_queue at another site Sj.
In Lamport's algorithm, site Si enters the CS when L1 (Si has received a message with a
timestamp larger than its request's timestamp from every other site) and L2 (Si's request is
at the top of its own request_queue) hold at Si. Neither condition requires Si's request to be
at the top of the request_queue at another site Sj.
The message that satisfies L1 with respect to Sj need not be a REPLY to Si's request; it can
be any message with a larger timestamp (for example, a REQUEST or RELEASE that Sj
sent earlier). So Si can enter the CS while its own REQUEST message is still in transit to
Sj; in that case Si's request is not in Sj's queue at all. Similarly, Sj's queue may still hold
older requests whose RELEASE messages are in transit, so Si's request need not be at the
top of Sj's queue even when it has arrived there.
Hence, a site's entry into the CS is determined by its local conditions L1 and L2, not by its
position in the request queues of other sites.
9.4) What is the purpose of a REPLY message in Lamport’s algorithm? Note that it is
not necessary that a site must always return a REPLY message in response to a
REQUEST message. State the condition under which a site does not have to return
REPLY message. Also, give the new message complexity per critical section
execution in this case.
The purpose of a REPLY from Sj is to let Si satisfy condition L1, i.e., to assure Si that Sj has
no request with a smaller timestamp still in transit. A site Sj does not have to return a
REPLY to Si's REQUEST if Sj has already sent some message with a timestamp larger than
Si's request timestamp, for example its own REQUEST: that message serves the same
purpose.
In this case the message complexity per CS execution is reduced, since some REPLY
messages are omitted: it drops from 3(N − 1) to between 2(N − 1) and 3(N − 1), depending
on how many REPLYs can be saved.
9.5) Show that in the Ricart–Agrawala algorithm the critical section is accessed in
increasing order of timestamp. Does the same hold in Maekawa’s algorithm?
In the Ricart-Agrawala algorithm, when two sites request concurrently, the site with the
lower timestamp defers its REPLY to the higher-timestamped site until it has finished its own
CS, while the higher-timestamped site replies immediately to the lower-timestamped one.
Hence, at any time, only the lowest-timestamped outstanding request can collect all N − 1
REPLY messages: the critical section is entered in increasing order of timestamp.
The same does not hold in Maekawa's algorithm. There, each site in a request set grants its
permission to the first REQUEST that reaches it and queues later ones, so the order of
granting depends on message arrival order rather than on timestamps. A request with a
larger timestamp can lock all members of its request set before an earlier-timestamped
request reaches them, and thus enter the CS first. (Timestamps are used only for deadlock
handling, via the FAILED/INQUIRE/YIELD messages.)
The centralized mutual exclusion algorithm is a simple solution to achieving mutual exclusion
in a distributed system, but it has some limitations. In this algorithm, a site sends a request
to the site that contains the shared resource, which is responsible for executing the requests
and ensuring mutual exclusion.
However, this approach has several drawbacks. Firstly, the site that contains the shared
resource becomes a single point of failure, as all other sites depend on it to access the
shared resource. This means that if this site fails, the entire system fails. Secondly, this
approach does not scale well, as the site that contains the shared resource becomes a
bottleneck as the number of sites increases.
Lamport's mutual exclusion algorithm was proposed as a solution to these limitations. It does
not rely on a centralized site, but instead uses a distributed algorithm to achieve mutual
exclusion. The algorithm uses a combination of REQUEST, REPLY, and RELEASE
messages to coordinate access to the shared resource.
Although this algorithm requires more messages than the centralized algorithm (3(N − 1)
per CS execution versus 3), it has several advantages. Firstly, it is decentralized, so there is
no single point of failure and no single bottleneck site. Secondly, the work of granting
permission is spread evenly across all sites instead of being concentrated at one. Finally, it
provides a fair solution to mutual exclusion in a distributed system, since requests are
served in the order of their timestamps rather than at the discretion of a single coordinator.
9.7) Show that in Lamport’s algorithm the critical section is accessed in increasing
order of timestamp.
In Lamport's algorithm, the critical section is accessed in increasing order of timestamp.
Each site assigns a totally ordered timestamp (t, i) to its REQUEST based on its local clock,
and request queues are ordered by these timestamps. A site enters the CS only when its
request is at the top of its own queue (L2) and it has received a larger-timestamped
message from every other site (L1), which guarantees that no request with a smaller
timestamp is still outstanding or in transit.
Thus the site with the lowest timestamp is granted access to the critical section first,
followed by the site with the next lowest timestamp, and so on: Lamport's algorithm serves
CS requests in a mutually exclusive manner and in increasing order of timestamp, avoiding
conflicts or race conditions on the shared resource.
9.8) Show by examples that the staircase configuration among sites is preserved in
Singhal's dynamic mutual exclusion algorithm when two or more sites request the CS
concurrently and have executed the CSs.
The staircase configuration among sites is a key property of Singhal's dynamic mutual
exclusion algorithm. This configuration refers to the order in which sites access the critical
section (CS) and is maintained even when two or more sites request the CS concurrently
and have executed their respective CSs.
Here's an example to illustrate this property:
Suppose there are three sites, A, B, and C, and their current timestamps are 10, 20, and 30
respectively.
1. Site A requests the CS, sending a REQUEST message with timestamp 10.
2. Site B requests the CS, sending a REQUEST message with timestamp 20.
3. Site C requests the CS, sending a REQUEST message with timestamp 30.
In Singhal's algorithm, the site with the lowest timestamp value will be granted access to the
CS first. In this case, site A with timestamp 10 will be granted access to the CS first, followed
by site B with timestamp 20, and finally site C with timestamp 30.
This is an example of the staircase configuration among sites being preserved when
multiple sites request the CS concurrently: the order of accessing the CS is A, B, C.
Similarly, with four sites D, E, F, and G whose current timestamps are 5, 10, 15, and 20
respectively, access is granted in the order D, E, F, G. In both cases the staircase
configuration among sites is preserved, even when multiple sites request the CS
concurrently and have executed their respective CSs.
To prove the correctness of the algorithm, we show that a recorded snapshot satisfies conditions C1
and C2. Since a process records its snapshot when it receives the first marker on any incoming
channel, no messages that follow markers on the channels incoming to it are recorded in the process’s
snapshot.
Once a process has received the first marker on some channel, it does not include any
further incoming messages on that channel in its snapshot. And this is exactly what should
happen, because any message sent after the marker msg must not be included in the
snapshot. This works due to FIFO.
Complexity
The recording part of a single instance of the algorithm requires O(e) messages and O(d) time, where
e is the number of edges in the network and d is the diameter of the network.
Safety (no false deadlocks): The algorithm should not report deadlocks that do not exist (called
phantom or false deadlocks).
Edge-chasing algorithms: a probe message is used to detect deadlock. If a process sees its
own probe again, there is a deadlock. Only blocked processes forward the probe. Probe
msgs are short and of small size. Example: Mitchell-Merritt.
Diffusing-computation-based algorithms: use echoes to detect deadlock. The initiator
detects the deadlock; the WFG is not explicitly built. A process cannot send a reply until it
gets replies for the queries it sent. Example: Chandy-Misra-Haas.
(i) the algorithm should not cause the underlying computation to freeze.
Mitchell-Merritt for the single-resource model (no phantom deadlocks; all deadlocks detected)
Whenever a process receives a probe which is less than its public label, then it simply ignores that
probe.
Detect means that the probe with the private label of some process has returned to it, indicating a
deadlock.
Message Complexity:
The worst-case complexity of the algorithm is s(s - 1)/2 Transmit steps, where s is the number of
processes in the cycle.
Lemma 10.1: For any process u/v, if u > v, then u was set by a Transmit step.
2. Suppose a node has added the last edge of a cycle and hence started a transmit. Can its
private and public labels change by the time the transmit reaches it again?
The node added last has the highest public and private values in the cycle.
4. Can another node also have your public value even though it is not in a cycle? (doubt)
Yes:
P1 → P2 (due to transmit) → P3
5. In case of a cycle, the process that detects the deadlock will have the highest id in the cycle.
T/F?
True, since the last node has the highest value, and it is the one who detects a cycle.
6. Can a process ignore a transmit probe if the value is lesser than its own.
Whenever a process receives a probe which is less than its public label, then it simply ignores that
probe.
H.W.: Is the last process to block the same as the one that detects the deadlock?
Proof of correctness:
Using priority:
Working:
If u > v: no transmission.
If u < v: should transmit. Let the new public tuple of the blocked node be (a, b). Then
a = v, because transmission should happen, and
b should be the lowest priority in the transmitted chain, so that on transmission around a
cycle the lowest-priority process finds its own public label and priority number and aborts.
If u == v then, if p > q, transmission should continue so that the lower-priority process
aborts. Because p > e > d > c > b > a.
Chandy–Misra–Haas algorithm for the AND model
Performance analysis
In the algorithm, one probe message (per deadlock detection initiation) is sent on every edge of
the WFG which connects processes on two sites. Thus, the algorithm exchanges at most m(n −
1)/2 messages to detect a deadlock that involves m processes and spans over n sites. The size of
messages is fixed and is very small (only three integer words). The delay in detecting a deadlock
is O(n)
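The probe mechanics behind these numbers can be sketched compactly; a minimal
centralised simulation of the AND-model rules (the real algorithm sends probes as
messages between sites; the Proc fields and the example cycle are illustrative):

```python
from collections import deque

class Proc:
    def __init__(self):
        self.blocked = True
        self.waiting_for = set()  # local wait-for edges
        self.dependent = set()    # initiators whose probes passed through

def detect(procs, initiator):
    """Chandy-Misra-Haas AND-model probes from `initiator` (sketch)."""
    q = deque((initiator, initiator, m) for m in procs[initiator].waiting_for)
    while q:
        i, j, k = q.popleft()     # probe(initiator, sender, receiver)
        pk = procs[k]
        if not pk.blocked:
            continue              # active process discards the probe
        if k == i:
            return True           # own probe returned: deadlock
        if i not in pk.dependent:
            pk.dependent.add(i)   # remember i depends on me; no re-forward
            q.extend((i, k, m) for m in pk.waiting_for)
    return False

# wait-for cycle P0 -> P1 -> P2 -> P0
procs = [Proc(), Proc(), Proc()]
procs[0].waiting_for = {1}; procs[1].waiting_for = {2}; procs[2].waiting_for = {0}
print(detect(procs, 0))   # True: P0's probe comes back to P0
```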
Chandy–Misra–Haas algorithm for the OR model
Performance analysis
For every deadlock detection, the algorithm exchanges e query messages and e reply messages,
where e = n(n − 1) is the number of edges.
( c ) here means condition of (f)
Class questions:
Before Chandy-Misra-Haas
● What happens if there is no deadlock?
● How will Pi conclude that there is no deadlock?
● Something needs to be done to reset the dependency vector values of a future probe
i. What can be done?
● If a process is deadlocked because it is waiting on a cycle then it wont get probe
back though deadlocked- true/false?
● After the probe passes, is it not possible that the edges are removed and hence there
is truly no cycle even though the probe msg returns? That is, is a phantom deadlock
detected?
● Consider the reverse graph(all edges reversed). Now if we run the algorithm is it
possible that we detect a false deadlock as edges may change.
● Number of messages? m processes and n sites.
Exercise question:
Exercise 10.1 Consider the following simple approach to handle deadlocks in distributed
systems by using “time-outs”: a process that has waited for a specified period for a resource
declares that it is deadlocked and aborts to resolve the deadlock. What are the shortcomings of
using this method?
Exercise 10.2 Suppose all the processes in the system are assigned priorities which can be used
to totally order the processes. Modify Chandy et al.’s algorithm for the AND model so that when
a process detects a deadlock, it also knows the lowest priority deadlocked process.
Exercise 10.3 Show that, in the AND model, false deadlocks can occur due to deadlock
resolution in distributed systems [43]. Can something be done about it, or are they bound to
happen?
Detecting a phantom deadlock means you think there is a deadlock when there isn't one.
Do a dry run of both Chandy algorithms.
Termination Detection
Condition:
1. Execution of a TD algorithm cannot indefinitely delay the underlying computation;
that is, execution of the termination detection algorithm must not freeze the underlying
computation.
2. The termination detection algorithm must not require the addition of new communication
channels between processes.
Termination Detection by Weight Throwing
Correctness of Algorithm
Write the delay and all other info.
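A minimal sketch of termination detection by weight throwing: the controlling agent starts
with total weight 1, split with the first active process; Fraction avoids rounding issues, and
all names are illustrative:

```python
from fractions import Fraction

class WTProcess:
    def __init__(self, weight=Fraction(0)):
        self.weight = weight
        self.active = weight > 0

    def send(self, payload):
        # split my weight: half travels with the message
        w = self.weight / 2
        self.weight -= w
        return (payload, w)

    def receive(self, msg):
        payload, w = msg
        self.weight += w              # absorb the message's weight
        self.active = True

    def go_idle(self, controller):
        controller.collect(self.weight)  # return all my weight
        self.weight = Fraction(0)
        self.active = False

class Controller:
    def __init__(self):
        self.weight = Fraction(1, 2)  # other half given to first process
    def collect(self, w):
        self.weight += w
        if self.weight == 1:          # all weight back -> no activity left
            print("computation terminated")

ctrl = Controller()
p, q = WTProcess(Fraction(1, 2)), WTProcess()
m = p.send("work")   # p keeps 1/4, message carries 1/4
q.receive(m)         # q becomes active with weight 1/4
p.go_idle(ctrl)      # controller: 1/2 + 1/4 = 3/4
q.go_idle(ctrl)      # controller: 3/4 + 1/4 = 1 -> terminated
```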
Exercise:
Exercise 7.1 Huang’s termination detection algorithm could be redesigned using a counter to
avoid the need of splitting weights. Present an algorithm for termination detection that uses
counters instead of weights.
Exercise 7.2 Design a termination detection algorithm that is based on the concept of weight
throwing and is tolerant to message losses. Assume that processes do not crash.
Exercise 7.3 Termination detection algorithms assume that an idle process can only be
activated on the reception of a message. Consider a system where an idle process can
become active spontaneously without receiving a message. Do you think a termination
detection algorithm can be designed for such a system? Give reasons for your answer.
Exercise 7.4 Design an efficient termination detection algorithm for a system where the
communication delay is zero.
Exercise 7.5 Design an efficient termination detection algorithm for a system where the
computation at a process is instantaneous (that is, all processes are always in the idle
state).
"An Efficient Causal Order Algorithm for Message Delivery in Distributed System" is a
research paper that proposes a new algorithm for achieving causal message ordering in
distributed systems. The authors of the paper are Jangt, Park, Cho, and Yoon.
In a distributed system, messages are sent between nodes, and the order in which these
messages are received can impact the correctness of the system. Causal ordering is a type
of ordering that preserves causality between events. Specifically, if event A causes event B,
then any message that carries information about event A should be received before any
message that carries information about event B.
The algorithm proposed in the paper is based on vector clocks, which are used to track the
causal relationships between events in a distributed system. Each node maintains a vector
clock, which is a vector of integers that represents the node's current knowledge of the state
of the system. When a node sends a message, it attaches its current vector clock to the
message.
The receiving node uses the vector clock to determine whether the message should be
delivered immediately or held until other messages are received. If the vector clock indicates
that the message depends on other messages that have not yet been received, the
message is held until those messages arrive.
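The delivery test this paragraph describes is the standard vector-clock condition (in the
style of Birman-Schiper-Stephenson broadcasts, not the paper's specific optimisation); a
minimal sketch:

```python
def can_deliver(msg_vc, sender, local_vc):
    """Vector-clock causal delivery test at the receiver (minimal sketch).

    Deliver a message from `sender` iff it is the next message we expect
    from that sender and we have already seen everything it depends on.
    """
    n = len(local_vc)
    return (msg_vc[sender] == local_vc[sender] + 1 and
            all(msg_vc[k] <= local_vc[k] for k in range(n) if k != sender))

# receiver has seen 1 event from p0 and none from p1
local = [1, 0]
print(can_deliver([2, 0], sender=0, local_vc=local))  # True: next from p0
print(can_deliver([2, 1], sender=1, local_vc=local))  # False: depends on
# an unseen p0 event (msg_vc[0] = 2 > local[0] = 1), so it is buffered
```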
The authors of the paper demonstrate that their algorithm is more efficient than previous
causal ordering algorithms. Specifically, they show that their algorithm reduces the number
of messages that need to be held for delivery, which reduces message latency and improves
system performance.
Overall, the paper presents a new algorithm for achieving causal message ordering in
distributed systems that is more efficient than previous algorithms. The algorithm is based on
vector clocks and reduces message latency, which improves system performance.
_________________________________________________________________________
__
Shreyash
In regard to the Chandy-Misra-Haas algorithm, answer the following questions with proper
reasons:
1. What happens if there is no deadlock?
2. How will Pi conclude that there is no deadlock?
3. Something needs to be done to reset the dependency vector values of a future probe i.
What can be done?
4. If a process is deadlocked because it is waiting on a cycle, then it won't get its probe back
though deadlocked - true/false?
5. After the probe passes, is it not possible that the edges are removed and hence there is
truly no cycle even though the probe msg returns? That is, is a phantom deadlock detected?
1. If there is no deadlock, the probes simply die out: every probe eventually reaches a
process that is not blocked, and active processes discard probes. The algorithm takes
no action and does not disturb the underlying computation.
2. Pi never receives an explicit "no deadlock" confirmation; it concludes that there is no
deadlock from the absence of a returned probe. A probe (i, j, k) is forwarded only by
blocked processes and declares deadlock only when it returns with k = i, so if Pi's
probe never comes back (within a suitable waiting period), no deadlock involving Pi
was found.
3. The dependent(i) flags set by one probe would make blocked processes discard a
future probe from the same initiator. One possible approach is to tag each probe with
an initiation sequence number, so that flags set by an earlier initiation do not suppress
a later one; alternatively, a process can clear its dependent flags when it unblocks.
4. True. Deadlock is declared only when the initiator's own probe returns to it. If Pi is
deadlocked because it waits on a cycle that it is not itself part of, its probe circulates
in the cycle (and eventually dies out via the dependent flags) but never returns to Pi,
so Pi does not detect the deadlock even though it is deadlocked.
5. In the AND model this cannot happen: once a cycle exists, every process on it waits
for a grant that can only come from another blocked process on the cycle, so no edge
of the cycle can be removed (assuming no process aborts or spontaneously releases
resources). Hence, if the probe returns, the cycle still exists and no phantom deadlock
is reported. If deadlock resolution or aborts are allowed to remove edges while probes
are in flight, phantom deadlocks become possible (see Exercise 10.3).