Chapter 5: Terminology and Basic Algorithms
Ajay Kshemkalyani and Mukesh Singhal
Distributed Computing: Principles, Algorithms, and Systems
Cambridge University Press
Topology Abstraction and Overlays
System: undirected (weighted) graph (N, L), where n = |N|, l = |L|
Physical topology
  - Nodes: network nodes, routers, all end hosts (whether participating or not)
  - Edges: all LAN and WAN links, direct edges between end hosts
  - E.g., the Fig. 5.1(a) topology plus all the routers and links in the WANs
Logical topology (application context)
  - Nodes: the end hosts where the application executes
  - Edges: logical channels among these nodes
  - All-to-all fully connected (e.g., Fig. 5.1(b)), or any subgraph thereof, e.g., a neighborhood view (Fig. 5.1(a)): a partial system view that needs multi-hop paths but is easy to maintain
Superimposed topology (a.k.a. topology overlay):
  - superimposed on the logical topology
  - e.g., ring, tree, mesh, hypercube
  - Goal: efficient information gathering, distribution, or search (as in P2P overlays)
Topology Abstractions
[Figure: (a) physical topology - participating process(or)s interconnected through WANs; (b) logical all-to-all topology among the participating processes over a WAN or other network]
Figure 5.1: Example topology views at different levels of abstraction
Classifications and Basic Concepts (1)
Application execution vs. control algorithm execution, each with its own events
  - Control algorithm:
    - for monitoring and auxiliary functions, e.g., creating a spanning tree (ST), an MIS, a CDS, reaching consensus, global state detection (deadlock, termination, etc.), checkpointing
    - superimposed on the application execution, but does not interfere with it
    - its send, receive, and internal events are transparent to the application execution
    - a.k.a. protocol
Centralized and distributed algorithms
  - Centralized: asymmetric roles; client-server configuration; processing and bandwidth bottleneck; single point of failure
  - Distributed: more balanced roles of nodes; difficult to design perfectly distributed algorithms (e.g., snapshot algorithms, tree-based algorithms)
Symmetric and asymmetric algorithms
Classifications and Basic Concepts (2)
Anonymous algorithm: process ids or processor ids are not used to make any execution (run-time) decisions
  - Structurally elegant but hard, or impossible, to design; e.g., anonymous leader election is impossible
Uniform algorithm: cannot use n, the number of processes, as a parameter in the code
  - Allows scalability; process leave/join is easy and only the neighbors need to be aware of logical topology changes
Adaptive algorithm: let k (≤ n) be the number of processes participating in the context of a problem X when X is being executed. The complexity should be expressible as a function of k, not n.
  - E.g., mutual exclusion: critical section contention overhead expressible in terms of the number of processes contending at this time (k)
Classifications and Basic Concepts (3)
Deterministic vs. nondeterministic executions
  - A nondeterministic execution contains at least one nondeterministic receive; a deterministic execution has no nondeterministic receive
    - Nondeterministic receive: can receive a message from any source
    - Deterministic receive: the source is specified
  - Nondeterministic executions are difficult to reason about
  - Asynchronous system: re-execution of a deterministic program will produce the same partial order on events (used in debugging, unstable predicate detection, etc.)
  - Asynchronous system: re-execution of a nondeterministic program may produce a different partial order (unbounded delivery times and unpredictable congestion, variable local CPU scheduling delays)
Classification and Basic Concepts (4)
Execution inhibition (a.k.a. freezing)
  - Protocols that require suspension of normal execution until some stipulated operations occur are inhibitory
  - Concept: different from blocking vs. nonblocking primitives
  - Analyze the inhibitory impact of the control algorithm on the underlying execution
  - Classification 1:
    - Non-inhibitory protocol: no event is disabled in any execution
    - Locally inhibitory protocol: in any execution, any delayed event is a locally delayed event, i.e., the inhibition is under local control and not dependent on any receive event
    - Globally inhibitory: in some execution, some delayed event is not locally delayed
  - Classification 2: send inhibitory / receive inhibitory / internal event inhibitory
Classifications and Basic Concepts (5)
Synchronous vs. asynchronous systems
  - Synchronous:
    - upper bound on message delay
    - known bounded drift rate of the clock w.r.t. real time
    - known upper bound for a process to execute a logical step
  - Asynchronous: the above criteria are not satisfied
  - There is a spectrum of models in which some combination of the criteria is satisfied
  - The algorithm to solve a problem depends greatly on this model
  - Distributed systems are inherently asynchronous
On-line vs. off-line (control) algorithms
  - On-line: executes as the data is being generated
    - Clear advantages for debugging, scheduling, etc.
  - Off-line: requires all the (trace) data before execution begins
Classification and Basic Concepts (6)
Wait-free algorithms (for synchronization operations)
  - resilient to n − 1 process failures, i.e., the operations of any process must complete in a bounded number of steps, irrespective of the other processes
  - very robust, but expensive
  - possible to design for mutual exclusion
  - may not always be possible to design, e.g., for the producer-consumer problem
Communication channels
  - point-to-point: FIFO or non-FIFO
  - At the application layer, FIFO is usually provided by the network stack
Classifications and Basic Concepts (7)
Process failures (sync + async systems), in order of increasing severity:
  - Fail-stop: a properly functioning process stops execution. Other processes learn about the failed process (through some mechanism).
  - Crash: a properly functioning process stops execution. Other processes do not learn about the failed process.
  - Receive omission: a properly functioning process fails by receiving only some of the messages that have been sent to it, or by crashing.
  - Send omission: a properly functioning process fails by sending only some of the messages it is supposed to send, or by crashing. Incomparable with the receive omission model.
  - General omission: send omission + receive omission.
  - Byzantine (or malicious) failure, with authentication: a process may (mis)behave in any manner, including sending fake messages. Authentication facility: if a faulty process claims to have received a message from a correct process, that claim is verifiable.
  - Byzantine (or malicious) failure, without authentication.
The non-malicious failure models are benign.
Classifications and Basic Concepts (8)
Process failures (contd.)
  - Timing failures (sync systems): general omission failures, or clocks violating their specified drift rates, or a process violating the bounds on the time to execute a step. More severe than general omission failures.
Failure models influence the design of algorithms.
Link failures
  - Crash failure: a properly functioning link stops carrying messages
  - Omission failure: the link carries only some of the messages sent on it, not others
  - Byzantine failure: the link exhibits arbitrary behavior, including creating fake messages and altering messages sent on it
Link failures (contd.)
  - Timing failures (sync systems): messages are delivered faster or slower than the specified behavior
Complexity Measures and Metrics
Each metric is specified using a lower bound (Ω), an upper bound (O), or an exact bound (Θ).
Metrics:
  - Space complexity per node
  - System-wide space complexity (≠ n × the space complexity per node). E.g., the worst case may never occur at all nodes simultaneously!
  - Time complexity per node
  - System-wide time complexity. Do the nodes execute fully concurrently?
  - Message complexity
    - Number of messages (affects the space complexity of the message overhead)
    - Size of messages (affects the space complexity of the message overhead, plus a time component via increased transmission time)
  - Message time complexity: depends on the number of messages, the size of messages, and the concurrency in sending and receiving messages
  - Other metrics: # send and # receive events; # multicasts, and how they are implemented
  - (Shared memory systems): size of shared memory; # synchronization operations
Program Structure
Communicating Sequential Processes (CSP)-like:
    *[ G1 → CL1 || G2 → CL2 || · · · || Gk → CLk ]
The repetitive command * denotes an infinite loop.
Inside it, the alternative command || is over guarded commands; it specifies the execution of exactly one of its constituent guarded commands.
Guarded command syntax: G → CL
  - the guard G is a boolean expression,
  - CL is the list of commands to be executed if G is true.
A guard may check for message arrival from another process.
The alternative command fails if all the guards fail; if more than one guard is true, one of them is nondeterministically chosen for execution.
Gm → CLm: Gm and CLm are executed atomically.
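This construct can be approximated in an ordinary language by repeatedly evaluating the guards and firing exactly one enabled command. A minimal Python sketch under that reading (the state dictionary, inbox list, and guard predicates are illustrative assumptions, not CSP syntax):

import random

def guarded_loop(state, inbox):
    # *[ G1 -> CL1 || G2 -> CL2 ]: repeatedly fire exactly one enabled guarded command.
    guards = [
        # each (guard, command) pair plays the role of one "Gm -> CLm"
        (lambda: bool(inbox),        lambda: state.update(last=inbox.pop(0))),
        (lambda: state["count"] < 3, lambda: state.update(count=state["count"] + 1)),
    ]
    while True:
        enabled = [cmd for guard, cmd in guards if guard()]
        if not enabled:              # the alternative command fails if all guards fail
            break
        random.choice(enabled)()     # if more than one guard is true, choose nondeterministically

state = {"count": 0, "last": None}
guarded_loop(state, inbox=["m1", "m2"])
print(state)                         # e.g. {'count': 3, 'last': 'm2'}

The loop exits when every guard is false, mirroring the failure of the alternative command.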
Basic Distributed Graph Algorithms: Listing
Sync 1-initiator ST (flooding)
MST, sync
Async 1-initiator ST (flooding)
MST, async
Async concurrent-initiator ST (flooding)
Synchronizers: simple, α, β, γ
Async DFS ST
MIS, async, randomized
Broadcast & convergecast on a tree
CDS
Sync 1-source shortest path
Compact routing tables
Distance vector routing
Leader election: LCR algorithm
Async 1-source shortest path
Dynamic object replication
All-sources shortest paths: Floyd-Warshall
Sync, async constrained flooding
Sync 1-initiator ST (flooding)
(local variables)
int visited, depth ← 0
int parent ← ⊥
set of int Neighbors ← set of neighbors
(message types)
QUERY
(1) if i = root then
(2)   visited ← 1;
(3)   depth ← 0;
(4)   send QUERY to Neighbors;
(5) for round = 1 to diameter do
(6)   if visited = 0 then
(7)     if any QUERY messages arrive then
(8)       parent ← randomly select a node from which QUERY was received;
(9)       visited ← 1;
(10)      depth ← round;
(11)      send QUERY to Neighbors \ {senders of QUERYs received in this round};
(12)  delete any QUERY messages that arrived in this round.
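A round-by-round Python simulation of this synchronous flooding may help fix ideas; the graph, node names, and the random tie-break among QUERY senders are illustrative assumptions:

import random

def sync_spanning_tree(adj, root, diameter):
    # Round-synchronous flooding: each unvisited node adopts as parent one sender
    # of a QUERY received in the current round, then floods onward.
    visited = {v: (v == root) for v in adj}
    parent = {v: None for v in adj}
    depth = {v: 0 if v == root else None for v in adj}
    pending = {v: [] for v in adj}          # QUERYs to be delivered next round: receiver -> senders
    for nbr in adj[root]:
        pending[nbr].append(root)
    for rnd in range(1, diameter + 1):
        delivered, pending = pending, {v: [] for v in adj}
        for v in adj:
            senders = delivered[v]
            if not visited[v] and senders:
                parent[v] = random.choice(senders)   # line (8): pick any QUERY sender
                visited[v] = True
                depth[v] = rnd
                for nbr in set(adj[v]) - set(senders):
                    pending[nbr].append(v)           # line (11): flood to the remaining neighbors
            # QUERYs arriving at already-visited nodes are simply discarded (line 12)
    return parent, depth

# Illustrative 6-node graph in the spirit of Fig. 5.2; diameter is 3.
adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "E"],
       "D": ["B", "E", "F"], "E": ["C", "D", "F"], "F": ["D", "E"]}
print(sync_spanning_tree(adj, root="A", diameter=3))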
Synchronous 1-init Spanning Tree: Example
Figure 5.2: Tree in boldface; round numbers of QUERY are labeled
Designated root. Node A in example.
Each node identifies parent
How to identify child nodes?
Synchronous 1-init Spanning Tree: Complexity
Termination: after diameter rounds.
How can a process terminate after setting its parent?
Complexity:
  - Local space: O(degree)
  - Global space: O(Σ local space)
  - Local time: O(degree + diameter)
  - Message time complexity: d rounds or message hops
  - Message complexity: between 1 and 2 messages per edge; thus in the range [l, 2l]
Spanning tree: analogous to breadth-first search
Asynchronous 1-init Spanning Tree: Code
(local variables)
int parent ← ⊥
set of int Children, Unrelated ← ∅
set of int Neighbors ← set of neighbors
(message types)
QUERY, ACCEPT, REJECT
(1) When the predesignated root node wants to initiate the algorithm:
(1a) if (i = root and parent = ⊥) then
(1b)   send QUERY to all neighbors;
(1c)   parent ← i.
(2) When QUERY arrives from j:
(2a) if parent = ⊥ then
(2b)   parent ← j;
(2c)   send ACCEPT to j;
(2d)   send QUERY to all neighbors except j;
(2e)   if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(2f)     terminate.
(2g) else send REJECT to j.
(3) When ACCEPT arrives from j:
(3a) Children ← Children ∪ {j};
(3b) if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(3c)   terminate.
(4) When REJECT arrives from j:
(4a) Unrelated ← Unrelated ∪ {j};
(4b) if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(4c)   terminate.
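A minimal event-driven Python sketch of this asynchronous algorithm, processing QUERY/ACCEPT/REJECT messages from a single FIFO queue (the example graph and the single-queue delivery order are simplifying assumptions; any other delivery order would yield some, possibly different, spanning tree):

from collections import deque

def async_spanning_tree(adj, root):
    parent = {v: None for v in adj}
    children = {v: set() for v in adj}
    unrelated = {v: set() for v in adj}
    terminated = set()
    queue = deque()

    def check_termination(i):
        # (2e)/(3b)/(4b): responses received from all non-parent neighbors
        if children[i] | unrelated[i] == set(adj[i]) - {parent[i]}:
            terminated.add(i)

    parent[root] = root                                   # (1c)
    for nbr in adj[root]:
        queue.append(("QUERY", root, nbr))                # (1b)

    while queue:
        kind, frm, to = queue.popleft()
        if kind == "QUERY":
            if parent[to] is None:                        # (2a)-(2f)
                parent[to] = frm
                queue.append(("ACCEPT", to, frm))
                queue.extend(("QUERY", to, nbr) for nbr in adj[to] if nbr != frm)
                check_termination(to)
            else:
                queue.append(("REJECT", to, frm))         # (2g)
        elif kind == "ACCEPT":                            # (3)
            children[to].add(frm); check_termination(to)
        else:                                             # (4) REJECT
            unrelated[to].add(frm); check_termination(to)
    return parent, terminated

adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "D"], "D": ["B", "C"]}
print(async_spanning_tree(adj, root="A"))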
Async 1-init Spanning Tree: Operation
root initiates flooding of QUERY to identify tree edges
parent: the first node from which QUERY is received
  - ACCEPT (+ve response) sent in reply; QUERY sent to the other neighbors
  - Termination: when ACCEPT or REJECT (-ve response) has been received from all non-parent neighbors. Why?
A QUERY from a non-parent is replied to with REJECT
Is it necessary to track neighbors, to determine the children and when to terminate?
Why is the REJECT message type required?
Can the use of REJECT messages be eliminated? How? With what impact?
Asynchronous 1-init Spanning Tree: Complexity
Local termination: after receiving ACCEPT or REJECT from all non-parent neighbors.
Complexity:
  - Local space: O(degree)
  - Global space: O(Σ local space)
  - Local time: O(degree)
  - Message complexity: between 2 and 4 messages per edge; thus in the range [2l, 4l]
  - Message time complexity: d + 1 message hops
Spanning tree: no claim can be made about its shape. Worst case height: n − 1
Asynchronous 1-init Spanning Tree: Example
Figure 5.3: Tree in boldface; numbers indicate the approximate order in which QUERYs get sent
Designated root. Node A in example.
tree edges: QUERY + ACCEPT msgs
cross-edges and back-edges: 2(QUERY + REJECT) msgs
Asynchronous Spanning Tree: Concurrent Initiators
Figure 5.4: Concurrent initiators A, G, J
No pre-designated root: any node may spontaneously initiate the algorithm and become a root.
Option 1: Merge the partial STs. Difficult based on local knowledge; can lead to cycles.
Option 2: Allow one ST computation instance to proceed and suppress the others.
  - Each root initiates a variant of the 1-initiator algorithm; lower-priority instances are suppressed at intermediate nodes.
  - This option is used by the algorithm below, which lets the root with the higher process id continue.
  - On receiving QUERY(newroot), there are 3 cases: newroot <, =, or > myroot.
Termination: only the root detects termination; it needs to send extra messages to inform the others.
Time complexity: O(l)
Message complexity: O(nl)
Asynchronous Spanning Tree: Code (1/2)
(local variables)
int parent, myroot ← ⊥
set of int Children, Unrelated ← ∅
set of int Neighbors ← set of neighbors
(message types)
QUERY, ACCEPT, REJECT
(1) When the node wants to initiate the algorithm as a root:
(1a) if (parent = ⊥) then
(1b)   send QUERY(i) to all neighbors;
(1c)   parent, myroot ← i.
(2) When QUERY(newroot) arrives from j:
(2a) if myroot < newroot then   // discard the earlier partial execution due to its lower priority
(2b)   parent ← j; myroot ← newroot; Children, Unrelated ← ∅;
(2c)   send QUERY(newroot) to all neighbors except j;
(2d)   if Neighbors = {j} then
(2e)     send ACCEPT(myroot) to j; terminate.   // leaf node
(2f) else send REJECT(newroot) to j.
// if newroot = myroot then the parent is already identified.
// if newroot < myroot, ignore the QUERY. j will update its root when it receives QUERY(myroot).
Asynchronous Spanning Tree: Code (2/2)
(3) When ACCEPT(newroot) arrives from j:
(3a) if newroot = myroot then
(3b)   Children ← Children ∪ {j};
(3c)   if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(3d)     if i = myroot then
(3e)       terminate.
(3f)     else send ACCEPT(myroot) to parent.
// if newroot < myroot then ignore the message. newroot > myroot will never occur.
(4) When REJECT(newroot) arrives from j:
(4a) if newroot = myroot then
(4b)   Unrelated ← Unrelated ∪ {j};
(4c)   if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(4d)     if i = myroot then
(4e)       terminate.
(4f)     else send ACCEPT(myroot) to parent.
// if newroot < myroot then ignore the message. newroot > myroot will never occur.
Asynchronous DFS Spanning Tree
Handle concurrent initiators just as for the non-DFS algorithm examined above.
When QUERY, ACCEPT, or REJECT arrives: the actions depend on whether myroot <, =, or > newroot.
Termination: only the successful root detects termination. It informs the others using the ST edges.
Time complexity: O(l)
Message complexity: O(nl)
Asynchronous DFS Spanning Tree: Code
(local variables)
int parent, myroot ← ⊥
set of int Children ← ∅
set of int Neighbors, Unknown ← set of neighbors
(message types)
QUERY, ACCEPT, REJECT
(1) When the node wants to initiate the algorithm as a root:
(1a) if (parent = ⊥) then
(1b)   send QUERY(i) to i (itself).
(2) When QUERY(newroot) arrives from j:
(2a) if myroot < newroot then
(2b)   parent ← j; myroot ← newroot; Unknown ← set of neighbours;
(2c)   Unknown ← Unknown \ {j};
(2d)   if Unknown ≠ ∅ then
(2e)     delete some x from Unknown;
(2f)     send QUERY(myroot) to x;
(2g)   else send ACCEPT(myroot) to j;
(2h) else if myroot = newroot then
(2i)   send REJECT to j.
// if newroot < myroot, ignore the query.
// j will update its root to a higher root identifier when it receives its QUERY.
(3) When ACCEPT(newroot) or REJECT(newroot) arrives from j:
(3a) if newroot = myroot then
(3b)   if an ACCEPT message arrived then
(3c)     Children ← Children ∪ {j};
(3d)   if Unknown = ∅ then
(3e)     if parent ≠ i then
(3f)       send ACCEPT(myroot) to parent;
(3g)     else set i as the root; terminate.
(3h)   else
(3i)     delete some x from Unknown;
(3j)     send QUERY(myroot) to x.
// if newroot < myroot, ignore the message. Since sending QUERY to j, i has updated its myroot.
// j will update its myroot to a higher root identifier when it receives a QUERY initiated by it. newroot > myroot will never occur.
Broadcast and Convergecast on a Tree (1)
Figure 5.5: Tree structure for broadcast and convergecast. The broadcast is initiated by the root; the convergecast is initiated by the leaves. Tree edges, cross-edges, and back-edges are shown.
Question:
how to perform BC and CC on a ring? on a mesh?
Costs?
Broadcast and Convergecast on a Tree (2)
Broadcast: distribute information
BC1. The root sends the info to be broadcast to all its children. Terminate.
BC2. When a (non-root) node receives the info from its parent, it copies it and forwards it to its children. Terminate.
Convergecast: collect info at the root, to compute a global function (a Python sketch follows below)
CVC1. A leaf node sends its report to its parent. Terminate.
CVC2. At a non-leaf node that is not the root: when a report is received from all the child nodes, the collective report is sent to the parent. Terminate.
CVC3. At the root: when a report is received from all the child nodes, the global function is evaluated using the reports. Terminate.
Uses: compute min or max, leader election, compute global state functions
Time complexity: O(h); Message complexity: n − 1 messages for a BC or a CVC
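A compact recursive Python sketch of the convergecast (computing a global minimum at the root); the tree and values are illustrative assumptions, and the broadcast of the result back down the tree is omitted:

def convergecast_min(tree_children, values, root):
    # CVC1-CVC3: each node reports the minimum value in its subtree to its parent;
    # the root evaluates the global function (here: min) over the reports.
    def subtree_min(node):
        reports = [subtree_min(child) for child in tree_children.get(node, [])]
        return min([values[node]] + reports)
    return subtree_min(root)

# Illustrative tree: root A with children B, C; B has children D, E.
tree_children = {"A": ["B", "C"], "B": ["D", "E"]}
values = {"A": 7, "B": 3, "C": 9, "D": 1, "E": 5}
print(convergecast_min(tree_children, values, root="A"))   # 1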
Single Source Shortest Path: Sync Bellman-Ford
Weighted graph; no cycles with negative weight
No node has a global view; only local topology is known
Assumption: each node knows n; this is needed for termination
After k rounds: the length at any node is the length of the shortest path to it using at most k hops
After k rounds: the lengths of all nodes up to k hops away along the final shortest paths have stabilized
Termination: n − 1 rounds
Time complexity: n − 1 rounds
Message complexity: (n − 1) · l messages
Sync Distributed Bellman-Ford: Code
(local variables)
int length ← ∞
int parent ← ⊥
set of int Neighbors ← set of neighbors
set of int {weight_{i,j}, weight_{j,i} | j ∈ Neighbors} ← the known values of the weights of incident links
(message types)
UPDATE
(1) if i = i0 then length ← 0;
(2) for round = 1 to n − 1 do
(3)   send UPDATE(i, length) to all neighbors;
(4)   await UPDATE(j, length_j) from each j ∈ Neighbors;
(5)   for each j ∈ Neighbors do
(6)     if (length > (length_j + weight_{j,i})) then
(7)       length ← length_j + weight_{j,i}; parent ← j.
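A round-synchronous Python sketch of the algorithm above; the edge weights and node names are illustrative assumptions, and message exchange is modelled by reading each neighbor's length from the previous round:

def sync_bellman_ford(weights, nodes, source):
    # weights[(i, j)] is the (symmetric) weight of edge (i, j).
    INF = float("inf")
    length = {v: (0 if v == source else INF) for v in nodes}
    parent = {v: None for v in nodes}
    neighbors = {v: set() for v in nodes}
    for (i, j) in weights:
        neighbors[i].add(j); neighbors[j].add(i)
    for _ in range(len(nodes) - 1):                # n - 1 rounds suffice
        received = {v: length[v] for v in nodes}   # the UPDATEs sent at the start of the round
        for v in nodes:
            for j in neighbors[v]:
                w = weights.get((j, v), weights.get((v, j)))
                if received[j] + w < length[v]:    # relax using the neighbor's reported length
                    length[v] = received[j] + w
                    parent[v] = j
    return length, parent

weights = {("A", "B"): 4, ("A", "C"): 1, ("C", "B"): 2, ("B", "D"): 5, ("C", "D"): 8}
print(sync_bellman_ford(weights, nodes="ABCD", source="A"))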
Distance Vector Routing
Used in Internet routing (popular up to the mid-1980s) over a dynamically changing graph, where link weights model delay/load
Variant of the synchronous Bellman-Ford; the outer for loop runs forever
Tracks the shortest path to every destination
length is replaced by LENGTH[1..n]; parent is replaced by PARENT[1..n]
The kth component, LENGTH[k], denotes the best-known length to destination k
In each iteration:
  - apply the triangle inequality for each destination independently
  - Triangle inequality test: LENGTH[k] > (LENGTH_j[k] + weight_{j,i})
  - Node i estimates weight_{i,j} using the RTT or queuing delay to neighbor j
Single Source Shortest Path: Async Bellman-Ford
Weighted graph; no cycles with negative weight
No node has a global view; only local topology is known
Exponential Ω(c^n) number of messages and exponential Ω(c^n · d) time complexity in the worst case, where c is some constant
If all links have equal weight, the algorithm computes the minimum-hop paths; the minimum-hop routing tables to all destinations are computed using O(n^2 · l) messages
Async Distributed Bellman-Ford: Code
(local variables)
int length ← ∞, parent ← ⊥
set of int Neighbors ← set of neighbors
set of int {weight_{i,j}, weight_{j,i} | j ∈ Neighbors} ← the known values of the weights of incident links
(message types)
UPDATE
(1) if i = i0 then
(1a)   length ← 0;
(1b)   send UPDATE(i0, 0) to all neighbours; terminate.
(2) When UPDATE(i0, length_j) arrives from j:
(2a)   if (length > (length_j + weight_{j,i})) then
(2b)     length ← length_j + weight_{j,i}; parent ← j;
(2c)     send UPDATE(i0, length) to all neighbors.
All-All Shortest Paths: Floyd-Warshall
[Figure: (a) LENGTH[s, t] versus LENGTH[s, pivot] + LENGTH[pivot, t], where each segment passes only through nodes in {1, 2, ..., pivot − 1}; (b) the hops s, VIA(s, t), VIA(VIA(s, t), t), ... along a branch of the sink tree towards t]
Figure 5.6: (a) Triangle inequality for Floyd-Warshall algorithm. (b) VIA
relationships along a branch of the sink tree for a given (s, t) pair
All-All Shortest Paths: Floyd-Warshall
Invariant
After pivot iterations of the outer loop, LENGTH[i, j] is the length of the shortest path from i to j going through intermediate nodes only from the set {1, . . . , pivot}. VIA[i, j] is the corresponding first hop.
(1) for pivot = 1 to n do
(2)   for s = 1 to n do
(3)     for t = 1 to n do
(4)       if LENGTH[s, pivot] + LENGTH[pivot, t] < LENGTH[s, t] then
(5)         LENGTH[s, t] ← LENGTH[s, pivot] + LENGTH[pivot, t];
(6)         VIA[s, t] ← VIA[s, pivot].
Complexity (centralized): O(n^3)
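A direct Python transcription of the centralized algorithm, keeping both LENGTH and VIA; the 4-node weight matrix is an illustrative assumption:

def floyd_warshall(n, weight):
    # weight[i][j] is the edge weight or inf; nodes are 0..n-1.
    INF = float("inf")
    LENGTH = [[0 if i == j else weight[i][j] for j in range(n)] for i in range(n)]
    VIA = [[j if LENGTH[i][j] < INF and i != j else None for j in range(n)] for i in range(n)]
    for pivot in range(n):
        for s in range(n):
            for t in range(n):
                if LENGTH[s][pivot] + LENGTH[pivot][t] < LENGTH[s][t]:
                    LENGTH[s][t] = LENGTH[s][pivot] + LENGTH[pivot][t]
                    VIA[s][t] = VIA[s][pivot]          # first hop inherited from the s-pivot path
    return LENGTH, VIA

INF = float("inf")
weight = [[0, 3, INF, 7],
          [3, 0, 2, INF],
          [INF, 2, 0, 1],
          [7, INF, 1, 0]]
LENGTH, VIA = floyd_warshall(4, weight)
print(LENGTH[0][3], VIA[0][3])   # 6 (path 0->1->2->3), first hop 1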
Distributed Floyd-Warshall (1)
Row i of LENGTH[1..n, 1..n] and of VIA[1..n, 1..n] is stored at node i, which is responsible for updating that row. (So node i acts as the source.)
Corresponding to line (4) of the centralized algorithm:
  - How does node i access the remote datum LENGTH[pivot, t] in each iteration pivot?
    - Distributed (dynamic) sink tree: in any iteration pivot, all nodes s with LENGTH[s, t] ≠ ∞ are on a sink tree, with the sink at t
  - How is the execution of the outer loop iterations synchronized at the different nodes? (Otherwise the algorithm goes wrong.)
    - Simulate a synchronizer: e.g., use a blocking receive to get the data LENGTH[pivot, ·] from the parent on the sink tree
Distributed Floyd-Warshall: Data structures
(local variables)
array of int LEN[1..n]       // LEN[j] is the length of the shortest known path from i to node j;
                             // LEN[j] = weight_{i,j} for a neighbor j, 0 for j = i, ∞ otherwise
array of int PARENT[1..n]    // PARENT[j] is the parent of node i (myself) on the sink tree rooted at j;
                             // PARENT[j] = j for a neighbor j, ⊥ otherwise
set of int Neighbours ← set of neighbors
int pivot, nbh ← 0
(message types)
IN_TREE(pivot), NOT_IN_TREE(pivot), PIV_LEN(pivot, PIVOT_ROW[1..n])
// PIVOT_ROW[k] is LEN[k] of node pivot, which is LENGTH[pivot, k] in the centralized algorithm;
// the PIV_LEN message is used to convey PIVOT_ROW.
Distributed Floyd-Warshall: Code
(1) for pivot = 1 to n do
(2)   for each neighbour nbh ∈ Neighbours do
(3)     if PARENT[pivot] = nbh then
(4)       send IN_TREE(pivot) to nbh;
(5)     else send NOT_IN_TREE(pivot) to nbh;
(6)   await an IN_TREE or NOT_IN_TREE message from each neighbour;
(7)   if LEN[pivot] ≠ ∞ then
(8)     if pivot ≠ i then
(9)       receive PIV_LEN(pivot, PIVOT_ROW[1..n]) from PARENT[pivot];
(10)    for each neighbour nbh ∈ Neighbours do
(11)      if an IN_TREE message was received from nbh then
(12)        if pivot = i then
(13)          send PIV_LEN(pivot, LEN[1..n]) to nbh;
(14)        else send PIV_LEN(pivot, PIVOT_ROW[1..n]) to nbh;
(15)    for t = 1 to n do
(16)      if LEN[pivot] + PIVOT_ROW[t] < LEN[t] then
(17)        LEN[t] ← LEN[pivot] + PIVOT_ROW[t];
(18)        PARENT[t] ← PARENT[pivot].
Distributed Floyd-Warshall: Dynamic Sink Tree
Rename LENGTH[i, j] and VIA[i, j] as LEN[j] and PARENT[j] in the distributed algorithm
  ⇒ LENGTH[i, pivot] is LEN[pivot]
At any node i, in iteration pivot: if LEN[pivot] ≠ ∞ at node i, then pivot distributes LEN[·] to all nodes (including i) in the sink tree of pivot
The parent-child edges in the sink tree need to be identified. How?
  1. A node sends IN_TREE to PARENT[pivot] and NOT_IN_TREE to its other neighbors
  2. Receiving IN_TREE from k ⇒ k is a child in the sink tree of pivot
Await IN_TREE or NOT_IN_TREE from each neighbor. This send-receive is synchronization!
pivot broadcasts LEN[·] down its sink tree. This send-receive is synchronization!
Now all nodes execute the triangle inequality in pseudo lock-step
Time complexity: O(n^2) execution per node, plus the time for n broadcasts
Message complexity: n iterations;
  - 2 IN_TREE or NOT_IN_TREE msgs of size O(1) per edge: O(l) msgs
  - n − 1 PIV_LEN msgs of size O(n): O(n) msgs
Total: O(n(l + n)) messages; total O(n·l + n^3) message space
Distributed Floyd-Warshall: Sink Tree
Figure 5.7: Identifying parent-child nodes in sink tree
Constrained Flooding (no ST)
FIFO channels; duplicates detected using sequence numbers
Asynchronous flooding:
  - used by Link State Routing in IPv4
  - Complexity: 2l messages in the worst case; Time: d sequential hops
Synchronous flooding (to learn one datum from each processor):
  - STATEVEC[k] is the estimate of k's datum
  - Message complexity: 2·l·d messages, each of size n
  - Time complexity: d rounds
Async Constrained Flooding (no ST)
(local variables)
array of int SEQNO[1..n] ← 0
set of int Neighbors ← set of neighbors
(message types)
UPDATE
(1) To send a message M:
(1a) if i = root then
(1b)   SEQNO[i] ← SEQNO[i] + 1;
(1c)   send UPDATE(M, i, SEQNO[i]) to each j ∈ Neighbors.
(2) When UPDATE(M, j, seqno_j) arrives from k:
(2a) if SEQNO[j] < seqno_j then
(2b)   Process the message M;
(2c)   SEQNO[j] ← seqno_j;
(2d)   send UPDATE(M, j, seqno_j) to Neighbors \ {k}
(2e) else discard the message.
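A FIFO-queue Python simulation of this asynchronous constrained flooding; the graph and payload are illustrative assumptions:

from collections import deque

def constrained_flood(adj, origin, payload):
    # Each node keeps the highest sequence number seen per origin and forwards only fresh UPDATEs.
    seqno = {v: {u: 0 for u in adj} for v in adj}    # SEQNO[1..n] at each node
    delivered = []                                   # (node, payload) processing log
    queue = deque()

    seqno[origin][origin] += 1                       # (1b): new local sequence number
    for nbr in adj[origin]:
        queue.append((nbr, origin, payload, origin, seqno[origin][origin]))

    while queue:
        at, frm, msg, src, s = queue.popleft()
        if seqno[at][src] < s:                       # (2a): first time this UPDATE is seen
            delivered.append((at, msg))
            seqno[at][src] = s
            for nbr in adj[at]:
                if nbr != frm:
                    queue.append((nbr, at, msg, src, s))
        # else (2e): duplicate, discard
    return delivered

adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "D"], "D": ["B", "C"]}
print(constrained_flood(adj, origin="A", payload="hello"))   # each other node processes "hello" once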
Sync Constrained Flooding (no ST)
The algorithm lets each node learn every node's identifier (local value).
(local variables)
array of int STATEVEC[1..n] ← 0
set of int Neighbors ← set of neighbors
(message types)
UPDATE
(1) STATEVEC[i] ← local value;
(2) for round = 1 to diameter d do
(3)   send UPDATE(STATEVEC[1..n]) to each j ∈ Neighbors;
(4)   for count = 1 to |Neighbors| do
(5)     await UPDATE(SV[1..n]) from some j ∈ Neighbors;
(6)     STATEVEC[1..n] ← max(STATEVEC[1..n], SV[1..n]).
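A short Python sketch of this synchronous variant, assuming (illustratively) that all local values are positive so that 0 can serve as the "unknown" placeholder in STATEVEC:

def sync_flood_all(adj, values, diameter):
    # For `diameter` rounds, every node exchanges its state vector with all neighbors
    # and takes the elementwise max; afterwards every node knows every value.
    nodes = list(adj)
    state = {v: {u: (values[u] if u == v else 0) for u in nodes} for v in nodes}
    for _ in range(diameter):
        snapshot = {v: dict(state[v]) for v in nodes}   # what each node sends this round
        for v in nodes:
            for nbr in adj[v]:
                for u in nodes:
                    state[v][u] = max(state[v][u], snapshot[nbr][u])
    return state

adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
values = {"A": 4, "B": 7, "C": 2, "D": 9}
print(sync_flood_all(adj, values, diameter=3)["A"])   # {'A': 4, 'B': 7, 'C': 2, 'D': 9}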
Minimum Spanning Tree (MST): Overview
Assume an undirected weighted graph. If weights are not unique, assume some tie-breaker such as node IDs is used to impose a total order on edge weights.
Review definitions: forest, spanning forest, spanning tree, MST
Kruskal's MST (a sequential Python sketch follows after this list):
  - Begin with a forest of graph components
  - Maintain a sorted list of edges
  - In each of n − 1 iterations, identify the minimum weight edge that connects two different components
  - Include that edge in the MST
  - O(l log l)
Prim's MST:
  - Begin with a single-node component
  - In each of n − 1 iterations, select the minimum weight edge incident on the component; the component expands using this selected edge
  - O(n^2), or O(l + n log n) using Fibonacci heaps
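A sequential Kruskal sketch in Python (union-find with path halving); the edge list is an illustrative assumption. GHS, discussed next, distributes exactly this "join components via minimum-weight edges" strategy:

def kruskal_mst(nodes, edges):
    # Scan edges in increasing weight order; add an edge iff it connects two different components.
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    mst = []
    for w, u, v in sorted(edges):           # ties could be broken further by node ids
        ru, rv = find(u), find(v)
        if ru != rv:                         # the edge connects two components
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

nodes = "ABCD"
edges = [(1, "A", "B"), (4, "B", "C"), (3, "A", "C"), (2, "C", "D"), (5, "B", "D")]
print(kruskal_mst(nodes, edges))   # [('A', 'B', 1), ('C', 'D', 2), ('A', 'C', 3)]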
GHS Synchronous MST Algorithm: Overview
The Gallagher-Humblet-Spira (GHS) distributed MST uses Kruskal's strategy: begin with a forest of graph components.
MWOE (minimum weight outgoing edge): "outgoing" is logical, i.e., it indicates the direction of expansion of the component
The spanning trees of connected components combine with the MWOEs to still retain the spanning tree property in the combined component
Concurrently combine MWOEs:
  - after k iterations, ≤ n / 2^k components ⇒ at most log n iterations
Each component has a leader node in an iteration
Each iteration within a component has 5 steps, triggered by the leader:
  - broadcast-convergecast phase: the leader identifies the MWOE
  - broadcast phase: the (potential) leader for the next iteration is identified
  - broadcast phase: among the merging components, 1 leader is selected; it identifies itself to all nodes in the new component
Minimum Weight Outgoing Edge: Example
Figure 5.8: Merging of MWOE components. (a) A cycle of length 2 is possible. (b) A cycle of length > 2 is not possible.
Observation 5.1
For any spanning forest {(N_i, L_i) | i = 1 . . . k} of graph G, consider any component (N_j, L_j). Denote by λ_j the edge having the smallest weight among those that are incident on only one node in N_j. Then an MST for G that includes all the edges in each L_i in the spanning forest must also include edge λ_j.
MST Example
Figure 5.9: Phases within an iteration in a component.
(a) Root broadcasts SEARCH MWOE; (b) Convergecast REPLY MWOE occurs.
(c) Root broadcasts ADD MWOE; (d) If the MWOE is also chosen as the MWOE
by the component at the other end of the MWOE, the incident process with the
higher ID is the leader for the next iteration; and broadcasts NEW LEADER.
Sync GHS: Message Types
(message types:)
SEARCH_MWOE(leader)                 // broadcast by the current leader on tree edges
EXAMINE(leader)                     // sent on non-tree edges after receiving SEARCH_MWOE
REPLY_MWOES(local_ID, remote_ID)    // details of potential MWOEs are convergecast to the leader
ADD_MWOE(local_ID, remote_ID)       // sent by the leader to add the MWOE and identify the new leader
NEW_LEADER(leader)                  // broadcast by the new leader after merging components
Sync GHS: Code
leader ← i;
for round = 1 to log(n) do    // each merger in each iteration involves at least two components
  (1) if leader = i then
        broadcast SEARCH_MWOE(leader) along the marked edges of the tree.
  (2) On receiving a SEARCH_MWOE(leader) message that was broadcast on marked edges:
      (a) Each process i (including leader) sends an EXAMINE message along unmarked (i.e., non-tree) edges to determine if the other end of the edge is in the same component (i.e., whether its leader is the same).
      (b) From among all incident edges at i for which the other end belongs to a different component, process i picks its incident MWOE(localID, remoteID).
  (3) The leaf nodes in the MST within the component initiate the convergecast using REPLY_MWOES, informing their parent of their MWOE(localID, remoteID). All the nodes participate in this convergecast.
  (4) if leader = i then
        await the convergecast replies along the marked edges.
        Select the minimum MWOE(localID, remoteID) from all the replies.
        broadcast ADD_MWOE(localID, remoteID) along the marked edges of the tree.
        // This asks process localID to mark the (localID, remoteID) edge, i.e., include it in the MST of the component.
  (5) if an MWOE edge gets marked by both the components on which it is incident then
      (a) Define new_leader as the process with the larger ID on which that MWOE is incident (i.e., the process whose ID is max(localID, remoteID)).
      (b) new_leader identifies itself as the leader for the next round.
      (c) new_leader broadcasts NEW_LEADER in the newly formed component along the marked edges, announcing itself as the leader for the next round.
GHS: Complexity
log n rounds (synchronous)
Time complexity: O(n log n)
Message complexity:
  - In each iteration, O(n) msgs along tree edges (steps 1, 3, 4, 5)
  - In each iteration, l EXAMINE msgs to determine the MWOEs
  - Hence, O((n + l) log n) messages
Correctness requires synchronous operation
  - In step (2), EXAMINE is used to determine whether an unmarked neighbor belongs to the same component. If the endpoints of an unmarked edge are in different rounds (levels), there is a problem!
  - Consider EXAMINE sent on an edge (j, k) whose endpoints belong to the same component. But k may not yet have learnt that it belongs to the new component with the new leader ID, and so it replies positively.
  - This can lead to cycles.
MST (asynchronous)
The synchronous GHS can be simulated asynchronously using extra messages/steps.
  - The new leader does a BC/CC on the marked edges of the new component.
    - In step (2), the recipient of EXAMINE can delay its response if it is still in an old round
    - n log n extra messages overall
  - On getting involved in a new round, inform each neighbor
    - Send EXAMINE only when all neighbors along unmarked edges are in the same round
    - l log n extra messages overall
Engineering the asynchronous GHS:
  - messages: O(n log n + l); time: O(n · log n · (l + d))
  - Challenges:
    - determine the levels of adjacent nodes
    - repeated combining with singleton components ⇒ log n becomes n
    - if components are at different levels, coordinate the search for MWOEs and the merging
Synchronizers
Definition
A class of transformation algorithms that allow a synchronous program (designed for a synchronous system) to run on an asynchronous system.
Assumption: failure-free system
Designing a tailor-made async algorithm from scratch may be more efficient than using a synchronizer
Process safety
Process i is safe in round r if all the messages it sent in round r have been received.
Implementation key: signal to each process when it is safe to go to the next round, i.e., when all the messages it is to receive have arrived.
Synchronizers: Notation
M_a = M_s + (M_init + rounds × M_round)    (1)
T_a = T_s + T_init + rounds × T_round      (2)
M_s: # messages in the synchronous algorithm.
rounds: # rounds in the synchronous algorithm.
T_s: time for the synchronous algorithm. Assuming one unit (message hop) per round, this equals rounds.
M_round: # messages needed to simulate a round.
T_round: # sequential message hops needed to simulate a round.
M_init, T_init: # messages and # sequential message hops needed to initialize the asynchronous system.
Synchronizers: Complexity
                      M_init              T_init             M_round             T_round
Simple synchronizer   0                   d                  2|L|                1
α synchronizer        0                   0                  O(|L|)              O(1)
β synchronizer        O(n log n + |L|)    O(n)               O(n)                O(n)
γ synchronizer        O(k·n^2)            n·log n / log k    O(L_c) (≤ O(kn))    O(h_c) (≤ O(log n / log k))

The message and time complexities for the synchronizers.
h_c is the greatest height of a tree among all the clusters.
L_c is the number of tree edges and designated edges in the clustering scheme for the γ synchronizer.
d is the graph diameter.
Simple Synchronizer
A process sends each neighbor 1 message per round
Combine messages, or send a dummy message
On receiving a message from each neighbor, move to the next round
Neighbors P_i and P_j may be at most one round apart
  - P_i in round_i can receive a message only from round_i or round_i + 1 of a neighbor
Initialization:
  - Any process may start round x.
  - Within d time units, all processes will be in round x.
  - T_init = d, M_init = 0.
Complexity: M_round = 2|L|, T_round = 1.
α Synchronizer
Pi in round r moves to r + 1 if all neighbors are safe for round r .
When neighbor Pj receives ack for each message it sent, it informs Pi (and
its other neighbors) that it is safe.
Figure 5.10: Example. (a) Execution messages (1) and acknowledgements (2). (b) "I am safe" messages (3).
α Synchronizer: Complexity
Complexity:
  - l′ messages in a round require l′ acks; the transport layer acks come for free!
  - 2|L| messages per round to inform neighbors of safety.
  - M_round = O(|L|), T_round = O(1).
Initialization: none. Any process may spontaneously wake up.
β Synchronizer
Initialization: a rooted spanning tree; O(n log n + |L|) messages, O(n) time.
Operation:
  - Safe nodes initiate a convergecast (CvgC)
  - intermediate nodes propagate the CvgC when their entire subtree is safe
  - When the root becomes safe and has received the CvgC from all children, it initiates a tree broadcast to inform all nodes to move to the next round.
Complexity: the l′ acks come for free, due to the transport layer.
  - M_round = 2(n − 1)
  - T_round = 2 log n on average; 2n in the worst case
γ Synchronizer: Clusters
A set of clusters; each cluster has a spanning tree
Intra-cluster: β synchronizer over the tree edges
Inter-cluster: α synchronizer over designated inter-cluster edges. (For any 2 neighboring clusters, 1 inter-cluster edge is designated.)
Figure 5.11: Cluster organization. Only the tree edges, cluster roots, and inter-cluster designated edges are shown.
γ Synchronizer: Operation and Complexity
Within a cluster, the β synchronizer is executed
Once a cluster is stabilized, the α synchronizer is executed over the inter-cluster edges
To convey the stabilization of the inter-cluster α synchronizer, a CvgC and BC phase is run over the tree within each cluster
This CvgC is initiated by the leaf nodes once the neighboring clusters are stabilized.
M_round = O(L_c), T_round = O(h_c).
γ Synchronizer: Code
(message types)
Subtree_safe                // β synchronizer phase's convergecast within a cluster
This_cluster_safe           // β synchronizer phase's broadcast within a cluster
My_cluster_safe             // embedded inter-cluster α synchronizer's messages across cluster boundaries
Neighboring_cluster_safe    // convergecast following the inter-cluster α synchronizer phase
Next_round                  // broadcast following the inter-cluster α synchronizer phase
for each round do
  1. (β synchronizer phase:) This phase aims to detect when all the nodes within a cluster are safe, and to inform all the nodes in that cluster.
     1.1 Using the spanning tree, the leaves initiate the convergecast of the Subtree_safe message towards the root of the cluster.
     1.2 After the convergecast completes, the root initiates a broadcast of This_cluster_safe on the spanning tree within the cluster.
     1.3 (Embedded inter-cluster α synchronizer:) During this broadcast in the tree, as the nodes get engaged, the nodes also send My_cluster_safe messages on any incident designated inter-cluster edges. Each node also awaits My_cluster_safe messages along any such incident designated edges.
  2. (Convergecast and broadcast phase:) This phase aims to detect when all the neighboring clusters are safe, and to inform every node within this cluster.
     2.1 (Convergecast:) After the broadcast of the earlier phase (1.2) completes, the leaves initiate a convergecast using Neighboring_cluster_safe messages once they have received the expected My_cluster_safe messages (step (1.3)) on all the designated incident edges. An intermediate node propagates the convergecast once it receives the Neighboring_cluster_safe message from all its children, and also any expected My_cluster_safe messages (as per step (1.3)) along the designated edges incident on it.
     2.2 (Broadcast:) Once the convergecast completes at the root of the cluster, a Next_round message is broadcast in the cluster's tree to inform all the tree nodes to move to the next round.
Maximal Independent Set: Definition
For a graph (N, L), an independent set of nodes N′, where N′ ⊆ N, is such that for each i and j in N′, (i, j) ∉ L.
An independent set N′ is a maximal independent set if no strict superset of N′ is an independent set.
A graph may have multiple maximal independent sets, perhaps of varying sizes.
The largest-sized independent set is the maximum independent set; finding it is NP-complete.
Application: wireless broadcast - allocation of frequency bands (mutex)
Luby's Randomized Algorithm, Async System
Iteratively:
  - Nodes pick random numbers and exchange them with their neighbors
  - The lowest number in the neighborhood wins (is selected into the MIS)
  - If a neighbor is selected, I am eliminated (⇒ safety)
  - Only neighbors of selected nodes are eliminated (⇒ correctness)
Complexity:
  - In each iteration, at least 1 node is selected and at least 1 is eliminated, so there are at most n/2 iterations.
  - The expected number of iterations is O(log n), due to the randomized nature.
Luby's Maximal Independent Set: Code
(variables)
set of integer Neighbours                    // set of neighbours
real random_i                                // random number from a sufficiently large range
boolean selected_i                           // becomes true when Pi is included in the MIS
boolean eliminated_i                         // becomes true when Pi is eliminated from the candidate set
(message types)
RANDOM(real random)                          // a random number is sent
SELECTED(integer pid, boolean indicator)     // whether the sender was selected into the MIS
ELIMINATED(integer pid, boolean indicator)   // whether the sender was removed from the candidates
(1a) repeat
(1b)   if Neighbours = ∅ then
(1c)     selected_i ← true; exit();
(1d)   random_i ← a random number;
(1e)   send RANDOM(random_i) to each neighbour;
(1f)   await RANDOM(random_j) from each neighbour j ∈ Neighbours;
(1g)   if random_i < random_j (∀j ∈ Neighbours) then
(1h)     send SELECTED(i, true) to each j ∈ Neighbours;
(1i)     selected_i ← true; exit();                        // in MIS
(1j)   else
(1k)     send SELECTED(i, false) to each j ∈ Neighbours;
(1l)     await SELECTED(j, ·) from each j ∈ Neighbours;
(1m)     if SELECTED(j, true) arrived from some j ∈ Neighbours then
(1n)       for each j ∈ Neighbours from which SELECTED(·, false) arrived do
(1o)         send ELIMINATED(i, true) to j;
(1p)       eliminated_i ← true; exit();                    // not in MIS
(1q)     else
(1r)       send ELIMINATED(i, false) to each j ∈ Neighbours;
(1s)       await ELIMINATED(j, ·) from each j ∈ Neighbours;
(1t)       for all j ∈ Neighbours do
(1u)         if ELIMINATED(j, true) arrived then
(1v)           Neighbours ← Neighbours \ {j};
(1w) forever.
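A round-based Python simulation of Luby's algorithm; the per-round message exchange is collapsed into shared dictionaries, and the graph and random seed are illustrative assumptions:

import random

def luby_mis(adj, seed=0):
    # In each iteration every candidate draws a random number; local minima join
    # the MIS, their neighbors are eliminated, and the candidate graph shrinks.
    rng = random.Random(seed)
    candidates = set(adj)
    neighbours = {v: set(adj[v]) for v in adj}
    mis = set()
    while candidates:
        draw = {v: rng.random() for v in candidates}
        winners = {v for v in candidates
                   if all(draw[v] < draw[u] for u in neighbours[v] & candidates)}
        eliminated = {u for v in winners for u in neighbours[v]} & candidates
        mis |= winners
        candidates -= winners | eliminated
        for v in candidates:                 # remove selected/eliminated nodes from the candidate graph
            neighbours[v] -= winners | eliminated
    return mis

adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "E"],
       "D": ["B", "E"], "E": ["C", "D"]}
print(luby_mis(adj))   # prints one maximal independent set of this graph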
Maximal Independent Set: Example
Figure 5.12: (a) Winners and losers in round 1. (b) Winners up to round 1, losers in round 2.
Third round: I is the winner. MIS = {C, E, G, I, K}.
Note: {A, C, G, J} is a smaller MIS.
Connected Dominating Set (CDS)
A dominating set of a graph (N, L) is a set N′ ⊆ N such that each node in N \ N′ has an edge to some node in N′.
A connected dominating set (CDS) of (N, L) is a dominating set N′ such that the subgraph induced by the nodes in N′ is connected.
NP-complete problems:
  - Finding the minimum connected dominating set (MCDS)
  - Determining whether there exists a dominating set of size k < |N|
Polynomial-time heuristics, measured using the approximation factor or the stretch factor:
  - Create a spanning tree; delete the edges to the leaves
  - Create an MIS; add edges to create a CDS
Application: backbone for broadcasts
Compact Routing Tables (1)
Avoid routing tables of size n - large size, more processing time
Hierarchical routing - hierarchical clustered network, e.g., IPv4
Tree labeling schemes
  - Logical tree topology used for routing
  - Node labels are assigned so that the destinations reachable via a link are labeled by a contiguous address range [x, y], which labels that link
  - Small tables, but traffic imbalance
Figure 5.13: Tree-label-based routing tables. Tree edge labels are in rectangles; non-tree edges are shown as dashed lines.
Compact Routing Tables (2)
Interval routing:
  - Node labeling: B is a 1:1 mapping on N.
  - Edge labeling: I labels each edge in L by some subset of the node labels B(N) such that, for any node x,
    - all destinations are covered ( ∪_{y ∈ Neighbours} I(x, y) ∪ B(x) = N ), and
    - there is no duplication of coverage ( I(x, w) ∩ I(x, y) = ∅ for w, y ∈ Neighbours ).
  - For any s, t, there exists a path ⟨ s = x_0, x_1, ..., x_{k−1}, x_k = t ⟩ where B(t) ∈ I(x_{i−1}, x_i) for each i ∈ [1, k].
  - Interval labeling is possible for every graph!
  - No guarantee on path lengths; not robust to topology changes.
Prefix routing: node and channel labels are drawn from the same domain and viewed as strings
  - To route: use the channel whose label is the longest prefix of the destination's label.
Compact Routing Tables (3)
Stretch factor of a routing scheme r:
    max_{i,j ∈ N} { distance_r(i, j) / distance_opt(i, j) }
Designing compact routing schemes is rich in graph algorithmic problems:
  - Identify and prove bounds on the efficiency of routes
  - Different specialized topologies (e.g., grid, ring, tree) offer scope for easier results
Leader Election
Definition: all processes agree on a common distinguished process (the leader)
Distributed algorithms are not completely symmetrical; they need an initiator or finisher process, e.g., an MST for BC and CvgC to compute a global function
LeLann-Chang-Roberts (LCR) algorithm:
  - Asynchronous unidirectional ring
  - All processes have unique IDs
  - Processes circulate their IDs; the highest ID wins
  - Despite obvious optimizations, message complexity is n·(n − 1)/2; time complexity is O(n).
A deterministic leader election algorithm cannot exist for anonymous rings
Algorithms may be uniform
Leader Election - LCR algorithm: Code
(variables)
boolean participate ← false      // becomes true when Pi participates in the leader election
(message types)
PROBE(integer)                   // contains a node identifier
SELECTED(integer)                // announces the result
(1) When a process wakes up to participate in leader election:
(1a) send PROBE(i) to the right neighbor;
(1b) participate ← true.
(2) When a PROBE(k) message arrives from the left neighbor Pj:
(2a) if participate = false then execute step (1) first.
(2b) if i > k then
(2c)   discard the probe;
(2d) else if i < k then
(2e)   forward PROBE(k) to the right neighbor;
(2f) else if i = k then
(2g)   declare i is the leader;
(2h)   circulate SELECTED(i) to the right neighbor.
(3) When a SELECTED(x) message arrives from the left neighbor:
(3a) if x ≠ i then
(3b)   note x as the leader and forward the message to the right neighbor;
(3c) else do not forward the SELECTED message.
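A Python simulation of LCR on a unidirectional ring; the ring of ids and the single FIFO message queue are illustrative assumptions:

from collections import deque

def lcr_leader_election(ring_ids):
    # Every process sends PROBE(id) to its right neighbor; larger ids swallow
    # smaller ones; the id that returns to its originator wins.
    n = len(ring_ids)
    queue = deque(("PROBE", ring_ids[i], (i + 1) % n) for i in range(n))   # all wake up
    leader = {i: None for i in range(n)}
    while queue:
        kind, k, at = queue.popleft()
        nxt = (at + 1) % n
        if kind == "PROBE":
            if ring_ids[at] < k:                 # (2d): forward the larger id
                queue.append(("PROBE", k, nxt))
            elif ring_ids[at] == k:              # (2f): own id came back -> leader
                leader[at] = ring_ids[at]
                queue.append(("SELECTED", k, nxt))
            # (2b): ring_ids[at] > k -> discard the probe
        else:  # SELECTED
            if ring_ids[at] != k:                # (3a): note the leader and forward
                leader[at] = k
                queue.append(("SELECTED", k, nxt))
    return leader

print(lcr_leader_election([3, 7, 2, 9, 5]))   # every position learns that 9 is the leader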
Leader Election: Hirschberg-Sinclair Algorithm
Binary search in both directions on the ring; token-based
In each round k, each active process does the following:
  - circulates its token to the 2^k neighbors on both sides
  - P_i remains a leader candidate after round k iff i is the highest ID among the 2^k neighbors in both directions
After round k, any pair of surviving leaders are at least 2^k apart
The number of leaders diminishes as n/2^k
Only the winners (leaders) of a round proceed to the next round.
In each round, at most O(n) messages are sent, using suppression as in LCR
log n rounds
Message complexity: O(n log n) (formulate the exact expression!)
Time complexity: O(n).
Object Replication Problems
Weighted graph (N, L); k users at nodes N_k ⊆ N; r replicas of an object at nodes N_r ⊆ N.
What is the optimal placement of the replicas if k > r and accesses are read-only?
  - Evaluate all choices for N_r to identify min_{N_r} ( Σ_{i ∈ N_k} dist_{i,r_i} ), where dist_{i,r_i} is the cost from node i to r_i, the replica nearest to i.
If the Read accesses from each user in N_k have a certain frequency (or weight), the minimization function changes.
One can also account for the bandwidth of each edge.
Assume each user access is a Read with probability x and an Update with probability 1 − x. An Update requires all replicas to be updated.
  - What is the optimal placement of the replicas if k > r?
Adaptive Data Replication: Problem Formulation
Network (V, E). Assume a single replicated object.
Replication scheme: a subset R of V such that each node in R has a replica.
r_i, w_i: rates of reads and writes issued by node i
c_r(i), c_w(i): cost of a read and of a write issued by node i
ℛ: the set of all possible replication schemes
Goal: minimize the cost of the replication scheme:
    min_{R ∈ ℛ} [ Σ_{i ∈ V} r_i · c_r(i) + Σ_{i ∈ V} w_i · c_w(i) ]
For an arbitrary graph, minimizing this cost is NP-complete
Hence, assume a tree overlay
Assume one-copy serializability, implemented by the Read-One-Write-All (ROWA) policy.
Adaptive Data Replication over Tree Overlay
All communication, and the set R, are on the tree overlay T
R is an amoeba-like subgraph that moves toward the center of gravity of the activity:
  - it expands when the Read cost is higher
  - it shrinks when the Write cost is higher
  - the equilibrium-state R is optimal; it converges within d + 1 steps once the Read-Write pattern stabilizes
Dynamic activity: the algorithm is re-executed in epochs
Read: from the closest replica, along T. Use parent pointers.
Write: to the closest replica, along T; then propagate within R.
Use R_neighbour, the set of neighbours that are in R.
Implementation: each node tracks (i) whether it is in R, (ii) R_neighbour, (iii) parent.
Adaptive Data Replication: Convergence (1)
Figure 5.14: Nodes in the ellipse belong to R. C is R-fringe; A and E are both R-fringe and R-neighbour; D is R-neighbour.
R-neighbour: i ∈ R, and i has at least one neighbour j ∉ R.
R-fringe: i ∈ R, and i has only one neighbour j ∈ R. Thus, i is a leaf in the subgraph of T induced by R, and j is the parent of i.
singleton: |R| = 1 and i ∈ R.
Adaptive Data Replication: Tests
Tests are run at the end of each epoch.
Expansion test: an R-neighbour node i includes a neighbour j ∉ R in R if r > w (the reads coming from the direction of j exceed the writes coming from everywhere else).
Contraction test: an R-fringe node i excludes itself from R if w > r (the writes coming from its only R-neighbour j exceed the reads coming from everywhere else). Before exiting, i seeks permission from j, to avoid R = ∅.
Switch test: a singleton node i transfers its replica to j if the r + w traffic being forwarded by j is greater than the r + w traffic that node i receives from all other nodes.
An R-neighbour may also be R-fringe or singleton. In either case, the expansion test is executed first; if it fails, the contraction test or the switch test (respectively) is executed.
Figure 5.15: (a) Expansion test. (b) Contraction test. (c) Switch test.
Adaptive Data Replication: Code (1)
(variables)
array of integer Neighbours[1..b_i]       // the b_i neighbours in the tree T topology
array of integer Read_Received[1..b_i]    // the jth element gives the # reads received from Neighbours[j]
array of integer Write_Received[1..b_i]   // the jth element gives the # writes received from Neighbours[j]
integer write_i, read_i                   // # writes and # reads issued locally
boolean success
(1) Pi determines which tests to execute at the end of each epoch:
(1a) if i is R-neighbour and R-fringe then
(1b)   if the expansion test fails then
(1c)     contraction test
(1d) else if i is R-neighbour and singleton then
(1e)   if the expansion test fails then
(1f)     switch test
(1g) else if i is R-neighbour and not R-fringe and not singleton then
(1h)   expansion test
(1i) else if i is not R-neighbour and is R-fringe then
(1j)   contraction test.
(2) Pi executes the expansion test:
(2a) for j from 1 to b_i do
(2b)   if Neighbours[j] is not in R then
(2c)     if Read_Received[j] > (write_i + Σ_{k=1..b_i, k≠j} Write_Received[k]) then
(2d)       send a copy of the object to Neighbours[j]; success ← 1;
(2e) return(success).
Adaptive Data Replication: Code (2)
(variables)
array of integer Neighbours[1..b_i]       // the b_i neighbours in the tree T topology
array of integer Read_Received[1..b_i]    // the jth element gives the # reads received from Neighbours[j]
array of integer Write_Received[1..b_i]   // the jth element gives the # writes received from Neighbours[j]
integer write_i, read_i                   // # writes and # reads issued locally
boolean success
(3) Pi executes the contraction test:
(3a) let Neighbours[j] be the only neighbour in R;
(3b) if Write_Received[j] > (read_i + Σ_{k=1..b_i, k≠j} Read_Received[k]) then
(3c)   seek permission from Neighbours[j] to exit from R;
(3d)   if permission is received then
(3e)     success ← 1; inform all neighbours;
(3f) return(success).
(4) Pi executes the switch test:
(4a) for j from 1 to b_i do
(4b)   if (Read_Received[j] + Write_Received[j]) >
         [ Σ_{k=1..b_i, k≠j} (Read_Received[k] + Write_Received[k]) + read_i + write_i ] then
(4c)     transfer the object copy to Neighbours[j]; success ← 1; inform all neighbours;
(4d) return(success).