Process groups and message ordering
If processes belong to groups, certain algorithms can be used that depend on group properties
membership
create ( name ), kill ( name )
join ( name, process ), leave ( name, process )
internal structure?
NO ( peer structure ): failure tolerant, complex protocols
YES ( a single coordinator and point of failure ): simpler protocols
e.g. all join requests must go to the coordinator, so concurrent joins are avoided
closed or open?
OPEN: a non-member can send a message to the group
CLOSED: only members can send to the group
failures?
a failed process leaves the group without executing leave
robustness
leave, join and failures happen during normal operation, so algorithms must be robust
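The membership operations and the open/closed distinction above can be sketched as a small in-memory registry. This is only an illustration: the class names, the `closed` flag, and the `fail` and `may_send` methods are hypothetical and not part of any real group communication library.

```python
class ProcessGroup:
    """A named process group with an open or closed sending policy (hypothetical)."""

    def __init__(self, name, closed=True):
        self.name = name
        self.closed = closed
        self.members = set()

    def join(self, process):
        self.members.add(process)

    def leave(self, process):
        # discard() tolerates a process that has already gone
        self.members.discard(process)

    def fail(self, process):
        # A failed process never executes leave, so the group must drop it itself.
        self.members.discard(process)

    def may_send(self, process):
        # OPEN: anyone may send. CLOSED: only members may send.
        return (not self.closed) or process in self.members


class GroupRegistry:
    """Creates and kills groups by name (hypothetical)."""

    def __init__(self):
        self.groups = {}

    def create(self, name, closed=True):
        self.groups[name] = ProcessGroup(name, closed)
        return self.groups[name]

    def kill(self, name):
        self.groups.pop(name, None)
```

A peer-structured group would have to replicate this state at every member; the single-object form here corresponds to the coordinator case.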
Message ordering Process groups
Message delivery for a process group - assumptions
ASSUMPTIONS
messages are multicast to named process groups
reliable channels: a given message is delivered reliably to all members of the group
FIFO from a given source to a given destination
processes don't crash (failure and restart not considered)
processes behave as specified, e.g. send the same values to all processes
- we are not considering so-called Byzantine behaviour
(when malicious or erroneous processes do not behave according to their specifications;
see Lamport's Byzantine Generals problem).
Ordering message delivery
application process
may specify delivery order to message service,
e.g. arrival (no) order, total order, causal order
message service
may reorder delivery to application (on request for some order) by buffering messages
OS comms. interface
assume FIFO from each source at this level (done by lower levels)
total order = every process receives all messages in the same order (including its own).
We first consider causal order
Message delivery - causal order
First, define causal order in terms of one-to-one messages; later, multicast to a process group
[Figure: time lines for application processes P1, P2 and P3, with messages m and m']
P1 sent message m provably before P2 sent message m'
The above diagram shows a violation of causal delivery order
Causal delivery order requires that, at P3, m is delivered before m'
The definition relates to POTENTIAL causality, not application semantics
DEFINITION of causal delivery order (where < means happened before)
send_i ( m ) < send_j ( m' ) => deliver_k ( m ) < deliver_k ( m' )
Message delivery - causal order for a process group
If we know that all processes in a group receive all messages, the message delivery
service can implement causal delivery order (for total order, see later)
[Figure: time lines for application processes P1, P2 and P3; each application process sits above a message service and the OS comms. interface]
the message service can postpone the delivery of messages to the application process
Message delivery - causal order using Vector Clocks
A vector clock is maintained by the message service at each node for each process:
[Figure: time lines for application processes P1, P2 and P3 with vector clock values. P1: 1,0,0 then 2,1,0. P2: 0,1,0 then 1,2,0. P3: 1,0,1 then 1,1,2]
vector notation:
- fixed number of processes N
- each process's message service keeps a vector of dimension N
- for each process, each entry records the most up-to-date value of the state counter
delivered to the application process, for the process at that position
Vector Clocks - message service operation
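[Figure: time lines for application processes P1, P2 and P3 with vector clock values. P1: 1,0,0 then 2,1,0 then 3,3,0 then 4,3,4. P2: 0,1,0 then 1,2,0 then 1,3,0 then 1,4,4. P3: 1,0,1 then 1,1,2 then 1,3,3 then 1,3,4]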
Message service operation:
before send: increment local process state value in local vector
on send: timestamp message with sending process's local vector
on receive by message service: see below
on deliver to receiving application process: increment receiving process's state value
in its local vector and update the other fields of the vector by comparing its values
with the incoming vector (timestamp) and recording the higher value in each field,
thus updating this process's knowledge of system state
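The send and deliver steps above can be sketched as two small functions. This is a minimal sketch, assuming vectors are Python lists and processes are numbered from 0 (so P1 is index 0); the function names are hypothetical.

```python
def before_send(vector, me):
    """Increment the sender's own entry; the result is also the message timestamp."""
    stamped = list(vector)
    stamped[me] += 1
    return stamped


def on_deliver(vector, me, timestamp):
    """Increment the receiver's own entry and keep the higher value in every other field."""
    merged = [max(local, seen) for local, seen in zip(vector, timestamp)]
    merged[me] = vector[me] + 1
    return merged
```

For example, `on_deliver([0, 1, 0], 1, [1, 0, 0])` gives `[1, 2, 0]`: P2 at 0,1,0 delivers P1's message stamped 1,0,0.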
Implementing causal order using Vector Clocks
[Figure: time lines for application processes P1, P2 and P3. P1: 1,0,0 then 2,2,0. P2: 1,1,0 then 1,2,0. P3's message service (marked ?) holds 0,0,0]
P3's vector is at (0,0,0) and a message with timestamp (1,2,0) arrives from P2,
i.e. P2 has received a message from P1 that P3 hasn't seen.
More detail of P3's message service:
receiver vector   sender   sender vector   decision   new receiver vector
0,0,0             P2       1,2,0           buffer     0,0,0
    P3 is missing a message from P1 that sender P2 has already received
0,0,0             P1       1,0,0           deliver    1,0,1
1,0,1             P2       1,2,0           deliver    1,2,2
In each case: do the sender and receiver agree on the state of all other processes?
If the sender has a higher state value for any of these others, the receiver is missing a
message, so buffer the message.
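The buffer-or-deliver question can be written as a single comparison over the entries for processes other than the sender and the receiver. This is a sketch under the slide's assumptions (FIFO from each source is handled by lower levels); the function name and the 0-based process indices are assumptions.

```python
def can_deliver(receiver_vector, timestamp, sender, receiver):
    """True if the sender has no higher state value than the receiver
    for any process other than the sender and the receiver themselves."""
    return all(
        timestamp[k] <= receiver_vector[k]
        for k in range(len(timestamp))
        if k not in (sender, receiver)
    )
```

With P1, P2, P3 as indices 0, 1, 2, `can_deliver([0, 0, 0], [1, 2, 0], 1, 2)` is false (buffer), while `can_deliver([1, 0, 1], [1, 2, 0], 1, 2)` is true (deliver).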
Vector Clocks - example
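[Figure: time lines for application processes P1, P2, P3 and P4 with vector clock values. P1: 1,0,0,0 then 2,0,2,0. P2: message service marked ?. P3: 1,0,1,0 then 1,0,2,0. P4: 1,0,0,1 then 1,0,2,2]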
sender  sender vector  receiver  receiver vector  decision  new receiver vector
P3      1,0,2,0        P1        1,0,0,0          deliver   P1 -> 2,0,2,0   (same states for P2 and P4)
P3      1,0,2,0        P4        1,0,0,1          deliver   P4 -> 1,0,2,2   (same states for P1 and P2)
P3      1,0,2,0        P2        0,0,0,0          buffer    P2 -> 0,0,0,0   (same state for P4, different for P1 - missing message)
P1      1,0,0,0        P2        0,0,0,0          deliver   P2 -> 1,1,0,0
reconsider buffered message:
P3      1,0,2,0        P2        1,1,0,0          deliver   P2 -> 1,2,2,0   (same states for P1 and P4)
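The P2 rows of this example (buffer, deliver, then reconsider the buffered message) can be replayed with a small message service that holds back undeliverable messages. This is a sketch only: the class and method names are hypothetical, processes are indexed from 0, and FIFO from each source is assumed to be provided by lower levels.

```python
class MessageService:
    """Per-process message service with a hold-back buffer (hypothetical sketch)."""

    def __init__(self, me, n):
        self.me = me
        self.vector = [0] * n
        self.buffer = []      # held-back (sender, timestamp, payload) messages
        self.delivered = []   # payloads in the order handed to the application

    def _can_deliver(self, sender, timestamp):
        # Compare only the entries for processes other than sender and receiver.
        return all(timestamp[k] <= self.vector[k]
                   for k in range(len(timestamp))
                   if k not in (sender, self.me))

    def _deliver(self, sender, timestamp, payload):
        merged = [max(a, b) for a, b in zip(self.vector, timestamp)]
        merged[self.me] = self.vector[self.me] + 1
        self.vector = merged
        self.delivered.append(payload)

    def receive(self, sender, timestamp, payload):
        self.buffer.append((sender, timestamp, payload))
        progress = True
        while progress:       # reconsider buffered messages after every delivery
            progress = False
            for msg in list(self.buffer):
                if self._can_deliver(msg[0], msg[1]):
                    self.buffer.remove(msg)
                    self._deliver(*msg)
                    progress = True
```

Feeding P2 (index 1) the message from P3 stamped 1,0,2,0 and then the message from P1 stamped 1,0,0,0 leaves P2's vector at 1,2,2,0, with P1's message delivered first.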
Total order is not enforced by the vector clocks algorithm
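[Figure: time lines for application processes P1, P2, P3 and P4 with messages m1 (from P1), m2 (from P2) and m3 (from P3). P1: 1,0,0,0 then 2,2,0,0 then 3,2,2,0. P2: 1,1,0,0 then 1,2,0,0 then 1,3,2,0. P3: 1,0,1,0 then 1,0,2,0 then 1,2,3,0. P4: 1,0,0,1 then 1,0,2,2 then 1,2,2,3]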
m2 and m3 are not causally related
P1 receives m1, m2, m3
P2 receives m1, m2, m3
P3 receives m1, m3, m2
P4 receives m1, m3, m2
If the application requires total order this could be enforced by modifying
the vector clock algorithm to include ACKs and delivery to self.
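That m2 and m3 are not causally related can be checked by comparing their timestamps field by field. The sketch below assumes the timestamps read from the figure (m1 = 1,0,0,0, m2 = 1,2,0,0, m3 = 1,0,2,0); the function names are hypothetical.

```python
def happened_before(a, b):
    """True if timestamp a is causally before timestamp b:
    no field of a exceeds b, and the two timestamps differ."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """True if neither timestamp is causally before the other."""
    return not happened_before(a, b) and not happened_before(b, a)


m1 = (1, 0, 0, 0)
m2 = (1, 2, 0, 0)   # assumed from the figure: sent by P2 after it delivered m1
m3 = (1, 0, 2, 0)   # assumed from the figure: sent by P3 after it delivered m1
```

Since `concurrent(m2, m3)` holds, causal order places no constraint on their relative delivery, which is why different processes may see them in different orders.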
Totally ordered multicast
The vectors can be a large overhead on message transmission and a simpler algorithm
can be used if only total order is required.
Recall the ASSUMPTIONS
messages are multicast to named process groups
reliable channels: a given message is delivered reliably to all members of the group
FIFO from a given source to a given destination
processes don't crash (failure and restart not considered)
no Byzantine behaviour
total order algorithm
- sender multicasts to all including itself
- all acknowledge receipt as a multicast message
- message is delivered in timestamp order after all ACKs have been received
If the delivery system must support both, so that applications can choose,
vector clocks can achieve both causal and total ordering.
Total ordered multicast - outline of approach
[Figure: time lines for application processes P1, P2 and P3, showing multicast messages and their ACKs]
P1 increments its clock to 1 and multicasts a message with timestamp 1
All delivery systems collect the message, multicast ACK and collect all ACKs
- no contention, so deliver message to application processes
and increment local clocks to 2.
P2 and P3 both multicast messages with timestamp 3
All delivery systems collect messages, multicast ACKs and collect ACKs
- contention, so use a tie-breaker (e.g. lowest process ID wins)
and deliver P2's message before P3's
This is just a sketch of an approach. In practice, timeouts would have to be used to take
account of long delays due to congestion and/or failure of components and/or
communication links
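The outline above can be simulated in memory. This is a simplified sketch, not a full protocol: timestamps are supplied by the caller rather than generated by local clocks, every message and ACK is assumed to reach every member within one round, and there are no timeouts. The class and function names are hypothetical.

```python
class Member:
    """Delivery system of one group member (hypothetical sketch)."""

    def __init__(self, pid, group_size):
        self.pid = pid
        self.group_size = group_size
        self.pending = {}     # (timestamp, sender) -> payload
        self.acks = {}        # (timestamp, sender) -> set of acknowledging pids
        self.delivered = []

    def on_message(self, timestamp, sender, payload):
        self.pending[(timestamp, sender)] = payload

    def on_ack(self, timestamp, sender, acker):
        self.acks.setdefault((timestamp, sender), set()).add(acker)

    def try_deliver(self):
        # Deliver in (timestamp, sender) order; equal timestamps are broken
        # by the lowest process ID. Stop at the first message still missing ACKs.
        while self.pending:
            head = min(self.pending)
            if len(self.acks.get(head, ())) < self.group_size:
                break
            self.delivered.append(self.pending.pop(head))


def multicast_round(members, messages):
    """One round: each (timestamp, sender, payload) goes to all members,
    including the sender, and every member's ACK goes to all members."""
    for timestamp, sender, payload in messages:
        for m in members:
            m.on_message(timestamp, sender, payload)
    for timestamp, sender, _ in messages:
        for acker in members:
            for m in members:
                m.on_ack(timestamp, sender, acker.pid)
    for m in members:
        m.try_deliver()
```

Running one round with P1's message at timestamp 1, then a round with P2's and P3's messages both at timestamp 3, leaves every member with the same delivery order: P1's message, then P2's, then P3's.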