Cristian's algorithm: clock sync
Round-trip delay D = Dreq + Dresp = (T4 - T1) - (T3 - T2)
Approximate Dreq ≈ Dresp, so Dresp = D/2
Client sets local clock to T3 + D/2
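A minimal sketch of one sync round, assuming a hypothetical request_server_time() RPC that returns the server-side timestamps (T2, T3):

import time

def cristian_sync(request_server_time):
    t1 = time.time()                 # T1: client send time
    t2, t3 = request_server_time()   # T2: server receive, T3: server reply
    t4 = time.time()                 # T4: client receive time
    d = (t4 - t1) - (t3 - t2)        # round-trip delay D
    # assuming Dreq ≈ Dresp, the server's clock now reads ≈ T3 + D/2
    return (t3 + d / 2) - t4         # offset to add to the local clock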
Adjusting clock time
C(t) = a*H(t) + b, choosing a and b so the adjusted clock C stays monotonically increasing (skew is corrected gradually rather than by jumping the clock)
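For illustration, one way to pick a and b (names hypothetical): amortize a measured offset delta over the next window seconds of hardware time, keeping C continuous and monotonic as long as a > 0:

def amortized_adjustment(c0, h0, delta, window):
    # returns (a, b) for C(t) = a*H(t) + b such that C(h0) = c0
    # and C(h0 + window) = c0 + window + delta
    a = 1 + delta / window   # run slightly fast (delta > 0) or slow (delta < 0)
    b = c0 - a * h0          # continuity at the moment of adjustment
    return a, b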
Broadcast:
FIFO broadcast: If m1 and m2 are broadcast by the same node, and broadcast(m1) → broadcast(m2), then m1 must be delivered before m2
Causal broadcast: If broadcast(m1) → broadcast(m2), then m1 must be delivered before m2
Total order broadcast: If m1 is delivered before m2 on one node, then m1 must be delivered before m2 on all nodes
Broadcast Methods (Reliable):
Eager reliable broadcast: every node re-broadcasts each message it receives for the first time; O(N^2) messages
Gossip protocol: when a node receives a message for the first time, it forwards it to a fixed number of random nodes. Pros: efficient, resilient to loss, crashes. Cons: only guarantees reliable delivery with high probability
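A gossip-forwarding sketch; the peer list and send(peer, msg) transport are assumed placeholders:

import random

FANOUT = 3                              # fixed number of random targets

class GossipNode:
    def __init__(self, peers, send):
        self.peers = peers              # ids of the other nodes
        self.send = send                # assumed transport function
        self.seen = set()               # message ids received so far

    def on_receive(self, msg_id, payload):
        if msg_id in self.seen:         # forward only on first receipt
            return
        self.seen.add(msg_id)
        for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
            self.send(peer, (msg_id, payload))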
Data Consistency
get() expects the value written by the most recent put()
Linearizability conditions: 1. Operations appear to execute in a total order (clients see the same order of writes) 2. The total order maintains the real-time order between operations • If operation A completes before operation B begins in real time, then A must be ordered before B • If neither A nor B completes before the other begins, there is no real-time order, but there must still be some total order (clients read the latest data; once a read returns a value, all later reads return that value)
How to ensure exactly-once semantics for linearizability: • Perform duplicate detection • Handle server crashes, or • Use a fault-tolerant service
Logical Clock
1. Internal event: increment the local clock: C(i) = C(i) + 1
2. Sending a message: the sender increments its clock before sending: C(i) = C(i) + 1, then attaches C(i) to the message
3. Receiving a message: C(i) = max(C(j), C(i)) + 1, where C(j) is the timestamp from the received message
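These three rules as a small class (sketch):

class LamportClock:
    def __init__(self):
        self.c = 0

    def internal_event(self):        # rule 1
        self.c += 1

    def on_send(self):               # rule 2: increment, then attach to message
        self.c += 1
        return self.c

    def on_receive(self, msg_ts):    # rule 3: max of local and message, plus 1
        self.c = max(msg_ts, self.c) + 1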
Vector Clock
1. For a local event: VC[i] = VC[i] + 1
2. When process i sends a message: VC[i] = VC[i] + 1, send (VC, m)
3. When process i receives a message (VCm, m): VC[j] = max(VC[j], VCm[j]) for each j ∈ {1, …, n}, then VC[i] = VC[i] + 1
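The same rules as a sketch for process index i of n:

class VectorClock:
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def local_event(self):           # rule 1
        self.vc[self.i] += 1

    def on_send(self):               # rule 2: increment, then ship a copy
        self.vc[self.i] += 1
        return list(self.vc)

    def on_receive(self, vcm):       # rule 3: element-wise max, then increment
        self.vc = [max(a, b) for a, b in zip(self.vc, vcm)]
        self.vc[self.i] += 1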
Compare clocks
For Logical Clock (total order): a ⇒ b iff (C(a) < C(b)) or ((C(a) = C(b)) and (i < j))
For Vector Clock: Same: V(a) = V(b) when ak = bk for all k | a → b: V(a) < V(b) when ak ≤ bk for all k and V(a) ≠ V(b) | Concurrent: V(a) ∥ V(b) when ai < bi and aj > bj for some i and j
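The vector-clock comparison as a helper:

def compare(va, vb):
    # returns '=', '<' (a → b), '>' (b → a), or '||' (concurrent)
    if va == vb:
        return '='
    if all(a <= b for a, b in zip(va, vb)):
        return '<'
    if all(a >= b for a, b in zip(va, vb)):
        return '>'
    return '||'      # some ai < bi and some aj > bj

e.g. compare([1, 2, 0], [2, 2, 0]) returns '<', while compare([1, 0], [0, 1]) returns '||'.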
Crash recovery:
Shadow copy: • Pre-commit: create a complete working copy, make changes to the working copy • Commit point: atomically exchange the working copy with the original via a lower-level atomic method, e.g., rename • Post-commit: release the space occupied by the original copy • Recovery: a crash leaves either the complete old copy or the complete new copy, so no repair is needed
Total order broadcast: single leader:
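A shadow-copy sketch over a JSON file; os.replace is the atomic rename:

import json, os, tempfile

def shadow_copy_update(path, mutate):
    with open(path) as f:
        data = json.load(f)                 # pre-commit: load ...
    mutate(data)                            # ... and edit a working copy
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, 'w') as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())                # make the working copy durable
    os.replace(tmp, path)                   # commit point: atomic exchange
    # post-commit: the old copy's space is freed by the rename;
    # recovery is a no-op: the file is either entirely old or entirely new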
To broadcast m, a node sends m to the leader; the leader broadcasts it via FIFO broadcast. Assumption: the leader does not crash
Total order broadcast: logical clocks:
When a node broadcasts a message: • Attach a logical clock timestamp • Send the message via reliable broadcast
When a node receives a message: • Buffer messages in total order of timestamps • Suppose the earliest message in the buffer has timestamp T • Deliver it when we have seen all messages with timestamp < T
Assumption: nodes do not crash
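A delivery-buffer sketch of this rule from one node's point of view; with FIFO channels, once every node has sent something timestamped ≥ T, no message with timestamp < T can still arrive:

import heapq

class TotalOrderDelivery:
    def __init__(self, nodes):
        self.buffer = []                      # min-heap of (timestamp, sender, msg)
        self.latest = {n: 0 for n in nodes}   # highest timestamp seen per sender

    def on_receive(self, ts, sender, msg):
        heapq.heappush(self.buffer, (ts, sender, msg))
        self.latest[sender] = max(self.latest[sender], ts)
        delivered = []
        while self.buffer and min(self.latest.values()) >= self.buffer[0][0]:
            delivered.append(heapq.heappop(self.buffer)[2])
        return delivered                      # now safe to deliver, in timestamp order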
Write-Ahead Logging
Undo-Logging:
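A minimal undo-logging sketch (all names hypothetical): log the old value before each in-place update, mark the commit point with a COMMIT record, and on recovery undo every transaction that never committed:

class UndoLog:
    def __init__(self):
        self.log = []                # append-only; assume appends are durable
        self.store = {}              # data updated in place

    def write(self, txid, key, new_value):
        self.log.append(('UNDO', txid, key, self.store.get(key)))  # old value first
        self.store[key] = new_value                                # then update in place

    def commit(self, txid):
        self.log.append(('COMMIT', txid))

    def recover(self):
        committed = {rec[1] for rec in self.log if rec[0] == 'COMMIT'}
        for rec in reversed(self.log):                   # undo newest-first
            if rec[0] == 'UNDO' and rec[1] not in committed:
                _, _, key, old = rec
                if old is None:
                    self.store.pop(key, None)
                else:
                    self.store[key] = old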
Quorum-based replication
Choose R + W > N. Typically, a majority quorum is used: R = W = (N+1)/2
Read repair • After get() returns, it issues a put() with the latest value to all replicas that responded with a stale value or did not respond
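A quorum-read sketch with read repair, assuming each replica exposes get(key) -> (value, ts) and put(key, value, ts):

def quorum_get(replicas, key, r):
    answered = {}                              # replica -> (value, ts)
    for rep in replicas:
        try:
            answered[rep] = rep.get(key)
        except ConnectionError:
            continue                           # skip unresponsive replicas
        if len(answered) == r:
            break                              # R responses are enough
    value, ts = max(answered.values(), key=lambda vt: vt[1])
    for rep in replicas:                       # read repair: push the latest value
        if rep not in answered or answered[rep][1] < ts:
            try:
                rep.put(key, value, ts)
            except ConnectionError:
                pass
    return value, ts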
Broadcast-based replication
Primary-backup replication • One primary, the others backups • Primary receives and executes operations • Replicates the updated state to backups (passive replication) • Primary waits for acks from all backups, then responds • Can have n-1 replicas fail
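A primary-side sketch (apply_op and the backup methods are assumed placeholders):

def primary_handle(op, state, backups):
    result = apply_op(state, op)      # only the primary executes the operation
    for b in backups:
        b.replicate(state)            # passive replication: ship updated state
    for b in backups:
        b.wait_ack()                  # respond only after ALL backups ack
    return result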
State machine replication (SMR) • Symmetric replicas • Any replica receives and replicates operations • All replicas execute operations (active replication) • Fault tolerance based on a consensus algorithm; can have (n-1)/2 fail
Requirements
• Initial state: start in the same state • Determinism: receiving the same input in the same state produces the same output and resulting state • Agreement: all replicas process inputs in the same sequence
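A toy deterministic state machine satisfying these requirements, here a lock service (as in the log discussion below); every replica starts empty and applies the same committed sequence:

class LockStateMachine:
    def __init__(self):
        self.locks = {}              # identical initial state on every replica

    def apply(self, op):
        # deterministic: output depends only on current state + input
        kind, name, client = op
        if kind == 'acquire' and self.locks.get(name) is None:
            self.locks[name] = client
            return True
        if kind == 'release' and self.locks.get(name) == client:
            self.locks[name] = None
            return True
        return False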
Benefit of Log: • Service keeps only the current state (e.g., of each lock); the log records the operations • Log allows the leader to order the operations • Log allows storing both tentative and committed operations • Replicas only deliver committed operations to the service • Log allows handling failures (leader can resend)
Log synchronization:
Leader forces followers to have the same log as its own
Restrictions on Election: Replicas vote for a candidate only if its log is at least as up to date: • Candidate has a higher term in its last log entry, or • Candidate has the same last term and a same-length or longer log
When does the Leader Commit: when the entry is stored durably on a majority
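The election restriction as a predicate:

def candidate_up_to_date(cand_last_term, cand_log_len, my_last_term, my_log_len):
    # vote only if the candidate's log is at least as up to date as ours
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term    # higher last term wins
    return cand_log_len >= my_log_len           # same term: same/longer log wins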
Log Compaction
Snapshot (checkpoint) + discard the log up to the snapshot's log index
Snapshot RPC: if the leader compacts its log while a follower is offline, the follower's log may end before the start of the leader's log; the leader sends the snapshot plus its log
Client Interaction
Problem: Suppose the leader executes a client operation, then crashes before sending the response to the client • Client retries the same operation with another leader • Operation is executed twice
Ensuring exactly-once: • State machine performs duplicate detection • Keeps [client -> (request ID, response)]; the state machine checks this table and returns the saved response (without re-executing)
Leader for read ops: • Leader sends heartbeat messages to followers • Waits for a majority to know it is still the current leader • Responds to the read-only operation (no logging needed)
Storage API • put(key, value, T), (value, T) = get(key), del(key, T) // T: timestamp attached to each write
Concurrent writes
Method 1: Use a total order timestamp, e.g., logical timestamp • v2 replaces v1 if T2 > T1 • Last writer wins, can lose data
Method 2: Use a partial order timestamp, e.g., vector timestamp • v2 replaces v1 if T2 > T1; preserve both {v1, v2} if T1 ∥ T2 • Complicated scheme; vector timestamps can become large