Lect 2

The document outlines the fundamentals of distributed systems, including the relationship between processes (client-server and peer-to-peer), communication methods, and the challenges posed by variable communication times and differing clock rates. It also discusses types of failures, such as omission and arbitrary failures, and methods for detecting process failures through techniques like periodic pings and heartbeats. Key aspects include the distinction between synchronous and asynchronous systems and the metrics for failure detection.

Uploaded by

meepo.zhang

Distributed Systems

CS425/ECE428
Logistics Related

• Undergraduates switching from T3 to T4

• Please email Heather Mihaly and Elsa Gunter
([email protected], [email protected]) with the
request and your UIN.
Today’s agenda

• System Model
• Chapter 2.4 (except 2.4.3), parts of Chapter 2.3

• Failure Detection
• Chapter 15.1
What is a distributed system?

Independent components (processes, threads, nodes, ...) that are connected by a network and communicate by passing messages to achieve a common goal, appearing as a single coherent system.
Relationship between processes

• Two main categories:

• Client-server

• Peer-to-peer
Relationship between processes

• Client-server
[Figure: the Client sends a Request to the Server, and the Server returns a Response.]
Clear difference in roles.
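Relationship between processes

• Client-server
[Figure: Client, P, and Server in a chain. 1. Client sends a Request to P. 2. P sends a Request to Server. 3. Server sends a Response to P. 4. P sends a Response to Client.]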
Relationship between processes

• Peer-to-peer
[Figure: three processes, each labeled Peer.]
Similar roles.
Run the same program/algorithm.
Relationship between processes

[Figure: a system that combines Client and Server processes with peer-to-peer relationships.]
Distributed algorithm

• Algorithm on a single process

• Sequence of steps taken to perform a computation.
• Steps are strictly sequential.

• Distributed algorithm
• Steps taken by each of the processes in the system (including
transmission of messages).
• Different processes may execute their steps concurrently.
Key aspects of a distributed system
• Processes must communicate with one another to
coordinate actions. Communication time is variable.

• Different processes (on different computers) have different clocks!

• Processes and communication channels may fail.

How processes communicate

• Directly using network sockets.

• Abstractions such as remote procedure calls, publish-subscribe systems, or distributed shared memory.

• Differ with respect to how the message, the sender, or the receiver is specified.
How processes communicate

[Figure: process p sends message m to process q over a communication channel.]
Communication channel properties

[Figure: p sends message m to q over a communication channel; L marks the latency.]
• Latency (L): Delay between the start of m’s transmission at p and the beginning of its receipt at q. It includes:
• Time taken for a bit to propagate through network links.
• Queuing that happens at intermediate hops.
• Delay in getting to the network.
• Overheads in the operating systems in sending and receiving messages.
• ...
Communication channel properties

[Figure: p sends message m to q; transmitting m takes size(m)/B.]
• Latency (L): Delay between the start of m’s transmission at p and the beginning of its receipt at q.

• Bandwidth (B): Total amount of information that can be transmitted over the channel per unit time.
• Per-channel bandwidth reduces as multiple channels share common network links.
Communication channel properties

• Total time taken to pass a message is governed by latency and bandwidth of the channel.

• Both latency and available bandwidth may vary over time.
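
A minimal sketch of how latency and bandwidth combine, assuming a first-order model in which the time to pass a message is L + size(m)/B and both values stay constant during the transfer. The function name, parameter names, and units are illustrative and not from the lecture.

```python
def transfer_time(latency_s: float, size_bits: float, bandwidth_bps: float) -> float:
    """Approximate time to pass one message: latency plus transmission time.

    Assumes latency and bandwidth do not change while the message is in flight.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + size_bits / bandwidth_bps


# Hypothetical numbers: a 1 MB message (8e6 bits), 100 Mbit/s channel, 20 ms latency.
t = transfer_time(latency_s=0.020, size_bits=8e6, bandwidth_bps=100e6)
print(f"{t:.3f} s")  # prints "0.100 s"
```

For small messages the latency term dominates, while for large messages the size(m)/B term dominates. In a real network both terms vary over time, so the result is only an estimate.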

Differing clocks

• Each computer in a distributed system has its own internal clock.

• Local clocks of different processes show different time values.

• Clocks drift from perfect time at different rates.
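
To make the effect of drift concrete, here is a small sketch assuming a simple linear drift model, in which a clock gains or loses a fixed number of seconds per second of true time. The model, the function name, and the drift rates are assumptions chosen for illustration.

```python
def local_clock(true_time_s: float, drift_rate: float, offset_s: float = 0.0) -> float:
    """Reading of a clock that gains `drift_rate` seconds per true second.

    A negative drift_rate models a clock that runs slow.
    """
    return offset_s + true_time_s * (1.0 + drift_rate)


# Two hypothetical clocks: one gains 10 microseconds per second, the other loses 20.
one_day = 24 * 60 * 60
skew = local_clock(one_day, 10e-6) - local_clock(one_day, -20e-6)
print(f"skew after one day: {skew:.3f} s")  # prints "skew after one day: 2.592 s"
```

Even with small drift rates, two clocks that start in agreement differ by seconds after a day unless they are resynchronized.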

Two ways to model
• Synchronous distributed systems:
• Known upper and lower bounds on time taken by each step in a
process.
• Known bounds on message passing delays.
• Known bounds on clock drift rates.

• Asynchronous distributed systems:

• No bounds on process execution speeds.
• No bounds on message passing delays.
• No bounds on clock drift rates.
Synchronous and Asynchronous

• Most real-world systems are asynchronous.

• Bounds can be estimated, but hard to guarantee.
• Assuming system is synchronous can still be useful.

• Possible to build a synchronous system.

Types of failure
• Omission: when a process or a channel fails to perform
actions that it is supposed to do.
• Process may crash.
How to detect a crashed process?

• Periodic ping: p sends a ping to q, and q replies with an ack.

• Periodic heartbeats: q sends heartbeats to p.
How to detect a crashed process?

• Periodic ping: p sends a ping to q every T seconds, and q replies with an ack.

• If ∆1 time has elapsed after sending a ping and no ack has arrived, report a crash.

• If synchronous, ∆1 = 2 × (max network delay).

• If asynchronous, ∆1 = k × (max observed round trip time).
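
The ping-ack timeout rules can be sketched as follows. This is an illustrative outline rather than a full detector: the function names are hypothetical, time values are plain floats in seconds, and the default k = 2 is an arbitrary choice, since the slides leave k open.

```python
from typing import Sequence


def ping_timeout_sync(max_network_delay: float) -> float:
    """Synchronous model: the ping and the ack each take at most max_network_delay."""
    return 2 * max_network_delay


def ping_timeout_async(observed_rtts: Sequence[float], k: float = 2.0) -> float:
    """Asynchronous model: no true bound exists, so scale the largest RTT seen so far."""
    if not observed_rtts:
        raise ValueError("need at least one observed round trip time")
    return k * max(observed_rtts)


def suspect_crash(ping_sent_at: float, ack_received: bool, now: float, timeout: float) -> bool:
    """Report a crash if `timeout` has elapsed since the ping and no ack has arrived."""
    return (not ack_received) and (now - ping_sent_at >= timeout)


# Hypothetical round trip times observed by p, in seconds.
timeout = ping_timeout_async([0.040, 0.055, 0.120], k=2.0)   # 0.24 s
print(suspect_crash(10.0, False, 10.1, timeout))  # False: still within the timeout
print(suspect_crash(10.0, False, 10.3, timeout))  # True: no ack after 0.24 s
```

In the asynchronous case a larger k makes a false report less likely but delays detection of a real crash.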
How to detect a crashed process?

• Periodic heartbeats: q sends a heartbeat to p every T seconds.

• If (T + ∆2) time has elapsed since the last heartbeat, report a crash.

[Figure: timeline. A heartbeat sent at time t arrives at t + min at the earliest; the next one is sent at t + T and arrives at t + T + max at the latest.]

• If synchronous, ∆2 = max network delay – min network delay.

• If asynchronous, ∆2 = k × (observed delay).
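
A minimal sketch of the heartbeat rule, written from the point of view of p monitoring q. The class and method names are hypothetical, and timestamps are passed in explicitly (rather than read from a real clock) so the logic is easy to follow.

```python
class HeartbeatDetector:
    """Run by p to monitor q, which is expected to send a heartbeat every T seconds."""

    def __init__(self, period: float, slack: float, start_time: float = 0.0) -> None:
        self.period = period              # T
        self.slack = slack                # ∆2
        self.last_heartbeat = start_time  # arrival time of the most recent heartbeat

    def on_heartbeat(self, arrival_time: float) -> None:
        """Record the arrival of a heartbeat from q."""
        self.last_heartbeat = arrival_time

    def suspects_crash(self, now: float) -> bool:
        """True once more than T + ∆2 has elapsed since the last heartbeat."""
        return now - self.last_heartbeat > self.period + self.slack


def slack_sync(min_delay: float, max_delay: float) -> float:
    """Synchronous model: ∆2 = max network delay - min network delay."""
    return max_delay - min_delay


# Hypothetical values: T = 1 s, delays between 10 ms and 50 ms, so ∆2 = 40 ms.
detector = HeartbeatDetector(period=1.0, slack=slack_sync(0.01, 0.05))
detector.on_heartbeat(5.0)
print(detector.suspects_crash(6.0))  # False: only 1.0 s since the last heartbeat
print(detector.suspects_crash(6.1))  # True: 1.1 s exceeds T + ∆2 = 1.04 s
```

The slack max – min covers the case where one heartbeat arrives as early as possible and the next arrives as late as possible, so the gap between arrivals can reach T + max – min even when q is alive.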
Correctness of failure detection

• Completeness
• Every failed process is eventually detected.

• Accuracy
• Every detected failure corresponds to a crashed process
(no mistakes).
Correctness of failure detection
• Characterized by completeness and accuracy.

• Synchronous system
• Failure detection via ping-ack and heartbeat is both
complete and accurate.

• Asynchronous system
• Our strategy for ping-ack and heartbeat is complete.
• Impossible to achieve both completeness and accuracy.
• Can we have an accurate but incomplete algorithm?
• Never report failure.
Metrics for failure detection

• Worst case failure detection time (try deriving these before next class!)
• Ping-ack: T + ∆1 – ∆ (where ∆ is time taken for last ping from p to reach q)
• Heartbeat: ∆ + T + ∆2 (where ∆ is time taken for last message from q to reach p)
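
One way to check a derivation of these expressions is to replay a worst-case timeline step by step. The sketch below assumes q crashes at the most unfavorable moment: right after receiving the last ping for ping-ack (with that ack arriving in time), and right after sending its last heartbeat for heartbeats. The function names and sample values are illustrative.

```python
def ping_ack_worst_case(T: float, delta1: float, delta: float) -> float:
    """Detection time when q crashes right after receiving (and acking) a ping."""
    ping_sent = 0.0
    q_crashes = ping_sent + delta           # the last ping takes ∆ to reach q
    next_ping_sent = ping_sent + T          # p pings again one period later
    crash_reported = next_ping_sent + delta1
    return crash_reported - q_crashes       # equals T + ∆1 - ∆


def heartbeat_worst_case(T: float, delta2: float, delta: float) -> float:
    """Detection time when q crashes right after sending its last heartbeat."""
    q_crashes = 0.0
    last_heartbeat_arrives = q_crashes + delta  # the last heartbeat takes ∆ to reach p
    crash_reported = last_heartbeat_arrives + T + delta2
    return crash_reported - q_crashes           # equals ∆ + T + ∆2


# Hypothetical values in seconds.
print(f"{ping_ack_worst_case(T=1.0, delta1=0.2, delta=0.05):.2f}")    # prints "1.15"
print(f"{heartbeat_worst_case(T=1.0, delta2=0.04, delta=0.05):.2f}")  # prints "1.09"
```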
Metrics for failure detection