Availability, Reliability,
and Fault Tolerance
Guest Lecture for Software Systems Security
Tim Wood
Professor Tim Wood - The George Washington University
Distributed Systems have Problems
• Hardware breaks
• Software is buggy
• People are jerks
• But software is increasingly important
• Runs nuclear reactors
• Controls engines and wings on a passenger jet
• Runs your Facebook wall!
• How can we make these systems more
reliable?
• Particularly large distributed systems
2
Inside a Data Center
• Giant warehouse filled with:
• Racks of servers
• Disk arrays
• Cooling infrastructure
• Power converters
• Backup generators
3
Modular Data Center
• ...or use shipping containers
• Each container filled with
thousands of servers
• Can easily add new
containers
• “Plug and play”
• Just add electricity
• Allows data center to be
easily expanded
• Pre-assembled, cheaper
4
Definitions
• Availability: whether the system is ready to use
at a particular time
• Reliability: whether the system can run
continuously without failure
• Safety: whether a disaster happens if the system
fails to run correctly at some point
• Maintainability: how easily a system can be
repaired after failure
5
Availability and Reliability
• System 1: crashes for 1 millisecond every hour
• Better than 99.9999% availability
• Not very good reliability...
• System 2: never crashes, but has to be shutdown
two weeks a year
• "Perfectly" reliable
• Only 96% availability
Is one more important?
7
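A quick back-of-the-envelope check of the availability figures on the slide above. This is a minimal illustrative sketch (not from the lecture); the 1 ms/hour and two-weeks-per-year numbers are the slide's own:

```python
# Availability = uptime / (uptime + downtime)

# System 1: down 1 millisecond out of every hour
ms_per_hour = 3_600_000
sys1 = (ms_per_hour - 1) / ms_per_hour
print(f"System 1 availability: {sys1:.6%}")   # ~99.99997% -- better than six nines

# System 2: down two weeks out of every year
weeks_per_year = 52
sys2 = (weeks_per_year - 2) / weeks_per_year
print(f"System 2 availability: {sys2:.1%}")   # ~96.2%
```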
Quantifying Reliability
• MTTF: Mean Time To Failure
• The average amount of time until a failure occurs
• MTTR: Mean Time To Repair
• The average amount of time to repair after a failure
• MTBF: Mean Time Between Failures
[Timeline: the system runs for MTTF, fails, takes MTTR to repair, then runs for another MTTF; MTBF = MTTF + MTTR]
8
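The timeline above boils down to MTBF = MTTF + MTTR, and availability can be expressed in the same terms as MTTF / (MTTF + MTTR). A small illustrative sketch (the MTTF and MTTR values below are made-up examples):

```python
def availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

mttf_hours = 1000.0   # example: runs ~1000 hours before failing (assumed value)
mttr_hours = 4.0      # example: ~4 hours to repair (assumed value)
mtbf_hours = mttf_hours + mttr_hours

print(f"MTBF = {mtbf_hours} hours")
print(f"Availability = {availability(mttf_hours, mttr_hours):.3%}")   # ~99.602%
```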
Real Failure Rates
• Standard hard drive MTBF
• 600,000 hours ≈ 68.5 years!
• 1/68.5 ≈ 1.5% chance of failure per year
• A big Google data center:
• Has 200,000+ hard drives
• 200,000 drives / 68.5 years ≈ 2,920 drive crashes per year
• or about 8 disk failures per day
• Failures happen a lot
• Need to design software to be resilient to all types of
hardware failures
• Actual failure rates are closer to 3% per year
9
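The drive-failure arithmetic above as a small sketch (the MTBF and fleet size are the slide's figures; the rest is straightforward arithmetic):

```python
# Expected drive failures for a large fleet, using the slide's numbers
mtbf_hours = 600_000
hours_per_year = 24 * 365                            # 8,760
annual_failure_rate = hours_per_year / mtbf_hours    # ~1.46% per drive per year

drives = 200_000
per_year = drives * annual_failure_rate              # ~2,920 failed drives per year
per_day = per_year / 365                             # ~8 per day

print(f"{annual_failure_rate:.2%} per drive per year")
print(f"{per_year:.0f} drive failures per year, about {per_day:.0f} per day")
```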
Reliability Challenges
• Typical failures in one year of a google data center:
• 1000 individual machine failures
• thousands of hard drive failures
• 1 PDU (Power Distribution Unit) failure (about 500-1000 machines suddenly
disappear, budget 6 hours to come back)
• 1 rack-reorganization (You have plenty of warning: 500-1000 machines powered
down, about 6 hours)
• 1 network rewiring (rolling 5% of machines down over 2-day span)
• 20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• 5 racks go wonky (40-80 machines see 50% packet loss)
• 8 network maintenances (4 might cause ~30-minute random connectivity losses)
• 12 router reloads (takes out DNS and external virtual IP address (VIPS) for a
couple minutes)
• 3 router failures (have to immediately pull traffic for an hour)
• 0.5% overheat (power down most machines in under five minutes, expect 1-2
days to recover)
• dozens of minor 30-second blips for DNS
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
10
Types of Failures
• Systems can fail in different ways
• Crash failure
• Timing failure
• Content failure
• Malicious failure
• Are some easier to deal with than others?
12
Fault Tolerance through Replication
• We can handle failures through redundancy
• Have multiple replicas run the program
• May want to keep them away from each other
• May want to use different hardware platforms
• May want to use different software implementations
13
Fault Detection
• Detecting a fault can be difficult
• Crash failure
• Timing failure
• Content failure
• Malicious failure
• Approaches:
• Heartbeat messages
• Adaptive timeouts
• Voting / Quorums
• Authentication / signatures
14
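One common way to combine heartbeats with adaptive timeouts is sketched below. This is an illustrative design, not something prescribed by the lecture; the class name, smoothing constants, and safety factor are all assumptions:

```python
import time

class HeartbeatMonitor:
    """Suspect a peer has failed if no heartbeat arrives within an adaptive
    timeout (a safety factor times the observed average heartbeat interval)."""

    def __init__(self, safety_factor: float = 4.0):
        self.safety_factor = safety_factor
        self.last_beat = time.monotonic()
        self.avg_interval = 1.0          # initial guess, in seconds

    def on_heartbeat(self) -> None:
        now = time.monotonic()
        interval = now - self.last_beat
        self.last_beat = now
        # exponentially weighted moving average adapts to current network delays
        self.avg_interval = 0.8 * self.avg_interval + 0.2 * interval

    def is_suspected(self) -> bool:
        silence = time.monotonic() - self.last_beat
        return silence > self.safety_factor * self.avg_interval
```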
Detection is Hard
• Or maybe even impossible
• How long should we set a timeout?
• How do we know heartbeat messages will go through?
15
Two Generals Problem
[Figure: the Ninja army and the Pirate army camped on opposite sides of the Enemy]
• The Ninja general and the Pirate general need to
coordinate an attack
• Can (try to) send messengers back and forth
• Messengers can be shot
• How can they guarantee they will both attack at the
same time?
16
Two Generals Problem
• We need to worry about physical characteristics
when we build systems
• Packets can be lost, delayed, reordered
• Disks can be slow, fail, or crash
• Or things can be actively malicious
• Big trouble...
• What kinds of assumptions do we need to make?
• Network is ordered and reliable, but may be slow
• We can quantify the expected number of nodes that will fail
17
Fault Tolerance through Replication
• How to tolerate a crash failure?
• How to tolerate a content failure?
• How many replicas to tolerate f such failures?
18
Fault Tolerance through Replication
• How to tolerate a crash failure?
[Figure: inputs go to replicas P1 and P2; P1 computes 2+2=4 while P2 crashes; the output is 4]
• f+1 replicas suffice: even if f crash, at least one correct replica remains to produce the output
• How to tolerate a content failure?
[Figure: inputs go to replicas P1, P2, P3; P1 and P3 compute 2+2=4, P2 wrongly outputs 5; a majority voter reports 4]
• 2f+1 replicas are needed: the f+1 correct answers outvote the f wrong ones
19
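A minimal sketch of the majority-voter idea for content failures (illustrative only; the voter here is an ordinary function, and the example values mirror the slide):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or None.
    With 2f+1 replicas and at most f content failures, the correct value
    always has a majority."""
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(outputs) // 2 else None

# The slide's example: P1 and P3 compute 2+2=4, P2 wrongly produces 5
print(majority_vote([4, 5, 4]))   # -> 4
```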
Agreement without Voters
• We can't always assume there is a perfectly
correct voter to validate the answers
• Better: Have replicas reach agreement amongst
themselves about what to do
• Exchange calculated value and have each node pick winner
[Figure: replicas A, B, and C exchange their computed values; two replicas computed 4, one computed 5]
Replica   Receives    Action
A         4, 4, 5     4
B         4, 4, 5     4
C         4, 4, 5     4 (?)
20
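A sketch of the voter-free version: each replica collects everyone's value and picks the winner locally (the function name and values are illustrative):

```python
from collections import Counter

def local_decision(own_value, received_values):
    """Each replica combines its own result with the values it received
    and picks the most common one -- no separate trusted voter needed."""
    return Counter([own_value, *received_values]).most_common(1)[0][0]

# A and B computed 4, C computed 5; after exchanging, everyone sees {4, 4, 5}
print(local_decision(4, [4, 5]))   # A decides 4
print(local_decision(4, [4, 5]))   # B decides 4
print(local_decision(5, [4, 4]))   # C decides 4 (if it follows the protocol)
```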
Byzantine Generals Problem
• There are N generals making plans for an attack
• They need to decide whether to Attack or Retreat
• Send your vote to everyone (0=retreat, 1=attack)
• But f generals may be traitors that lie and collude
• Can all correct replicas agree on what to do?
• Take majority vote of planned actions
[Figure: generals A, B, and C exchange votes; A votes 1 (attack), B votes 0 (retreat), and the traitor C tells A "1" but tells B "0"]
Replica   Receives    Action
A         1, 0, 1     Attack!
B         1, 0, 0     Retreat!
C         1, 0, ?     ???
Majority voting doesn't work if a replica lies!
21
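The failure mode above can be replayed in a few lines; this sketch just reproduces the slide's example, where traitor C tells A one thing and B another:

```python
from collections import Counter

def decide(votes):
    """Majority vote over the plans this general has heard (1=attack, 0=retreat)."""
    return "Attack" if Counter(votes)[1] > len(votes) / 2 else "Retreat"

# A votes 1 and B votes 0; traitor C tells A "1" but tells B "0"
a_hears = [1, 0, 1]   # A's own vote, B's vote, what C told A
b_hears = [1, 0, 0]   # A's vote, B's own vote, what C told B

print(decide(a_hears))   # Attack!
print(decide(b_hears))   # Retreat! -- the two loyal generals now disagree
```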
Byzantine Generals Solved!
• Need more replicas to reach consensus
• Requires 3f+1 replicas to tolerate f byzantine faults
• Step 1: Send your plan to everyone
• Step 2: Send learned plans to everyone
• Step 3: Use majority of each column
[Figure: four generals A, B, C, D exchange plan vectors; the traitor C reports different vectors to different generals]
What replica A hears:            Majority of each column:
  A: (1,0,1,1)                   A: 1
  B: (1,0,0,1)                   B: 0
  C: (1,1,1,1)                   C: 1
  D: (1,0,1,1)                   D: 1
What replica B hears:            Majority of each column:
  A: (1,0,1,1)                   A: 1
  B: (1,0,0,1)                   B: 0
  C: (0,0,0,0)                   C: 0
  D: (1,0,0,1)                   D: 1
• The loyal replicas agree on every loyal general's plan (A=1, B=0, D=1); they can only disagree about the traitor C, whose vote does not matter
22
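A simplified sketch of the voting step for f = 1. The full Byzantine Generals algorithm is recursive; this only shows the "majority of each column" step over the relayed plan vectors, with values echoing the slide's example:

```python
from collections import Counter

def column_majority(reported_vectors):
    """Take the majority of each column: one column per general's plan."""
    n = len(reported_vectors[0])
    return [Counter(vec[i] for vec in reported_vectors).most_common(1)[0][0]
            for i in range(n)]

# Four generals A, B, C, D; at most f = 1 traitor, so 3f + 1 = 4 is enough.
# After step 2, a loyal general holds the plan vector that each general
# claims to have heard (order of entries: A, B, C, D).
reports_seen_by_A = [
    [1, 0, 1, 1],   # A's own plan vector
    [1, 0, 0, 1],   # what B relayed
    [1, 1, 1, 1],   # what the traitor C relayed (inconsistent with what it told B)
    [1, 0, 1, 1],   # what D relayed
]
print(column_majority(reports_seen_by_A))   # [1, 0, 1, 1]
```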
Can we make this any easier?
• Fundamental challenge in BFT:
• Nodes can misrepresent what they heard from other nodes
• How can we fix this?
• Have nodes sign messages!
• Then liars can't forge messages with false information
• Crypto actually is useful!
23
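A sketch of signed messages using Ed25519. This assumes the third-party Python cryptography package; any digital signature scheme gives the same guarantee, namely that a traitor cannot fabricate or alter "X told me Y" without X's private key:

```python
# Assumes the third-party "cryptography" package (pip install cryptography)
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Each replica holds its own private key and publishes the public key
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

plan = b"general A: attack"
signature = private_key.sign(plan)       # A attaches this when sending its plan

# Any replica that relays or receives the message can check it
try:
    public_key.verify(signature, plan)
    print("plan is genuine")
except InvalidSignature:
    print("forged or altered plan")

# A traitor relaying a modified plan is caught immediately
try:
    public_key.verify(signature, b"general A: retreat")
except InvalidSignature:
    print("tampering detected")
```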
Denial of Service
• Attack to reduce the availability of a service
• Can also cause crashes if software is poorly written
• "Unsophisticated but effective"
• Flood target with traffic
• No easy way to differentiate between a valid request and an
attacker
[Figure: attack traffic and valid requests both flooding toward Amazon.com]
24
Sept 2012 DDoS
• Six US banks attacked
• Attacks were announced in advance
• Banks still could not prevent the damage
• Attackers sent 65 gigabits of traffic per second (65 Gbps)
• Iranian "Cyber Fighter" group claimed
responsibility
• Encouraged members to use Low Orbit Ion Cannon
software to flood banks with traffic
• Also used botnets as a traffic source
25
Sept 2012 DDoS
• But it's not clear if that was the real source...
• Botnet machines have relatively low bandwidth
• Would need 65,000+ compromised machines
• Most traffic to the banks was coming from about 200
IP addresses
• Appear to be a small set of compromised high powered web servers
• Not clear if Iranian hacker group did all of this or if
some other group was the mastermind
• Iranian government fighting against sanctions?
• Eastern European crime groups make fraudulent purchases and
then disrupt bank web activity long enough for them to go through
26
Anonymous (?)
• The Anonymous "hacktivist" group has used
DDoS for various political causes
• Members run LOIC software and target a specific site
• "Volunteer bot net"
• But be careful...
• In March 2012 the LOIC software had a trojan
• Ran a DDoS on the enemy...
• And stole your bank and gmail account info
27
Defending against DDoS
• Some DDoS traffic can be easily distinguished
• Most web apps can safely ignore ICMP and UDP traffic
• But performance impact will depend where filtering is
performed
• Firewall on server being attacked may limit impact on application, but still
clogs network
• Firewall at ISP is much better, but may be under someone else's control
[Figure: attack traffic passes through the ISP before reaching Amazon.com; filtering can be done at either point]
28
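A toy sketch of the filtering decision described above (the packet representation is invented for illustration; a real deployment would use firewall rules rather than application code). The same rule is far more effective at the ISP, where it keeps junk traffic off the victim's link entirely:

```python
# Toy filter: a web service only needs TCP to its web ports, so ICMP and UDP
# floods are easy to drop. (The packet dict is an illustrative stand-in for
# whatever a real firewall or packet-processing layer provides.)

ALLOWED_TCP_PORTS = {80, 443}

def should_accept(packet: dict) -> bool:
    proto = packet["protocol"]
    if proto in ("ICMP", "UDP"):
        return False                               # obvious junk for a web app
    return proto == "TCP" and packet.get("dst_port") in ALLOWED_TCP_PORTS

print(should_accept({"protocol": "ICMP"}))                    # False
print(should_accept({"protocol": "UDP", "dst_port": 53}))     # False
print(should_accept({"protocol": "TCP", "dst_port": 443}))    # True
```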
Summary
• Software systems must worry about:
• Hardware and software failures
• Service availability
• Malicious attacks that affect reliability and/or availability
• Approaches:
• Redundancy
• Fault mitigation
• Fault detection
29