CSC310
Fault Tolerance
SPRING 2021
Lecture 01 - Introduction
Instructor: Dr. Tarek Abdul Hamid
Motivation
What is Fault-Tolerance?
A “fault-tolerant system” is one that continues to perform at the
desired level of service in spite of failures in some of the components
that constitute the system.
2 Fault Tolerant Computing Dr. Tarek abdul Hamid
Motivation
Key attributes
Fault - Error - Failure
Performance - Availability - Reliability
More recently, the concept of “survivability”
Including these constraints at the design stage is likely to be
more cost-effective.
Motivation
Who is concerned about fault-tolerance?
System Users – irrespective of the application but some are a lot
more concerned than others
Who is concerned at design stages?
Universities
R, d, and a (Research, development, applications)
Industry
r, D, and A (research, Development, Applications)
Issues
Design, Analysis/Validation, Implementation, Testing/Validation,
Evaluation
Motivation
Examples
General Purpose Systems
PCs: RAMs with parity checks and possibly ECC (Error
Correction Code)
(consideration of re-execution on failure detection is being investigated)
Workstations/Servers: error detection (HW), occasional corrective
action (SW), Even ECC (HW), keeping log (SW)
Motivation
Examples
Reliable Systems
Telephone systems
Banking systems e.g. ATM
Stock market
CAE (Cambridge English) - exams/projects
Football games - display/ticketing
Motivation
Examples
Critical and Life Critical Systems
Manned and unmanned space borne systems
Aircraft control systems
Nuclear reactor control systems
Life support systems
Motivation
Examples
Reliable -> Critical Systems
911 telephone switching system
Traffic light control system
Automotive control systems (ABS, Fuel injection system)
Introduction
New initiatives
Goals of fault-tolerance
Applications of fault-tolerance
Introduction
New initiatives
Density of devices: more failures likely
Power issue – scheduler, on-chip sensors
Failures due to soft errors, lifetime degradation
- hardening, re-execution,
- on-chip ECC
- reconfiguration
- micro-architectural solutions
- architectural solutions
Introduction
New initiatives (contd.)
Deep submicron technology and time-to-market pressure:
designs not fully verified
Implementation of numerous functionalities on
chip/board/system: possibility of system hang-up
Speculative execution: results may need to be re-checked
Low cost of HW and SW: affordable/economical
Hot issues: soft errors, lifetime failures, power and
thermal management
Introduction
Goals - different goals for different applications
The key word is “reliability”, which has different meanings for different users and
applications
Intuitive explanations
Dependability
Service
Specification
Introduction
Intuitive concepts
Reliability – continues to work
Availability – works when I need it
Safety – does not put me in jeopardy
Performability
Maintainability
Testability
Survivability – will the system survive catastrophic events?
Security
Introduction
Applications
Space borne system
long life system
Airplane control system
critical system
Transaction processing system
high availability system
Switching system
high availability over certain level of performance
Terminology and definitions
Reliability and concept of probability
R(t): conditional probability that a system provides continuous
proper service in the interval [0,t] given that it provided desired
service at time 0.
Availability
Performability
An Example
Dependability
Security
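As a rough illustration of how R(t) is often quantified, the sketch below assumes a constant failure rate λ (the exponential model), which gives R(t) = exp(-λt). It also shows the commonly used steady-state availability MTTF / (MTTF + MTTR). Neither formula appears on the slide; both, along with the function names and the sample numbers, are assumptions made for illustration.

```python
import math


def reliability(t_hours, failure_rate_per_hour):
    """R(t) = exp(-lambda * t).

    Valid only under the assumed constant-failure-rate (exponential) model.
    """
    return math.exp(-failure_rate_per_hour * t_hours)


def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    lam = 1e-4  # hypothetical failure rate: one failure per 10,000 hours
    print("R(1000 h) =", reliability(1000, lam))
    print("Availability =", steady_state_availability(10_000, 10))
```

Note that R(0) = 1 matches the condition in the definition: the system is assumed to deliver the desired service at time 0.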
System Defined (1/4)
“. . . an entity that interacts with other entities”
First entity (system) – limited to be “electronic (mostly digital)” or “computer
based”
Second entity
Hardware, software, human, other systems, .. (can also be called “environment”)
Characterization and fundamental properties
Functionality
Performance
Dependability and security
Cost
(usability, manageability, adaptability: not directly included in the paper)
System Defined (2/4)
Function – “ what the system is intended to do”
functional specifications: describe it in terms of functionality and performance
behavior – described as a sequence of states to implement the functionality
Total states – set of states as system evolves
Internal states
External states – as viewed by the environment and users
Structure – “What enables system behavior (function)”
Interconnected components – recursively defined to “atomic” level
System Defined (3/4)
System Life Cycle
Development phase
Use phase
Service – what is delivered by the system to its “environment” (user)
Environment sees only the “external states”
Development Phase – activities from concept to decision that system is ready
for “use phase”
Use Phase - More meaningful and includes service delivery, service outage,
service shutdown, maintenance
System Defined (4/4)
Development phase environment
Physical world
Human developers
Development tools
Production and test facilities
User phase environment
Physical world
Administrators – maintainers
Users and intruders
Providers and infrastructure
Dependability/Security Attributes (1/6)
Original definition: “ability to deliver service that can justifiably be
trusted”
Encompassing the following attributes
Availability
Reliability
Safety
Integrity
Maintainability
Dependability/Security Attributes (2/6)
New definition: “ability to avoid service failures that are more frequent or
more severe than is acceptable” - deliver service that can justifiably be
trusted
Reason for modification
Security related issues
This recognizes that a system can, and usually does, fail and can still be called
dependable
This definition also enables a connection with “development failures”
Dependability/Security Attributes
(3/6)
Dependability
availability: readiness for correct service.
reliability: continuity of correct service.
safety: absence of catastrophic consequences on the user(s) and the
environment.
integrity: absence of improper system alterations.
maintainability: ability to undergo modifications and repairs.
When addressing security, an additional attribute is needed: confidentiality,
the absence of unauthorized disclosure of information.
Dependability/Security Attributes
(4/6)
Security is the concurrent existence of the following attributes:
1) availability (for authorized actions only),
2) confidentiality, and
3) integrity (with “improper” meaning “unauthorized”)
Dependability/Security Attributes
(5/6)
Dependability/Security Attributes
(6/6)
Other related concepts are
Dependability
High confidence
Survivability
Trustworthiness
Example: all these have similar goals, such as
1) ability to deliver service,
2) predictable service,
3) fulfilling the mission,
4) assurance of expected service delivery
Threats and modeling threats
(1/12)
Different phases are open to different types of threats – generally termed
as “Faults”
Faults lead to “Errors” – a total state of the system different from the
“true total state”
Errors can lead to “Failure” – the service deviates from the desired
service
This creates a FEF chain – a hierarchical phenomenon
Threats and modeling threats
(2/12)
Fault activation – Error manifestation – Failure
Fault – active or dormant
Error – masked or latent
Failure – incorrect response
Threats and modeling threats
(3/12)
FEF chain in a hierarchy
Threats and modeling threats
(4/12)
Fault classes
Groups (not exclusive)
Development, Physical (those that affect hardware), Interaction
Viewpoints:
phase, system boundary, cause, dimension, objective, intent, capability,
persistence
Threats and modeling threats
(5/12)
Fault Taxonomy and Examples
Production defect: physical, hardware, natural
Bug: development, software, human-made
Omission (absence of an action): human-made, system generated
Malicious (meant to cause harm): human-made, hardware or software
Threats and modeling threats
(6/12)
Fault Taxonomy (contd.)
Permanent faults
Intermittent faults – repeat at some interval
Transient faults – no specific interval
Malicious logic faults – human-made (e.g., logic bombs, Trojan horses, viruses), not natural faults
Intrusion attempts – caused by humans
Interaction faults – may be development phase or use phase
Configuration faults – incorrect setting of parameters
Threats and modeling threats (7/12)
Error classes
Detected
Latent
An example
An adder gives incorrect sum for certain operands
Fault is active when those operands appear, otherwise it is dormant
Incorrect sum is latent unless used or checked for correctness
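The adder example above can be sketched in code. The faulty operand pair (7, 7), the wrong result, and the function names are hypothetical, chosen only for illustration. The check by inverse operation assumes the subtraction itself is fault-free.

```python
def faulty_add(a, b):
    """Adder with a hypothetical design fault affecting one operand pair."""
    if a == 7 and b == 7:   # these operands activate the fault
        return 15           # erroneous sum (the correct value is 14)
    return a + b            # for all other operands the fault stays dormant


def checked_add(a, b):
    """Turn a latent error into a detected one by checking the sum."""
    s = faulty_add(a, b)
    if s - b != a:          # inverse-operation check, assumed fault-free
        raise ArithmeticError(f"adder error detected for operands {a}, {b}")
    return s


if __name__ == "__main__":
    print(faulty_add(2, 3))        # fault dormant: correct sum
    latent = faulty_add(7, 7)      # fault active: error exists but is latent
    try:
        checked_add(7, 7)          # the same error, now detected
    except ArithmeticError as exc:
        print("detected:", exc)
```

The value stored in `latent` is wrong, but nothing fails until it is used or checked, which is the distinction the slide draws between a latent and a detected error.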
Threats and modeling threats
(8/12)
Failure classes
Development failures
Service failures
Security failures
Threats and modeling threats
(9/12)
Development failures – introduced during the development phase
Human developers
Tools
Production facility
Budgetary reasons
Scheduling issue (time to market)
(basically the system delivered is a downgraded system)
Threats and modeling threats
(10/12)
Service failures - delivery of incorrect service – Four viewpoints
1. Failure domain
Content failure
Timing failure – early or late delivery of the service(s)
Special case: silent failure, halt failure, crash failure
Erratic failure (like Byzantine failure)
Threats and modeling threats
(11/12)
2. Failure detectability
Signal provided by some checking mechanism
Signaled failure
Unsignaled failure
False alarm
3. Consistency
Consistent failure – all users perceive the incorrect service identically
Inconsistent failure – different users perceive the incorrect service differently
(like Byzantine failure)
Threats and modeling threats
(12/12)
4. Consequence of failure
Need to rate the failure and hence develop criteria – examples:
Duration of outage (availability related)
Lives being endangered (safety related)
Extent of corrupted service (integrity related)
Amount of information disclosed (confidentiality related)
Means to attain dependability
(1/6)
• Fault Prevention or Fault Avoidance
Improvement of development process
Elimination of causes that can induce faults
• Fault Tolerance
Techniques and implementations (more later)
Means to attain dependability (2/6)
• Fault Removal
Remove faults during development phase – extensive simulation and validation
Testing
• Deterministic testing
• Random and statistical testing
• Back to back testing
Test/validation quality: fault injection, design for
test/verification
Means to attain dependability
(3/6)
• Fault Forecasting – evaluate the system behavior and then use
one or more methods previously discussed to improve
dependability
Qualitative evaluation
Quantitative evaluation
Use benchmarks
Use of simulators
Examples: 1) Error and failure logs
2) when and where commissioned
Means to attain dependability
(4/6)
Fault Tolerance Techniques
• Error detection - need redundancy
Duplicate execution
Use of parity
Checker programs and/or hardware
More later
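Two of the detection techniques listed above, parity and duplicate execution, can be sketched as follows. This is a minimal illustration; the function names and the choice of even parity are assumptions, not taken from the slides.

```python
def encode_with_parity(bits):
    """Append one even-parity bit: the redundancy that makes detection possible."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A word with even parity must contain an even number of 1s."""
    return sum(word) % 2 == 0


def duplicate_execute(fn, *args):
    """Run the same computation twice and compare the results."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError("mismatch between duplicate executions")
    return first


if __name__ == "__main__":
    word = encode_with_parity([1, 0, 1, 1])
    print(parity_ok(word))      # undamaged word passes the check
    word[2] ^= 1                # inject a single-bit error
    print(parity_ok(word))      # the check now fails
    print(duplicate_execute(sum, [1, 2, 3]))
```

A single parity bit detects any odd number of flipped bits but misses an even number, and duplicate execution on the same hardware cannot catch a permanent fault that corrupts both runs identically.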
Means to attain dependability
(5/6)
• Recovery - Key is redundancy
Error handling
• Masking and compensation
• Rollback
• Rollforward
Fault handling
• Diagnosis
• Isolation
• Reconfiguration
• Initialization
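Rollback, one of the error-handling options listed above, can be sketched as checkpointing state and restoring it when an error is detected. The class, its non-negative-balance acceptance test, and all names below are hypothetical and serve only to illustrate the idea.

```python
import copy


class CheckpointedAccount:
    """Toy state holder with checkpoint/rollback (backward error recovery)."""

    def __init__(self, balance=0):
        self.state = {"balance": balance}
        self._saved = copy.deepcopy(self.state)   # initial checkpoint

    def checkpoint(self):
        """Save a redundant copy of the current (assumed good) state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the last saved state, discarding later changes."""
        self.state = copy.deepcopy(self._saved)

    def apply(self, delta):
        """Apply an update; roll back if the acceptance test fails."""
        self.state["balance"] += delta
        if self.state["balance"] < 0:   # hypothetical acceptance test
            self.rollback()
            return False
        self.checkpoint()
        return True


if __name__ == "__main__":
    acct = CheckpointedAccount()
    print(acct.apply(5), acct.state)     # accepted and checkpointed
    print(acct.apply(-10), acct.state)   # rejected, state rolled back
```

The saved copy is the redundancy that makes recovery possible. Rollforward would instead move the system to a new error-free state rather than returning to an earlier one.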
Means to attain dependability
(6/6)
Key to fault tolerance
• Break FEF chain
• Use “redundancy” to improve “use phase” dependability and security
• See next “fundamental principles”
Fundamental Principles
Hardware redundancy
Low level
High level
Software Redundancy
Time Redundancy
Information Redundancy
Fundamental Principles
Hardware Redundancy - Low level
logic level
Example 1 - Self checking circuits
Example 2 - Arithmetic code
A modular adder using the mathematical principle
(A+B) mod k = ((A mod k) + (B mod k)) mod k
Hardware Redundancy - High level
Triplicate or 5-copies as in space shuttle
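Both levels of hardware redundancy can be imitated in software for illustration. The first function checks an addition with the residue identity (A+B) mod k = ((A mod k) + (B mod k)) mod k; the second takes a majority vote over replicated outputs, as in triplication. The modulus k = 3, the injectable `adder` parameter, and the function names are assumptions of this sketch.

```python
from collections import Counter


def residue_checked_add(a, b, k=3, adder=lambda x, y: x + y):
    """Add with `adder`, then verify the sum against its residues mod k."""
    s = adder(a, b)
    if s % k != ((a % k) + (b % k)) % k:
        raise ArithmeticError("residue check failed")
    return s


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    return value


if __name__ == "__main__":
    print(residue_checked_add(10, 5))
    print(majority_vote([4, 4, 9]))   # one faulty replica is outvoted
```

The residue check cannot detect an error that is a multiple of k, so the choice of k trades checker cost against coverage. With three replicas the voter masks one faulty output; with five it masks two.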
Fundamental Principles
Software Redundancy
Use two different programs/algorithms
Time Redundancy
Re-compute or redo the task and compare the results
May or may not use the same hardware/software
Information Redundancy
backup information
Use of ECC
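As one concrete instance of information redundancy through ECC, the sketch below implements a Hamming(7,4) code, which adds three check bits to four data bits and corrects any single-bit error. The slides do not say which code is meant; Hamming(7,4) and the function names are chosen here only as a small, well-known example.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword laid out [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                      # inject a single-bit error at position 5
    print(hamming74_decode(word))
```

This code corrects one flipped bit per codeword. Two flipped bits produce a nonzero syndrome that points at the wrong position, so the decoder would miscorrect; practical memories usually add an extra parity bit to detect that case.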
Fault-Error-Failure
Intuitive definitions
Origins of faults
Methods to break FEF chain
Attributes of faults
Fault-Error-Failure concept
Intuitive definitions
Fault -
An anomalous physical condition caused by a manufacturing
problem, fatigue, external disturbance (intentional or
unintentional), design flaw, …
Causes
Error - Effect of activation of a fault
Failure - over-all system effect of an error
Fault -> Error -> Failure
Fault-Error-Failure concept
Origins of faults
Physical device level (HW)
Logic level (HW)
Chip level (HW)
System level (HW/SW)
interfacing, specifications, …
Why systems fail
Fault-Error-Failure concept
Methods to break FEF chain
Flow FEF
Barriers
Fault avoidance
Fault masking
Fault removal
Fault forecasting
Fault-Error-Failure concept
Attributes of faults
Cause
Nature
Duration
Extent
Value