Fault Tolerant Software
Safety Critical Computer Systems
By: Nima Jafari Navimipour (Ph.D.)
[email protected]
Safety Critical Computer Systems BY: NIMA JAFARI NAVIMIPOUR
Chapter 1
Introduction
▪The use of computer-based systems has increased dramatically.
▪Most industries are highly dependent on computers for their basic day-to-day functioning.
▪Safe and reliable software operation is a significant requirement for many types of systems:
▪ aircraft and air traffic control
▪ medical devices
▪ nuclear safety
▪ petrochemical control
▪ electronic banking and commerce
▪ automated manufacturing
Introduction
▪The cost and consequences of these systems failing can range from mildly annoying to catastrophic.
▪Ideally, the processes by which software is conceptualized, created, analyzed, and tested
▪ would have advanced to the point where software could be developed without errors.
▪The current state of the practice is such that fewer errors are introduced,
▪ but unfortunately not all errors are prevented.
▪Even if the best people, practices, and tools are used,
▪ it would be very risky to assume that the software developed is error-free.
▪There may also be cases in which an error found late in the system's life cycle
▪ is too expensive to repair.
Examples
▪The AT&T system
▪ In January 1990, it suffered a nine-hour, United States-wide blockade when one switch experienced abnormal behavior and attempted recovery. Because of a flaw in the recovery recognition software (present in all 114 switches) and a network design that permitted the effects to propagate, the problem spread to all switches.
▪The aerospace industries
▪ Apollo 11
▪ Atlantis (STS-36)
▪ Endeavour (STS-49)
▪ Intelsat 6
Examples
▪During the Persian Gulf War
▪ clock drift in the Patriot system caused it to miss a Scud missile that hit an American barracks in Dhahran. The strike killed 29 people and injured 97 others. The clock drift was reportedly caused by the software's use of two different and unequal representations of time (24-bit and 48-bit). As with most complex systems, the source of the problem was multifaceted, with software being one of several problem sources.
▪Several Airbus A320 problems
▪ were blamed on the pilots and their skill in handling anomalous situations.
▪ questions remain about the role software may have played in these incidents.
▪Therac-25
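The Patriot clock drift above can be reproduced as back-of-the-envelope arithmetic. The sketch below relies on figures from public analyses of the incident, not from these slides, so treat them as assumptions: time counted in 0.1 s ticks, the constant 0.1 truncated to 23 fractional bits of a 24-bit register, and about 100 hours of continuous operation. All names are illustrative.

```python
from fractions import Fraction

# Assumptions (from public analyses, not from the slides):
FRACTION_BITS = 23          # fractional bits available for the constant 0.1
TICK = Fraction(1, 10)      # the clock counts tenths of a second
HOURS_UP = 100              # continuous uptime before the incident

# Truncate (chop) 0.1 to the available fractional bits.
stored_tick = Fraction(int(TICK * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = TICK - stored_tick      # seconds lost on every tick

ticks = HOURS_UP * 3600 * 10             # ten ticks per second
drift_seconds = float(error_per_tick * ticks)

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"drift after {HOURS_UP} h: {drift_seconds:.3f} s")
```

Under these assumptions the accumulated drift is roughly a third of a second, which is enough to displace the tracking window of a fast-moving target by several hundred metres.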
Software vs. Hardware faults
▪Hardware faults are primarily physical faults
▪ which can be characterized and predicted over time.
▪Software does not physically wear out or otherwise physically deteriorate with time.
▪Software has only logical faults
▪ which are difficult to visualize, classify, detect, and correct.
▪Software faults may be traced to
▪ incorrect requirements, where the software matches the requirements, but the behavior specified in the requirements is not appropriate.
▪ an implementation that does not satisfy the requirements.
▪To protect against these faults
▪ redundancy
A Few Definitions
▪A fault
▪ is the identified cause of an error
▪An error
▪ is part of the system state that is liable to lead to a failure
▪ may propagate, that is, produce other errors.
▪A failure
▪ occurs when the service delivered by the system deviates from the specified service
▪ otherwise termed an incorrect result
▪Software fault tolerance
▪ prevents failures by tolerating faults whose occurrence becomes known when errors are detected.
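A minimal sketch of the fault, error, failure chain defined above. The functions `average` and `classify` and the alarm limit are hypothetical; the point is only that an error (a wrong internal state) does not always become a failure (a wrong delivered result).

```python
def average(values):
    # FAULT: the divisor should be len(values); this defect is the cause.
    return sum(values) / (len(values) + 1)


def classify(readings, limit=100.0):
    mean = average(readings)  # ERROR: 'mean' now holds an incorrect value
    return "alarm" if mean > limit else "ok"


# Error present but masked: the wrong mean (10.0 instead of 15.0) still
# gives the specified result "ok", so no failure is observed.
masked = classify([10, 20])

# Error propagates to the output: the wrong mean (80.0 instead of 120.0)
# yields "ok" where the specification requires "alarm". This is a failure.
failed = classify([150, 90])
```

The same fault is active in both calls; only the second one lets the error reach the service delivered to the user.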
Dependability concept classification
Fault Avoidance or Prevention
▪reduces the number of faults introduced during construction
▪contributes to system dependability through
▪ rigorous specification of system requirements
▪ structured design and programming methods
▪ formal methods with mathematically tractable languages and tools
▪ software reusability.
Fault Avoidance or Prevention
▪System Requirements Specification
▪ A system failure may occur due to logic errors incorporated into the requirements.
▪ This often occurs,
▪ since the majority of safety problems arise from software requirements errors and not coding errors.
▪ Existing software engineering techniques address only the errors that occur when the design and implementation in a programming language do not match or satisfy the system requirements.
Fault Avoidance or Prevention
▪Structured Design and Programming Methods
▪ reduce the complexity and interdependency of components
▪ principles of decoupling and modularization
▪ encapsulate design decisions and hide them from other components
▪ reduces overall complexity of the software
▪ making it easier to understand and implement
▪ reduces the introduction of faults into the software.
Fault Avoidance or Prevention
▪Formal Methods
▪ improve software dependability during construction
▪ requirements specifications are developed and maintained using
mathematically tractable languages and tools
▪ four goals of current formal methods studies:
▪ executable specifications for systematic and precise evaluation
▪ proof mechanisms for software verification and validation
▪ development procedures that follow incremental refinement for step-by-step verification
▪ every work item is subject to mathematical verification for correctness and appropriateness
▪ formal methods have not generally been used on large projects.
Fault Avoidance or Prevention
▪Software Reuse
▪implies savings in development cost
▪can increase dependability
▪object-oriented paradigms and techniques encourage and support software reuse
▪Different measures of dependability may not be improved equally by reuse of software
▪ For example, highly reliable software may not necessarily be safe.
Fault Removal
▪Despite fault prevention efforts, faults are created, so fault
removal is needed.
▪improves software dependability by
▪ detecting existing faults using verification and validation (V&V) methods
▪ eliminating the detected faults.
▪contributes to system dependability using:
▪ software testing
▪ formal inspection
▪ formal design proofs
Fault Removal
▪ Software testing
▪the most common fault removal technique
▪exhaustive testing is prohibitively costly and complex
▪additional testing
▪ targets critical components
▪ may reveal unforeseen problems
Fault Removal
▪ Formal inspection
▪ widely implemented in industry
▪ rigorous process, accompanied by documentation
▪ focuses on examining source code
▪ to find faults
▪ correcting the faults
▪ verifying the corrections
Fault Removal
▪ Formal design proofs
▪related to formal methods
▪mathematical proof of correctness
▪costly and complex technique to use
▪is feasible
▪ if performed on a relatively small portion of the code
▪gives the designer a high degree of confidence in the correctness of that portion
Fault/Failure Forecasting
▪estimates the presence of faults
▪focuses on the reliability measure of dependability
▪involves
▪ formulation of a fault/failure relationship
▪ understanding of the operational environment
▪ establishment of reliability models
▪ collection of failure data
▪ application of reliability models by tools
▪ selection of appropriate models
▪ analysis and interpretation of results
▪ guidance for management decisions
Fault/Failure Forecasting
▪reliability estimation
▪ determines current software reliability
▪ by applying statistical inference techniques to failure data
▪ measures the reliability achieved up to the time of estimation
▪reliability prediction
▪ determines future software reliability based upon available software metrics and measures
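As a minimal sketch of reliability estimation, the code below fits the simplest possible model to failure data. It assumes a constant failure rate (an exponential model), which is an assumption made here for illustration and not a model named in the slides; the interfailure times are hypothetical.

```python
import math


def estimate_failure_rate(interfailure_hours):
    """Maximum-likelihood failure rate under an assumed exponential model."""
    return len(interfailure_hours) / sum(interfailure_hours)


def reliability(rate, mission_hours):
    """Probability of running failure-free for mission_hours."""
    return math.exp(-rate * mission_hours)


# Hypothetical times between observed failures, in hours.
observed = [120.0, 95.0, 210.0, 175.0]

rate = estimate_failure_rate(observed)        # failures per hour
mtbf = 1.0 / rate                             # mean time between failures
ten_hour_reliability = reliability(rate, 10.0)

print(f"rate={rate:.5f}/h  MTBF={mtbf:.1f} h  R(10 h)={ten_hour_reliability:.4f}")
```

In practice several reliability models are usually compared against the collected failure data before any single estimate is used to guide decisions.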
Fault Tolerance
▪one way to reduce the risks of software design faults
▪enhances software dependability
▪when a fault occurs
▪ prevents system failure from occurring
▪provides service complying with the relevant specification in spite of faults
▪has been used in the
▪ nuclear power, healthcare, telecommunications, and ground transportation industries
Fault Tolerance
▪Single Version Software Environment
▪ tolerate software design faults
▪ decision verification
▪ exception handling
▪Multiple Version Software Environment
▪ utilize functionally equivalent yet independently developed software versions
▪ N-version programming
▪Multiple Data Representation Environment
▪ utilize different representations of input data
▪ N-copy programming
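The multiple version idea above can be sketched as N-version programming with a majority voter. The three versions below are hypothetical stand-ins for independently developed implementations of the same specification (sum of 1..x); one carries a seeded design fault so the voter has something to mask. The function names and the voting rule are illustrative assumptions.

```python
from collections import Counter


def version_a(x):
    # Hypothetical version 1: closed-form formula.
    return x * (x + 1) // 2


def version_b(x):
    # Hypothetical version 2: explicit loop.
    total = 0
    for i in range(1, x + 1):
        total += i
    return total


def version_c(x):
    # Hypothetical version 3 with a seeded fault: omits x itself.
    return sum(range(1, x))


def n_version_execute(versions, x):
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a crashing version simply casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


result = n_version_execute([version_a, version_b, version_c], 10)
print(result)
```

The voter only helps when the versions fail differently; if a majority shares the same fault, the wrong value wins the vote.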
Recovery
▪the software fault tolerance process consists of
▪ error detection, diagnosis, isolation or containment, and recovery
▪Types of Recovery
▪ backward recovery
▪ forward recovery
Backward Recovery
▪Advantages
▪ can handle unpredictable errors if the errors do not affect the recovery
mechanism
▪ provides a general recovery scheme
▪ requires no knowledge of the errors in the system state
▪ The only knowledge required by backward recovery is that the relevant prior
state is error-free
▪ is particularly suited to recovery from transient faults
Backward Recovery
▪Disadvantages
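▪requires significant resources to save and restore state
▪requires that the system be halted temporarily
▪a domino effect may occur, in which one rollback forces a cascade of further rollbacks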
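Backward recovery as described above can be sketched with a checkpoint and a rollback. The `CheckpointedLedger` class, its fields, and the "negative balance" error check are hypothetical; the sketch assumes the state saved at the checkpoint is error-free, which is the only knowledge backward recovery needs.

```python
import copy


class CheckpointedLedger:
    """Hypothetical component protected by checkpoint and rollback."""

    def __init__(self, balance):
        self.state = {"balance": balance, "history": []}
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a copy of the current (assumed error-free) state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: discard the current state, restore the saved one.
        self.state = copy.deepcopy(self._saved)

    def withdraw(self, amount):
        self.checkpoint()
        self.state["balance"] -= amount
        self.state["history"].append(("withdraw", amount))
        if self.state["balance"] < 0:  # error detection
            self.rollback()
            return False
        return True


ledger = CheckpointedLedger(100)
first = ledger.withdraw(30)    # accepted
second = ledger.withdraw(500)  # detected as erroneous and rolled back
```

The deep copies are where the resource cost listed under the disadvantages shows up: every checkpoint duplicates the whole protected state.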
Forward Recovery
▪Advantages
▪ is fairly efficient in terms of the overhead
▪ faults involving missed deadlines may be better handled by forward recovery
▪Disadvantages
▪ is application-specific
▪ can only remove predictable errors
▪ requires knowledge of the error
▪is primarily used when there is no time for backward recovery.
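A minimal sketch of forward recovery, under the assumption of a hypothetical sensor stream with a known valid range. When an erroneous reading is detected, nothing is re-executed; a new acceptable state is constructed from what is already known (here, the last good value) and processing continues.

```python
def filter_readings(readings, low=0.0, high=150.0, initial=20.0):
    """Replace detectably bad readings with the last good value."""
    last_good = initial
    accepted = []
    for value in readings:
        if value is None or not (low <= value <= high):
            # Error detected: move forward to a usable state
            # instead of rolling back and re-running.
            value = last_good
        last_good = value
        accepted.append(value)
    return accepted


cleaned = filter_readings([21.0, 999.0, None, 23.5])
print(cleaned)
```

The correction rule is specific to this application and only handles the error types anticipated in the check, which mirrors the disadvantages listed above.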
Types of Redundancy for Software Fault Tolerance
▪Redundancy
▪ A key supporting concept for fault tolerance
▪Hardware
▪Software
▪Information
▪Time
Hardware Redundancy
▪includes replicated and supplementary hardware added to the system to
support fault tolerance
▪Redundant or diverse software can reside on the redundant hardware to
tolerate both hardware and software faults
Software redundancy
▪includes additional programs, modules, functions, or objects used to support fault tolerance
▪the concept was borrowed from hardware fault tolerance approaches
▪hardware faults are typically random, due to component aging and environmental effects
▪software faults overwhelmingly arise from specification and design errors or implementation (coding) mistakes.
▪To tolerate software faults
▪ failures arising from these design and implementation problems must be detected
▪ identical copies would repeat the same fault, so diversity is introduced into the software replicas
Information or data redundancy
▪use of additional information with data and the use of additional forms of data to assist in fault tolerance
▪error-detecting and error-correcting codes
▪a data re-expression algorithm (DRA) produces different representations of a module's input data
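Data re-expression can be sketched with one module run on several equivalent forms of its input, followed by a vote. Everything here is a hypothetical illustration: `magnitude` is an invented module with a seeded fault, and the three re-expressions are chosen because they leave the correct output unchanged, so no output adjustment is needed.

```python
import math
from collections import Counter


def magnitude(a, b):
    """Hypothetical protected module: length of the vector (a, b).
    Seeded fault for illustration: the special case a == 0 is wrong."""
    if a == 0:
        return 0.0  # fault: should be abs(b)
    return math.sqrt(a * a + b * b)


# Exact re-expressions: each maps (a, b) to a different input that has
# the same correct output.
RE_EXPRESSIONS = [
    lambda a, b: (a, b),   # identity (original data)
    lambda a, b: (b, a),   # swap components
    lambda a, b: (-b, a),  # rotate by 90 degrees
]


def n_copy_magnitude(a, b):
    # Run the same module on every re-expressed input, then vote.
    results = [round(magnitude(*dra(a, b)), 9) for dra in RE_EXPRESSIONS]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("copies disagree: no majority result")
    return value


protected = n_copy_magnitude(0, 3)  # the original input triggers the fault
print(protected)
```

The fault is only masked because the re-expressed inputs fall outside the region that triggers it; a re-expression that keeps `a == 0` would reproduce the wrong answer.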
Time Redundancy
▪use of additional time
▪used for both hardware and software fault tolerance
▪repeating an execution using the same software and hardware resources
▪is typical of hardware backward recovery
▪has the great advantage of not requiring redundant hardware or software
▪can be used in applications in which time is readily available, such as many human-interactive programs
▪in other applications, the additional time used for re-execution may cause missed deadlines.
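Time redundancy by re-execution can be sketched as a bounded retry loop. The helper `run_with_retries` and the `FlakyRead` class that simulates a transient fault are hypothetical names introduced for illustration.

```python
def run_with_retries(operation, attempts=3):
    """Re-execute the same operation on the same resources until it succeeds."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc  # remember the fault and spend more time
    raise RuntimeError(f"operation failed {attempts} times") from last_error


class FlakyRead:
    """Simulated operation that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise IOError("transient fault")
        return 42


read = FlakyRead(failures=2)
value = run_with_retries(read)
print(value, read.calls)
```

Each retry costs one more full execution, which is acceptable for an interactive program but may break a real-time deadline; a fault that is not transient will simply fail on every attempt.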