Evaluation Techniques
Two approaches
• Qualitative evaluation
– aims to identify, classify and rank the failure
modes, or event combinations that would lead
to system failures
• Quantitative evaluation
– aims to evaluate in terms of probabilities the
attributes of dependability (reliability,
availability, safety)
p. 2 - Design of Fault Tolerant Systems - Elena Dubrova, ESDlab
Common dependability measures
• failure rate
• mean time to failure
• mean time to repair
• mean time between failures
• fault coverage
Failure rate
• failure rate
– expected # of failures per time-unit
– example
• 1000 controllers working at t0
• after 10 hours: 950 working
• failure rate for each controller:
(50 failures / 1000 controllers) / 10 hours = 0.005 failures / hour
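The arithmetic of this example can be sketched in a few lines of Python. The function name `failure_rate` and its parameters are illustrative assumptions, not part of the lecture.

```python
def failure_rate(initial_units, surviving_units, hours):
    """Average number of failures per unit per hour over the observation window."""
    failed = initial_units - surviving_units
    return (failed / initial_units) / hours


# Numbers from the example: 1000 controllers at t0, 950 still working after 10 hours.
rate = failure_rate(1000, 950, 10)
print(f"{rate:.3f} failures/hour")
```

This is only an average over the 10-hour window; it says nothing about how the rate changes with time.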
Failure rate and reliability
Reliability R(t) is the conditional probability that the
system will perform correctly throughout [0,t], given
that it worked at time 0
\[ R(t) = \frac{N_{\mathrm{operating}}(t)}{N_{\mathrm{operating}}(t) + N_{\mathrm{failed}}(t)} \]
Failure rate
• typical evolution of λ(t) for hardware:
[Figure: bathtub curve of λ(t) versus t, with regions I, II and III]
• bathtub: I infant mortality, II useful life, III wear-out
• for the useful life period, λ is constant and the reliability is
given by
\[ R(t) = e^{-\lambda t} \]
Exponential failure law
If λ is constant, R(t) varies exponentially as a function of time:
\[ R(t) = e^{-\lambda t} \]
[Figure: plot of R(t) versus t, vertical axis from 0 to 1]
Time varying failure rate
• Failure rate is not always constant
– software failure rate decreases as package matures
• Weibull distribution:
\[ z(t) = \alpha \lambda (\lambda t)^{\alpha - 1} \]
• if α = 1, then z(t) = constant = λ
if α > 1, then z(t) increases as time increases
if α < 1, then z(t) decreases as time increases
\[ R(t) = e^{-(\lambda t)^{\alpha}} \]
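The three regimes of the Weibull failure rate can be checked numerically. The sketch below assumes the two formulas on this slide; the function names and the sample values of the rate and shape parameters are illustrative assumptions.

```python
import math


def weibull_hazard(t, lam, alpha):
    """Failure rate z(t) = alpha * lam * (lam * t)^(alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)


def weibull_reliability(t, lam, alpha):
    """Reliability R(t) = exp(-(lam * t)^alpha)."""
    return math.exp(-((lam * t) ** alpha))


lam = 0.001  # assumed scale parameter, per hour
for alpha in (0.5, 1.0, 2.0):
    early = weibull_hazard(100.0, lam, alpha)
    late = weibull_hazard(200.0, lam, alpha)
    print(f"alpha={alpha}: z(100)={early:.6f}, z(200)={late:.6f}")
```

With alpha = 1 the hazard is the same at both times and the reliability reduces to the exponential failure law.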
Failure rate calculation
• determined for components
– systems: combination of components
– λ of the system = sum of λ of the components
• determine λ experimentally
– slow
• e.g. 1 failure per 100 000 hours (=11.4 years)
– expensive
• many components required for significance
• use standards for λ
MTTF
• MTTF: mean time to failure
– expected time until the first failure occurs
• If we have a system of N identical
components and we measure the time t_i
before each component fails, then MTTF is
given by
\[ \mathrm{MTTF} = \frac{1}{N} \sum_{i=1}^{N} t_i \]
MTTF
MTTF is defined in terms of reliability as:
\[ \mathrm{MTTF} = \int_0^{\infty} R(t)\, dt \]
If R(t) obeys the exponential failure law, then
MTTF is the inverse of the failure rate:
\[ \mathrm{MTTF} = \int_0^{\infty} e^{-\lambda t}\, dt = \frac{1}{\lambda} \]
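As a sanity check on the integral, the sketch below integrates an exponential reliability function numerically with the trapezoidal rule and compares the result with the inverse of the failure rate. The helper `mttf_numeric`, the finite integration horizon and the sample rate are assumptions made for illustration.

```python
import math


def mttf_numeric(reliability, horizon, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) over [0, horizon]."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt


lam = 0.001  # assumed failure rate, per hour
# The infinite upper limit is truncated at 30/lam, where R(t) is negligible.
estimate = mttf_numeric(lambda t: math.exp(-lam * t), horizon=30 / lam)
print(f"numeric MTTF = {estimate:.3f} h, 1/lam = {1 / lam:.3f} h")
```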
MTTF
[Figure: plot of R(t) = e^{-λt} versus t, vertical axis from 0 to 1, horizontal axis marked at 1/λ, 2/λ and 3/λ]
MTTF
• MTTF is meaningful only for systems which
operate without repair until they experience a
failure
• Most mission-critical systems undergo a
complete check-up before the next mission
– all failed redundant components are replaced
– system is returned to fully operational state
• When evaluating the reliability of such a system,
mission time rather than MTTF is used
MTTR
• MTTR: mean time to repair
– expected time until repaired
• If we have a system of N identical
components and the ith component requires
time t_i to repair, then MTTR is given by
\[ \mathrm{MTTR} = \frac{1}{N} \sum_{i=1}^{N} t_i \]
MTTR
• difficult to calculate
• determined experimentally
• normally specified in terms of the repair rate μ,
which is the average number of repairs that
occur per time period
\[ \mathrm{MTTR} = \frac{1}{\mu} \]
MTTR
• Low MTTR requirement implies high
operational cost
– if hardware spares are kept on site and the site
is maintained 24 hours a day, MTTR = 30 min
– if the site is maintained 8 hours a day, 5 days a week,
MTTR = 3 days
– if the system is remotely located, MTTR = 2 weeks
MTBF
• MTBF: mean time between failures
• functional + repair
• MTBF = MTTF + MTTR
– small time difference: MTBF ≈ MTTF
– conceptual difference
[Figure: timeline alternating MTTF and MTTR periods; MTBF spans the interval from the time of the 1st failure to the time of the 2nd failure]
Fault coverage
• Fault detection coverage is the conditional
probability that, given the existence of a fault, the
fault is detected
• Difficult to calculate
• Usually computed as
\[ C = \frac{\text{number of faults which can be detected}}{\text{total number of faults}} \]
Example
• Suppose your circuit has 10 lines and you
use the single stuck-at fault model
• Then the total number of faults is 20
(stuck-at-0 and stuck-at-1 on each line)
• Suppose you have 1 undetectable fault
• Then the coverage is
\[ C = \frac{19}{20} \]
Dependability modelling
• up to now: λ and R(t) for components
• systems are sets of components
• system evaluation approaches:
– reliability block diagrams
– Markov processes
Serial system
• system functions
if and only if all components function
reliability block diagram
(RBD)
Serial system
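[Figure: RBD with components C1, C2, ..., CN connected in series]
if the Ci are independent:
\[ R_{\mathrm{series}}(t) = \prod_{i=1}^{N} R_i(t) \]
\[ \lambda_{\mathrm{series}} = \sum_{i=1}^{N} \lambda_i \]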
Parallel system
• system works as long as at least one component works
[Figure: RBD with components C1, C2, ..., CN connected in parallel]
Parallel system
unreliability: Q(t) = 1 - R(t)
if the Ci are independent:
\[ Q_{\mathrm{parallel}}(t) = \prod_{i=1}^{N} Q_i(t) \]
\[ R_{\mathrm{parallel}}(t) = 1 - \prod_{i=1}^{N} \left( 1 - R_i(t) \right) \]
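The series and parallel formulas for independent components translate directly into code. This is a minimal sketch; the function names and the sample reliabilities are assumptions chosen for illustration.

```python
import math


def series_reliability(reliabilities):
    """A series system works only if every component works."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """A parallel system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


components = [0.9, 0.9]  # assumed component reliabilities at some fixed time t
print("series:  ", series_reliability(components))
print("parallel:", parallel_reliability(components))
```

Adding components lowers the series reliability and raises the parallel reliability, which is the basic motivation for redundancy.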
Reliability block diagram
• RBD
– may be difficult to build
– equations get complex
– difficult to take coverage into account
– difficult to represent repair
– not possible to represent dependency between
components
Markov chains
• Markov chains
– illustrated by state transition diagrams
• idea:
– states
• components working or not
– state transitions
• when components fail or get repaired
Single-component system, no repair
• Only two states
– one operational (state 1) and one failed (state 2)
– if no repair is allowed, there is a single, non-reversible
transition between the states (used in reliability
analysis)
– label λ corresponds to the failure rate of the component
[State diagram: 1 --λ--> 2]
Single-component system with repair
• If repair is allowed (used in availability
analysis)
– then a transition between the failed and the
operational state is possible
– the label is the repair rate μ
[State diagram: 1 --λ--> 2, 2 --μ--> 1]
Failed-safe and failed-unsafe
• In safety analysis, we need to distinguish between failed-
safe and failed-unsafe states
– let 2 be a failed-safe state and 3 be a failed-unsafe state
– the transition from state 1 to state 2 depends on the failure
rate and on the probability that, if a fault occurs, it is detected
and handled appropriately (i.e. fault coverage C)
– if C is the probability that a fault is detected, 1-C is the
probability that a fault is not detected
[State diagram: 1 --λC--> 2, 1 --λ(1-C)--> 3]
Two-component system
• Has four possible states
component 1   component 2
O             O             state 1
F             O             state 2
O             F             state 3
F             F             state 4
(O = operational, F = failed)
[State diagram: 1 --λ1--> 2 --λ2--> 4 and 1 --λ2--> 3 --λ1--> 4]
• Components are assumed to be independent and non-
repairable
• If components are in series
– state 1 is operational state, states 2,3,4 are failed states
• If components are in parallel
– states 1,2,3 are operational states, state 4 is failed state
State transition diagram
simplification
• Suppose two components are in parallel
• Suppose λ1 = λ2 = λ
• Then, it is not necessary to distinguish between
states 2 and 3
– both represent a condition where one component is
operational and one is failed
– since component failures are independent events, the transition
rate from state 1 to the merged state 2 is the sum of the two
transition rates
[State diagram: 1 --2λ--> 2 --λ--> 3]
Markov chain analysis
• The aim is to compute Pi(t), the probability that
the system is in the state i at time t
• Once Pi(t) is known, the reliability, availability or
safety of the system can be computed as a sum
taken over all operating states
• To compute Pi(t), we derive a set of differential
equations, called state transition equations, one
for each state of the system
Transition matrix
• State transition equations are usually presented
in matrix form
• Transition matrix M has entries m_ij, representing
the rate of transition from state i to state j
– index i gives the column
– index j gives the row
\[ M = \begin{pmatrix} m_{11} & m_{21} \\ m_{12} & m_{22} \end{pmatrix} \]
Single-component system, no repair
[State diagram: 1 --λ--> 2]
• Transition matrix M has the form:
\[ M = \begin{pmatrix} -\lambda & 0 \\ \lambda & 0 \end{pmatrix} \]
• entries in each column must sum to 0
– entries m_ii, corresponding to self-transitions, are
computed as –(sum of the other entries in this column)
Single-component system with repair
[State diagram: 1 --λ--> 2, 2 --μ--> 1]
• Transition matrix M has the form:
\[ M = \begin{pmatrix} -\lambda & \mu \\ \lambda & -\mu \end{pmatrix} \]
Single-component system, safety
analysis
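[State diagram: 1 --λC--> 2, 1 --λ(1-C)--> 3]
• Transition matrix M has the form:
\[ M = \begin{pmatrix} -\lambda & 0 & 0 \\ \lambda C & 0 & 0 \\ \lambda (1 - C) & 0 & 0 \end{pmatrix} \]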
Two-component parallel system
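[State diagram: 1 --2λ--> 2 --λ--> 3]
• Transition matrix M has the form:
\[ M = \begin{pmatrix} -2\lambda & 0 & 0 \\ 2\lambda & -\lambda & 0 \\ 0 & \lambda & 0 \end{pmatrix} \]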
Important properties of matrix M
• Sum of the entries in each column is 0
• Positive sign of an ijth entry indicates that
the transition originates from the ith state
• In reliability analysis, M allows us to
distinguish between the operational and
failed states
– each failed state i has a zero diagonal element
m_ii (a failed state cannot be left)
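These properties are easy to check mechanically. The sketch below builds M from a list of transitions using the column convention of the lecture (source state selects the column, target state selects the row), then verifies the column sums and finds the states with a zero diagonal element. The helper `transition_matrix` and the numeric rate are illustrative assumptions.

```python
def transition_matrix(n_states, rates):
    """Build M from {(source, target): rate}, with 1-based state numbers.

    The rate from state i to state j goes into column i, row j, and is
    subtracted from the diagonal element of column i so the column sums to 0.
    """
    m = [[0.0] * n_states for _ in range(n_states)]
    for (src, dst), rate in rates.items():
        m[dst - 1][src - 1] += rate
        m[src - 1][src - 1] -= rate
    return m


lam = 0.01  # assumed failure rate
# Two-component parallel system with merged states: 1 --2*lam--> 2 --lam--> 3
M = transition_matrix(3, {(1, 2): 2 * lam, (2, 3): lam})
column_sums = [sum(M[row][col] for row in range(3)) for col in range(3)]
absorbing = [i + 1 for i in range(3) if M[i][i] == 0.0]

for row in M:
    print(row)
print("column sums:", column_sums)
print("states that cannot be left:", absorbing)
```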
State transition equations
• Let P(t) be a vector whose ith element is the
probability Pi(t), the probability that the
system is in the state i at time t
• The matrix representation of a system of
state transition equations is given by
\[ \frac{d}{dt} P(t) = M \cdot P(t) \]
Two-component parallel system
• Using transition matrix derived earlier, we get:
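\[ \frac{d}{dt} \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{pmatrix} = \begin{pmatrix} -2\lambda & 0 & 0 \\ 2\lambda & -\lambda & 0 \\ 0 & \lambda & 0 \end{pmatrix} \cdot \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{pmatrix} \]
• This represents the following system of equations
\[ \frac{d}{dt} P_1(t) = -2\lambda P_1(t) \]
\[ \frac{d}{dt} P_2(t) = 2\lambda P_1(t) - \lambda P_2(t) \]
\[ \frac{d}{dt} P_3(t) = \lambda P_2(t) \]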
Solving state transition equations
• By solving these equations, we get
\[ P_1(t) = e^{-2\lambda t} \]
\[ P_2(t) = 2e^{-\lambda t} - 2e^{-2\lambda t} \]
\[ P_3(t) = 1 - 2e^{-\lambda t} + e^{-2\lambda t} \]
• Since the Pi(t) are known, we can compute the reliability of
the system as a sum of probabilities taken over all
operating states
\[ R_{\mathrm{parallel}}(t) = P_1(t) + P_2(t) = 2e^{-\lambda t} - e^{-2\lambda t} \]
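The closed-form solution can be cross-checked by integrating the state transition equations numerically. The sketch below uses a hand-written classical fourth-order Runge-Kutta step in plain Python; the integrator, the step count and the numeric values of the failure rate and mission time are assumptions for illustration, not the method used in the lecture.

```python
import math


def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def integrate(m, p0, t_end, steps=10_000):
    """Integrate dP/dt = M P from 0 to t_end with classical RK4."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = mat_vec(m, p)
        k2 = mat_vec(m, [x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = mat_vec(m, [x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = mat_vec(m, [x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


lam, t = 0.01, 100.0  # assumed failure rate (per hour) and mission time (hours)
M = [[-2 * lam, 0.0, 0.0],
     [2 * lam, -lam, 0.0],
     [0.0, lam, 0.0]]
p1, p2, p3 = integrate(M, [1.0, 0.0, 0.0], t)  # the system starts in state 1
r_numeric = p1 + p2
r_closed = 2 * math.exp(-lam * t) - math.exp(-2 * lam * t)
print(f"numeric R = {r_numeric:.6f}, closed form R = {r_closed:.6f}")
```

The same integrator works for any transition matrix, which is useful when no closed-form solution is at hand.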
Comparison to RBD result
• Since R = e^{-λt}, the previous equation can be
written as
\[ R_{\mathrm{parallel}}(t) = 2R - R^2 \]
• which agrees with the expression derived using
RBD
• the two results are the same because we assumed
that the failures of the two components are
independent
Dependent component case
• The value of Markov chains becomes evident
when component failures cannot be assumed to
be independent
– load-sharing components
– examples: electrical load, mechanical load, information
load
• If two components share the same load and one
fails, the additional load on the second
component increases its failure rate
Parallel system with load sharing
• As before, we have four states, but after the 1st
component failure, the failure rate of the 2nd
component increases
[State diagram: 1 --λ1--> 2 --λ'2--> 4 and 1 --λ2--> 3 --λ'1--> 4]
Parallel system with load sharing
• State transition equations are:
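\[ \frac{d}{dt} \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{pmatrix} = \begin{pmatrix} -\lambda_1 - \lambda_2 & 0 & 0 & 0 \\ \lambda_1 & -\lambda'_2 & 0 & 0 \\ \lambda_2 & 0 & -\lambda'_1 & 0 \\ 0 & \lambda'_2 & \lambda'_1 & 0 \end{pmatrix} \cdot \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{pmatrix} \]
\[ \frac{d}{dt} P_1(t) = (-\lambda_1 - \lambda_2) P_1(t) \]
\[ \frac{d}{dt} P_2(t) = \lambda_1 P_1(t) - \lambda'_2 P_2(t) \]
\[ \frac{d}{dt} P_3(t) = \lambda_2 P_1(t) - \lambda'_1 P_3(t) \]
\[ \frac{d}{dt} P_4(t) = \lambda'_2 P_2(t) + \lambda'_1 P_3(t) \]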
Effect of the load
• If λ'1 = λ1 = λ and λ'2 = λ2 = λ, the equation of the
load-sharing parallel system reduces to the
well-known
\[ R_{\mathrm{parallel}}(t) = 2e^{-\lambda t} - e^{-2\lambda t} \]
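To see the effect of the load numerically, the sketch below evaluates a closed-form solution of the four state transition equations of the load-sharing system. The closed form is derived here by solving the linear equations one at a time and does not appear on the slides; it assumes each loaded rate differs from the sum of the two unloaded rates. The function name and all numeric values are illustrative assumptions.

```python
import math


def load_sharing_reliability(t, lam1, lam2, lam1_loaded, lam2_loaded):
    """R(t) = P1 + P2 + P3 for the load-sharing parallel system.

    Assumes lam1_loaded != lam1 + lam2 and lam2_loaded != lam1 + lam2.
    """
    s = lam1 + lam2
    p1 = math.exp(-s * t)
    # State 2: component 1 failed, component 2 carries the load alone.
    p2 = lam1 / (lam2_loaded - s) * (math.exp(-s * t) - math.exp(-lam2_loaded * t))
    # State 3: component 2 failed, component 1 carries the load alone.
    p3 = lam2 / (lam1_loaded - s) * (math.exp(-s * t) - math.exp(-lam1_loaded * t))
    return p1 + p2 + p3


lam, t = 0.01, 100.0  # assumed unloaded failure rate and mission time
no_extra_load = load_sharing_reliability(t, lam, lam, lam, lam)
tripled_rate = load_sharing_reliability(t, lam, lam, 3 * lam, 3 * lam)
independent = 2 * math.exp(-lam * t) - math.exp(-2 * lam * t)
print(f"no extra load: {no_extra_load:.4f} (independent case: {independent:.4f})")
print(f"loaded rate tripled: {tripled_rate:.4f}")
```

With unchanged rates the result matches the independent parallel system; raising the loaded rates lowers the reliability.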
Availability evaluation
• Difference with reliability analysis:
– in reliability analysis components are allowed to be
repaired as long as the system has not failed
– in availability analysis components can also be
repaired after the system failure
Two-component standby system
• First component is primary
• Second is held in reserve and only brought to
operation if the first component fails
• We assume that
– the fault detection unit, which detects failure of the primary
component and replaces it with the standby, is perfect
– the standby component cannot fail while in standby
mode
State transition diagram for reliability
analysis with repair
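[State diagram: 1 --λ1--> 2 --λ2--> 3, 2 --μ--> 1]
state 1: both OK
state 2: primary failed and replaced by spare
state 3: both failed
Repair replaces a broken component by a working one.
\[ M = \begin{pmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -\lambda_2 - \mu & 0 \\ 0 & \lambda_2 & 0 \end{pmatrix} \]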
State transition diagram for
availability analysis with repair
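[State diagram: 1 --λ1--> 2 --λ2--> 3, 2 --μ--> 1, 3 --μ--> 2]
States are the same. Repair replaces a broken component by a working
one. Here we assume that there is only one repair team.
\[ M = \begin{pmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -\lambda_2 - \mu & \mu \\ 0 & \lambda_2 & -\mu \end{pmatrix} \]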
State transition diagram for
availability analysis with repair
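[State diagram: 1 --λ1--> 2 --λ2--> 3, 2 --μ--> 1, 3 --2μ--> 2]
If we assume that there are two independent repair teams, then μ on
the edge from 3 to 2 gets the coefficient 2 (the rate doubles).
\[ M = \begin{pmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -\lambda_2 - \mu & 2\mu \\ 0 & \lambda_2 & -2\mu \end{pmatrix} \]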
Availability analysis
• None of the diagonal elements of M are 0
• By solving the system, we can get Pi(t) and compute the
availability as a sum of probabilities taken over all
operating states
• Usually the steady-state availability rather than the
time-dependent one is of interest
• As time approaches infinity, the derivative on the left-hand
side of the equation d/dt P(t) = M • P(t) vanishes and
we get a time-independent relationship
\[ M \cdot P(\infty) = 0 \]
Two-component standby system
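• Using the transition matrix derived earlier for a single repair team,
we get the following system of equations
\[ -\lambda_1 P_1(\infty) + \mu P_2(\infty) = 0 \]
\[ \lambda_1 P_1(\infty) - (\lambda_2 + \mu) P_2(\infty) + \mu P_3(\infty) = 0 \]
\[ \lambda_2 P_2(\infty) - \mu P_3(\infty) = 0 \]
• By solving the equations together with P1(∞) + P2(∞) + P3(∞) = 1,
and assuming λ1 = λ2 = λ with λ much smaller than μ, we get
\[ A(\infty) \approx 1 - (\lambda/\mu)^2 \]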
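The steady-state probabilities follow from the first and third balance equations plus the requirement that the probabilities sum to 1. The sketch below works this out for the single-repair-team case and compares the exact availability with the approximation; the function name and the numeric rates are assumptions for illustration.

```python
def standby_steady_state(lam1, lam2, mu):
    """Steady-state probabilities (P1, P2, P3) for the standby system
    with one repair team."""
    ratio2 = lam1 / mu           # P2 / P1, from the first balance equation
    ratio3 = ratio2 * lam2 / mu  # P3 / P1, from the third balance equation
    p1 = 1.0 / (1.0 + ratio2 + ratio3)  # probabilities must sum to 1
    return p1, ratio2 * p1, ratio3 * p1


lam, mu = 0.001, 0.1  # assumed failure and repair rates, per hour
p1, p2, p3 = standby_steady_state(lam, lam, mu)
availability = p1 + p2  # states 1 and 2 are operational
approximation = 1.0 - (lam / mu) ** 2
print(f"exact A = {availability:.8f}, approximation = {approximation:.8f}")
```

The approximation is good only when the repair rate is much larger than the failure rate, as in this choice of numbers.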
Safety evaluation
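[State diagram: 1 --λC--> 2, 1 --λ(1-C)--> 3]
• The state transition equations are:
\[ \frac{d}{dt} \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{pmatrix} = \begin{pmatrix} -\lambda & 0 & 0 \\ \lambda C & 0 & 0 \\ \lambda (1 - C) & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{pmatrix} \]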
Safety evaluation
• By solving these equations, we get
\[ P_1(t) = e^{-\lambda t} \]
\[ P_2(t) = C (1 - e^{-\lambda t}) \]
\[ P_3(t) = (1 - C) - (1 - C) e^{-\lambda t} \]
• Since the Pi(t) are known, we can compute the safety of
the system as a sum of the probabilities of being in the
operational and fail-safe states
\[ S(t) = P_1(t) + P_2(t) = C + (1 - C) e^{-\lambda t} \]
• At time t=0, the safety is 1. As time approaches infinity,
the safety approaches C
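The two limits can be confirmed with a short calculation. This sketch evaluates the sum of the operational and fail-safe state probabilities; the function name and the sample failure rate and coverage are assumptions for illustration.

```python
import math


def safety(t, lam, coverage):
    """Probability of being in the operational or the fail-safe state at time t."""
    p_operational = math.exp(-lam * t)
    p_fail_safe = coverage * (1.0 - p_operational)
    return p_operational + p_fail_safe


lam, coverage = 0.001, 0.95  # assumed failure rate (per hour) and fault coverage
for hours in (0.0, 1_000.0, 100_000.0):
    print(f"t = {hours:>9.0f} h: safety = {safety(hours, lam, coverage):.4f}")
```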
How to deal with cases of systems
with “k out of n choices”
• Suppose we want to solve the following task:
What is the probability that more than two engines in a
4-engine airplane will fail during a t-hour flight if the
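failure rate of a single engine is λ per hour?
• The probability that more than two engines fail can
be expressed as:
\[ P_{>2\ \text{failed}} = \binom{4}{1} P_{1\ \text{works},\ 3\ \text{failed}} + P_{4\ \text{failed}} \]
\[ = 1 - \left( P_{4\ \text{work}} + \binom{4}{3} P_{3\ \text{work},\ 1\ \text{failed}} + \binom{4}{2} P_{2\ \text{work},\ 2\ \text{failed}} \right) \]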
• Only probabilities of mutually exclusive events can
be summed up like this
“k out of n choices”
• “k out of n choices” can be computed as
\[ \binom{n}{k} = \frac{n!}{(n-k)!\, k!} \]
• For example
\[ \binom{4}{2} = \frac{4!}{(4-2)!\, 2!} = 6 \]
Example cont.
So, we get
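\[ P_{>2\ \text{failed}} = 4\, P_{1\ \text{works},\ 3\ \text{failed}} + P_{4\ \text{failed}} \]
where
\[ P_{1\ \text{works},\ 3\ \text{failed}} = R (1 - R)^3 \]
\[ P_{4\ \text{failed}} = (1 - R)^4 \]
where R is the reliability of a single engine
computed as R = e^{-λt}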
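The engine example can be evaluated with the standard-library binomial coefficient. The sketch below assumes independent, identical engines; the function names and the numeric failure rate and flight time are illustrative assumptions.

```python
import math


def prob_exactly_failed(n, k, r):
    """Probability that exactly k of n independent components have failed,
    where r is the reliability of a single component."""
    return math.comb(n, k) * (1 - r) ** k * r ** (n - k)


def prob_more_than_two_failed(lam, t, n=4):
    """Probability that 3 or more of n engines fail during a t-hour flight."""
    r = math.exp(-lam * t)
    return sum(prob_exactly_failed(n, k, r) for k in range(3, n + 1))


lam, t = 1e-4, 10.0  # assumed failure rate per hour and flight time in hours
p = prob_more_than_two_failed(lam, t)
print(f"P(more than two engines fail) = {p:.3e}")
```

Summing over k = 0, 1, 2 and subtracting from 1 gives the same value, which matches the complementary form of the expression.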
Summary
• Methods for evaluating the reliability,
availability and safety of a system
– RBDs
– Markov chains
Next lecture
• Hardware redundancy
Read chapter 4
of the textbook