Basic Computer System Design
System Reliability
Muhammad Tariq Mahmood
[email protected]
School of Computer Science and Engineering
Quality is never an accident. It is always the result of intelligent effort. --John Ruskin
Note: These notes are prepared from the following resources.
I Ford, Ralph, and Chris Coulston. Design for Electrical and Computer Engineers. McGraw-Hill Education, 2008. (main text)
I Wiegers, Karl, and Joy Beatty. Software Requirements. 3rd ed., Pearson Education, 2013.
I Kendall, Kenneth E., and Julie E. Kendall. Systems Analysis and Design. 10th ed., Pearson Education, 2019.
Contents

1 System Reliability
2 Probability Theory Review
   Probability Density Functions
   Common Probability Density Functions
3 Reliability Prediction
   Mean Time to Failure
   Limits of Reliability Estimation
4 System Reliability
   Series Systems
   Parallel Systems
   Combination Systems

Learning Objectives

I Have a familiarity with the basic principles of probability and understand how they apply to reliability theory.
I Understand the mathematical definition and meaning of failure rate, reliability, and mean time to failure.
I Understand how to determine the reliability of a component.
I Understand how to derate the power of electronic components for use under different operating temperatures.
I Understand how to determine the reliability of different system configurations.
System Reliability
System Reliability
I In industry, engineers develop systems that are used by the public at large, and issues beyond the
functionality, such as reliability, safety, and maintainability, become important factors in the success of the
design.
I Industry has made a great shift to address reliability through the adoption of processes such as quality function deployment (QFD), Six Sigma, and robust design.
I Reliability attempts to answer the question of how long a system will operate without failing.
I Answering this question has inherent uncertainty and requires the use of probability and statistics.
I This chapter presents a review of basic probability theory and applies it to estimate the behavior of
real-world devices. Reliability at the component and system levels is considered.
Probability Theory Review
Probability Theory Review
I Probability theory provides a formal framework to study chance events. It is a powerful tool for modeling
engineering systems and is a requisite for reliability estimation.
I An experiment is the process of measuring or quantifying the state of the world.
I The particular outcome of an experiment is an event (e).
I In a discrete event space, the union of all the possible experimental outcomes defines the event space. If eᵢ is the ith event in a discrete event space, then the event space is given by the union

E = e₁ ∪ e₂ ∪ ... = ∪ᵢ eᵢ

I The event space (E) is the set of all possible outcomes of the experiment.
I Examples
• The experiment is rolling a die and observing the outcome, the event is the particular outcome observed, and the event space for the experiment is the set E = {1, 2, 3, 4, 5, 6}.
• Another experiment is tossing a coin, in which case the event space is E = {heads, tails}.
I The probability of an event indicates how likely it is for an event to occur.
Probability Theory Review (cont...)
I The probability is the percentage of times that an event would occur if the experiment were repeated an
infinite number of times (the law of large numbers).
I Two of the three fundamental axioms on which probability theory is built are
P(eᵢ) ≥ 0
P(E) = 1
I Consider an experiment where the objective is to measure temperature. Clearly, such a measurement
requires a variable having a continuous range of possible values.
I A random variable is defined as the outcome of an experiment that has a continuum of possible values.
Probability Density Functions
Probability Density Functions
I Random variables have a mathematical function known as the probability density function (PDF) associated
with them, which when integrated, yields the probability of a range of events.
I A PDF is typically denoted as p_X(x), where the subscript identifies the random variable X and x ranges over the event space.
I Consider the case where the objective is to determine the probability that the random variable X lies between two values a and b.
I Written with the probability operator, this is indicated as P(a ≤ X ≤ b). It is determined from the PDF as follows:

P(a ≤ X ≤ b) = ∫_a^b p_X(x) dx

Figure 1: A probability density function. The area under the curve represents the probability that the random variable X lies in the interval [a, b].

I Conceptually, this probability represents the area under the PDF between the two limits of integration, as shown in Figure 1. A numerical sketch of this idea follows below.
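A small numerical sketch (my own, not from the text) makes the area interpretation concrete: the probability P(a ≤ X ≤ b) is approximated by summing thin rectangles under an assumed density. The uniform density on [0, 10] used here is purely an illustrative choice.

# Approximate P(a <= X <= b) as the area under a PDF using a midpoint rule.
# The density below (uniform on [0, 10]) is an illustrative assumption.

def p_X(x):
    return 0.1 if 0.0 <= x <= 10.0 else 0.0   # uniform density, height 1/(b - a)

def prob_between(pdf, a, b, steps=10_000):
    dx = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * dx) * dx for i in range(steps))

print(prob_between(p_X, 2.0, 5.0))   # ~0.3, i.e. 30% of the area lies in [2, 5]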
Probability Density Functions (cont...)
I Important properties of probability density functions
1. The probability of the event space occurring is equal to one. This is known as the normalization property, and it
is expressed as
∫_{−∞}^{∞} p_X(x) dx = 1
2. Another interesting result is obtained by trying to determine the probability that a random variable takes on an
exact value, for example P(X = a).
P(X = a) = ∫_a^a p_X(x) dx = 0
I Example: Consider an experiment where the objective is to measure a voltage value for a random variable V. Now consider the question, "What is the probability that the result of a voltage measurement equals π (the irrational number) volts?" The probability of hitting π exactly is zero, but the probability of landing within a small window Δv above π is

P(π < V < π + Δv) = ∫_π^{π+Δv} p_V(v) dv ≈ p_V(π) Δv
Probability Density Functions (cont...)
Mean and Variance
I Two useful and well-known statistics that are determined from the PDF are the mean μ and variance σ². They are found from the PDF as follows:

μ_X = ∫_{−∞}^{∞} x p_X(x) dx

σ_X² = ∫_{−∞}^{∞} (x − μ_X)² p_X(x) dx
I The mean is analogous to the center of mass of the PDF; it is also known as the average value.
I The variance is the average squared difference between the values of the random variable and the mean, where the squared term ensures that a positive difference is taken.
I The square root of the variance is known as the standard deviation σ:

σ_X = √( ∫_{−∞}^{∞} (x − μ_X)² p_X(x) dx )
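As a rough check of these definitions (a sketch, not part of the text), the integrals for μ and σ² can be approximated numerically for an assumed density; for an exponential density with λ = 2 the results should approach μ = 1/λ = 0.5 and σ² = 1/λ² = 0.25.

import math

lam = 2.0                                     # assumed rate for the illustration
pdf = lambda x: lam * math.exp(-lam * x)      # exponential density

def integrate(f, lo, hi, steps=200_000):      # simple midpoint rule
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) * dx for i in range(steps))

mu  = integrate(lambda x: x * pdf(x), 0.0, 50.0)
var = integrate(lambda x: (x - mu) ** 2 * pdf(x), 0.0, 50.0)
print(mu, var, math.sqrt(var))                # ~0.5, ~0.25, ~0.5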
Common Probability Density Functions
I Cumulative Distribution Functions: An important class of questions can be phrased as, "What is the
probability that a random variable X is less than value a?"
I For example, the objective might be to determine the probability that an electronic component will
malfunction within 2 years.
I Returning to the first question, it is clear that the goal is to determine the probability P(X < a), which is
found by integrating the PDF.
I This result is generalized by allowing the upper limit of integration to take on an arbitrary value that spans
the range of the random variable.
I This produces a new function, known as the cumulative distribution function (CDF), which is the integral
function of the PDF and is defined as
CDF(x) ≡ ∫_{−∞}^{x} p_X(y) dy
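As an illustration (my own sketch, with the exponential density chosen only as an example), the CDF can be built up as a running integral of the PDF and compared against the known closed form F(x) = 1 − e^(−λx).

import math

lam = 0.5                                   # assumed rate for the illustration

def pdf(x):
    return lam * math.exp(-lam * x)

def cdf(x, steps=10_000):                   # CDF(x) = integral of the PDF up to x
    if x <= 0:
        return 0.0
    dx = x / steps
    return sum(pdf((i + 0.5) * dx) * dx for i in range(steps))

for x in (1.0, 2.0, 5.0):
    print(x, cdf(x), 1.0 - math.exp(-lam * x))   # numerical vs. closed form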
Common Probability Density Functions (cont...)
Probability Density Functions:

1. The Normal Density
• The most common density function encountered in the physical sciences and engineering is the normal density.
• The normal density is defined as

p_X(x) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

• Varying μ shifts the overall function along the x axis, while increasing σ spreads (or flattens) the function out.

Figure 2: A normal density function with the mean μ and standard deviation σ shown.

2. The Uniform Density
• The uniform density models the outcome of an experiment where all outcomes are equally likely.
• Mathematically, the PDF for a uniform density is given by

p_X(x) = 1 / (b − a), a ≤ x ≤ b

where a and b are selected to meet the demands of a particular problem.

Figure 3: The uniform density on the interval [a, b].
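The two densities can be written directly as small Python functions (a sketch, not from the text); checking that each integrates to roughly one is a quick way to confirm the formulas. The parameter values μ = 0, σ = 1 and [a, b] = [2, 6] are arbitrary choices for the illustration.

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def uniform_pdf(x, a=2.0, b=6.0):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, steps=100_000):
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) * dx for i in range(steps))

print(integrate(normal_pdf, -10.0, 10.0))   # ~1.0 (normalization property)
print(integrate(uniform_pdf, 0.0, 10.0))    # ~1.0 (normalization property)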
Common Probability Density Functions (cont...)
3. The Exponential Density
• Exponential densities are often utilized to model time-dependent quantities, such as inter-arrival times between data packets in communication systems.
• The exponential density also describes the behavior of component failures as a function of time.
• The mathematical description of an exponential density is

p_X(x) = λ e^{−λx}, λ ≥ 0, x ≥ 0

• The PDF is characterized by the parameter λ, which affects the shape of the curve.

Figure 4: The exponential density for two different λ values.
Reliability Prediction
Reliability Prediction
I The following is a formal mathematical definition of reliability.
Definition: Reliability, R(t), is the probability that a device is functioning properly (has not failed) at time t.
I The failure rate
1. The failure rate, λ(t), of a device is the expected number of failures per unit time. The failure rate is measured
by operating a batch of devices for a given time interval and noting how many fail during that interval.
2. A typical graph of failure rate versus time has the bathtub shape shown in Figure 5. The high initial failure rate is a result of manufacturing defects and is often referred to as infant mortality.
Figure 5: Failure rate as a function of time, also known as the bathtub curve.
Reliability Prediction (cont...)
3. After the infant mortality phase, devices enter a phase of constant failure rate, where λ(t) = λ, known as the service life.
4. Estimates for λ are determined empirically by testing a large number of components.
5. After some period of time, devices start to wear out and the failure rate increases. This usually happens as a
result of mechanical wearing with age and use.
6. Properly designed electronic devices will not have a wear-out region, instead continuing on at a constant failure
rate.
I The failure time
• A PDF for the failure time of the device, fT (t), is defined, where the random variable is time T .
• This function allows the question to be asked "What is the probability that a device will fail between time t1 and
t2 ?"
• A CDF for fT (t), is determined as
F(t) = ∫_0^t f_T(τ) dτ
• The failure rate tells us the average rate that a collection of identical devices will fail at a given time t, while
fT (t) is a PDF used to determine the probability that a given device will fail within a specified time period.
Reliability Prediction (cont...)
• F (t) answers the question "What is the probability that the device has failed by time t?" and it is also known as
the failure function.
• The relationship between system reliability R(t) and F (t) is given as
R(t) = 1 − F (t)
Figure 6: Example reliability and failure functions.
I Since λ(t) represents data that is measured empirically, it is useful to establish a relationship between λ(t)
and the ultimate goal of reliability R(t).
Reliability Prediction (cont...)
I A relationship between λ(t), R(t), and f_T(t) is established as follows. Consider a small period of time between t and t + Δt, and determine the probability of device failure during this period.

P(failure between t and t + Δt) ≈ f_T(t) Δt

or, conditioning on the device still operating at time t,

P(failure between t and t + Δt) ≈ R(t) λ(t) Δt

I The desired relationship between the three quantities is

f_T(t) = R(t) λ(t)

I The PDF f_T(t) is eliminated as follows:

f_T(t) = (d/dt) F(t) = (d/dt)[1 − R(t)] = −(d/dt) R(t)

−(d/dt) R(t) = R(t) λ(t)

⇒ λ(t) = −(dR(t)/dt) / R(t)
Reliability Prediction (cont...)
I Integrating both sides gives

∫_0^t λ(τ) dτ = ∫_0^t [ −(dR(τ)/dτ) / R(τ) ] dτ   ⇒   −ln(R(t)) = ∫_0^t λ(τ) dτ

I The final result for reliability as a function of λ(t) is

R(t) = exp[ −∫_0^t λ(τ) dτ ]

I During the service life the failure rate is constant, so the formula simplifies to

R(t) = exp(−λt)
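A brief sketch (my own, not from the text) of both forms: the general expression integrates an assumed time-varying λ(τ) numerically, and it reduces to exp(−λt) when the rate is held constant. The constant rate of one failure per million hours matches the transistor example on the next slide; the increasing rate is an arbitrary illustration of wear-out.

import math

def reliability(failure_rate, t, steps=10_000):
    # R(t) = exp(-integral of lambda(tau) from 0 to t), midpoint rule
    dt = t / steps
    total = sum(failure_rate((i + 0.5) * dt) * dt for i in range(steps))
    return math.exp(-total)

lam = 1e-6                                           # constant rate, failures per hour
t = 5 * 365 * 24                                     # five years in hours
print(reliability(lambda tau: lam, t))               # ~0.957, matches exp(-lam * t)
print(math.exp(-lam * t))

# An assumed rate that grows with time (wear-out) gives a lower reliability:
print(reliability(lambda tau: lam * (1 + tau / t), t))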
Reliability Prediction (cont...)
I Example: Transistor Reliability
• Problem: Consider a transistor with a constant failure rate of λ = 1/10⁶ hours. What is the probability that the transistor will be operable in 5 years?
• Solution: This solution is found using the reliability function for a constant failure rate as follows.

R(t) = exp(−λt)

R(5 years) = exp[ −(1/10⁶ hours) × (24 hours/day) × (365 days/year) × 5 years ]
           = exp(−0.0438) = 0.957 = 95.7%
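The arithmetic in this example can be verified in a couple of lines (a sketch, assuming 365-day years as above):

import math

lam = 1.0 / 1e6                       # failures per hour
hours = 24 * 365 * 5                  # five years of operation
print(math.exp(-lam * hours))         # ~0.957, i.e. about 95.7%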
Mean Time to Failure
Mean Time to Failure
I The mean time to failure (MTTF) is a quantity that answers the question, "On average how long does it
take for a device to fail?"
I From its definition, it is apparent that the MTTF is the mean value of the random variable T (failure time). It is determined from the PDF and the definition of the mean as follows:

MTTF = ∫_0^∞ t f_T(t) dt
I Assuming the form of R(t) for a constant failure rate gives

f_T(t) = −(d/dt) R(t) = λ e^{−λt}
I This means that under the condition of a constant failure rate, the failure PDF follows an exponential density. The MTTF is found from f_T(t) via integration by parts to be

MTTF = ∫_0^∞ t λ e^{−λt} dt = 1/λ
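A quick numerical check (my own sketch) that the mean of t·λe^(−λt) is indeed 1/λ, using an assumed λ and a finite upper limit large enough for the tail to be negligible:

import math

lam = 0.25                                        # assumed failure rate
f_T = lambda t: lam * math.exp(-lam * t)          # failure-time PDF

def integrate(f, lo, hi, steps=200_000):
    dt = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dt) * dt for i in range(steps))

mttf = integrate(lambda t: t * f_T(t), 0.0, 200.0)
print(mttf, 1.0 / lam)                            # both ~4.0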
Mean Time to Failure (cont...)
I Example: Transistor MTTF
• Consider a transistor with a constant failure rate of λ = 1/10⁶ hours. Determine
(a) the MTTF,
(b) the reliability at the MTTF.
• Solution (a):

MTTF = 1/λ = 1/(1/10⁶ hours) = 10⁶ hours ≈ 114 years

• Solution (b):

R(t) = exp(−λt)

R(114 years) = exp( −10⁶ hours / 10⁶ hours ) = exp(−1) = 0.368 = 36.8%
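Both parts can be reproduced numerically (a sketch using the same assumptions as the example):

import math

lam = 1.0 / 1e6                           # failures per hour
mttf_hours = 1.0 / lam
print(mttf_hours / (24 * 365))            # ~114 years
print(math.exp(-lam * mttf_hours))        # exp(-1) ~ 0.368 at the MTTF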
Mean Time to Failure (cont...)
I Example: Human lifespan estimation
• Problem: Data shows that for a 30-year-old population, the failure (death) rate is constant at approximately 1.1 deaths per 1000 people per year. Given this data, estimate the MTTF of humans.
• Solution: In order to find the MTTF, λ is needed. From the information given it is

λ = (1.1/1000 failures) / (1 year) = 1.1 failures / 10³ years ≈ 1 failure / 909 years

MTTF = 1/λ = 909 years
Limits of Reliability Estimation
Limits of Reliability Estimation
I It must be kept in mind that the reliability estimates are just that, estimates, and there are limitations in
their use.
I First, realize that the failure rate data comes from accelerated stress tests, where devices are put under
stress beyond normal operating conditions, and from these the failure rates are estimated.
I Second, there are other factors that influence reliability that are not addressed by λ, such as the
manufacturing processes used, the quality of manufacturing technologies, shock, and corrosion.
I Part of the value of reliability estimation is for comparative purposes in evaluating different design options.
I Applying these methods forces the designer to consider the operating conditions and factor them into the
design.
Series Systems
Series Systems
I Failure of any one component in a circuit would lead to the failure of the overall system or circuit.
I Conceptually, a system in which the failure of a single component (or subsystem) leads to failure of the
overall system is known as a series system
Figure 7: A series system consisting of components, or subsystems, S1, S2, ..., Sn.
I To compute the overall reliability of a series system, R(t), it is assumed that the failure of subsystems or
components are independent events.
Series Systems (cont...)
I The system is operable only if subsystems S1 and S2 ... and Sn are all simultaneously operating.
R_s(t) = R_1(t) R_2(t) ... R_n(t) = ∏_{i=1}^{n} R_i(t)

       = e^{−λ_1 t} e^{−λ_2 t} ... e^{−λ_n t} = exp( −∑_{i=1}^{n} λ_i t )

I This leads to a series system failure rate and MTTF of

λ_s = ∑_{i=1}^{n} λ_i

MTTF_s = 1/λ_s
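The product form lends itself to a short helper (a sketch, not from the text); with three assumed component failure rates it shows how series reliability is driven by the sum of the individual rates, and that the series MTTF is the reciprocal of that sum.

import math

def series_reliability(rates, t):
    # R_s(t) = product of exp(-lambda_i * t) = exp(-sum(lambda_i) * t)
    return math.exp(-sum(rates) * t)

rates = [1e-6, 2e-6, 5e-7]                    # assumed failure rates (per hour)
t = 10 * 365 * 24                             # ten years in hours
print(series_reliability(rates, t))           # ~0.74
print(1.0 / sum(rates) / (24 * 365))          # MTTF_s = 1/lambda_s, in years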
Parallel Systems
Parallel Systems
I It is clear that as more components are added to a series system, the reliability decreases. Can anything be done to improve it?
I The use of redundancy gives us a method to answer this question in the affirmative.
I A design has redundancy if it contains multiple modules performing the same function where a single module would suffice.
I By its very nature, redundancy allows improperly functioning modules to be switched out of the system without affecting its behavior.
I With redundancy, the overall system functions correctly when any one of the submodules is functioning.

Figure 8: A parallel, or redundant, system consisting of subsystems S1, S2, ..., Sn.
Parallel Systems (cont...)
I In order to compute the reliability of a parallel system, note that a parallel system functions correctly when S1 is functioning correctly, or S2 is functioning correctly, ... or Sn is functioning correctly.
I It would be nice if it were possible to write an equation stating that R_S(t) = R_1(t) + R_2(t) + ... + R_n(t), where + is the logical OR operator. Unfortunately, there is no direct way to realize the OR operation in probability theory.
I This is resolved by working with the failure function F(t) instead.
I The probability that the system will fail by time t, F_s(t), is equal to the probability that subsystem S1 will fail and S2 will fail and ... and Sn will fail.
I This probability is expressed mathematically as
F_s(t) = F_1(t) F_2(t) ... F_n(t) = ∏_{i=1}^{n} F_i(t) = ∏_{i=1}^{n} [1 − R_i(t)]

R_s(t) = 1 − F_s(t) = 1 − ∏_{i=1}^{n} [1 − R_i(t)]
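The same idea in code (a sketch with assumed per-subsystem reliabilities): the system failure probability is the product of the individual failure probabilities, and adding redundant modules drives it toward zero.

def parallel_reliability(reliabilities):
    # R_s = 1 - product of (1 - R_i)
    f_s = 1.0
    for r in reliabilities:
        f_s *= (1.0 - r)
    return 1.0 - f_s

print(parallel_reliability([0.90, 0.90]))        # 0.99
print(parallel_reliability([0.90, 0.90, 0.90]))  # 0.999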
Parallel Systems (cont...)
I Example: Reliability of a Redundant Array of Independent Disks (RAID).
• Problem: In a RAID, multiple hard drives are used to store the same data, thus achieving redundancy and
increased reliability. One or more of the disks in the system can fail and the data can still be recovered. However,
if all disks fail, then the data is lost. For this problem, assume that the individual disk drives have a failure rate
of λ = 10 failures/10⁶ hours. How many disks must the system have to achieve a reliability of 98% in 10 years?
• Solution: Since all of the disks are identical, the expression simplifies to

R_s(t) = 1 − [1 − R_i(t)]^n

0.98 ≤ 1 − [ 1 − exp( −(10/10⁶ hours) × (24 hours/day) × (365 days/year) × 10 years ) ]^n

0.98 ≤ 1 − (1 − 0.42)^n   ⇒   (0.58)^n ≤ 0.02

n log(0.58) ≤ log(0.02)   ⇒   n ≥ 7.2

Thus at least 8 disks are required.
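The smallest integer n can also be found by direct search (a sketch reusing the example's numbers), which confirms that 8 disks meet the 98% target:

import math

lam = 10.0 / 1e6                          # failures per hour, per disk
t = 10 * 365 * 24                         # ten years in hours
r_disk = math.exp(-lam * t)               # ~0.42 reliability of one disk

n = 1
while 1.0 - (1.0 - r_disk) ** n < 0.98:
    n += 1
print(n)                                  # 8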
Combination Systems
Combination Systems
I Many real systems do not fit neatly into either the parallel or series reliability models. Rather, they may be a combination of the two, and such systems will be referred to as combination systems.
I One way to determine the reliability of a combination system is to utilize the results obtained for series and parallel systems.
I The system network is reduced by combining parallel subsystems into a single block, while series subsystems are reduced to a single block.
I This is conceptually analogous to combining series and parallel resistances in electrical circuits.
I The network is continually reduced until only a single block remains, whose reliability is known from all of the subsystem combinations.

Figure 9: A combination series-parallel system. S2, S3, and S4 form a redundant parallel group in series with S1.

R_{S2,S3,S4} = 1 − (1 − R_2(t))(1 − R_3(t))(1 − R_4(t))

R_s = R_1(t) × [1 − (1 − R_2(t))(1 − R_3(t))(1 − R_4(t))]
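Two small helpers make this reduction mechanical (a sketch, with the subsystem reliabilities below chosen arbitrarily for the network of Figure 9):

def parallel(*rs):
    # Reduce a redundant group: 1 - product of (1 - R_i)
    f = 1.0
    for r in rs:
        f *= (1.0 - r)
    return 1.0 - f

def series(*rs):
    # Reduce a series chain: product of R_i
    p = 1.0
    for r in rs:
        p *= r
    return p

R1, R2, R3, R4 = 0.95, 0.80, 0.80, 0.80     # assumed reliabilities at a fixed time t
print(series(R1, parallel(R2, R3, R4)))     # R1 in series with the parallel group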
Combination Systems (cont...)
I Example: Combination System Reliability
• Problem: Consider the system shown below with the following reliabilities at a fixed time t: R1 = R2 = 80%. Determine the reliability that subsystems R3 and R4 must have so that the overall system reliability is greater than 95%.
• Solution: The parallel subsystems can be combined into single blocks whose reliabilities are

R_{S1,S2} = 1 − (1 − R_1)(1 − R_2)

R_{S3,S4} = 1 − (1 − R_3)(1 − R_4)

R_s = [1 − (1 − R_1)(1 − R_2)] × [1 − (1 − R_3)(1 − R_4)]

• Substituting values and assuming R3 = R4 gives

0.95 = [1 − (0.2)²] × [1 − (1 − R_{3,4})²]

R_3 = R_4 ≈ 0.90
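The required value of R3 = R4 can also be found numerically (a sketch reproducing the example) with a simple sweep over candidate values:

# Sweep candidate values of R3 = R4 until the overall reliability reaches 95%.
R1 = R2 = 0.80
group12 = 1.0 - (1.0 - R1) * (1.0 - R2)      # = 0.96

r = 0.0
while (group12 * (1.0 - (1.0 - r) ** 2)) < 0.95:
    r += 0.0001
print(round(r, 3))                            # ~0.898, i.e. about 90%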