Life Cycle Reliability Engineering
Guangbin Yang
Ford Motor Company
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests
to the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services or for technical support, please contact
our Customer Care Department within the United States at (800) 762-2974, outside the United
States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print
may not be available in electronic formats. For more information about Wiley products, visit our
web site at www.wiley.com.
To Ling, Benjamin, and Laurence
CONTENTS
Preface xi
1 Reliability Engineering and Product Life Cycle 1
1.1 Reliability Engineering, 1
1.2 Product Life Cycle, 2
1.3 Integration of Reliability Engineering into
the Product Life Cycle, 5
1.4 Reliability in the Concurrent Product Realization Process, 6
Problems, 7
Index 511
PREFACE
Chapter 1 introduces the concepts of reliability engineering and product life cycle and the integration
of reliability techniques into each stage of the life cycle. Chapter 2 delineates
the reliability definition, metrics, and product life distributions. In Chapter 3 we
present techniques for analyzing customer expectations, establishing reliability
targets, and developing effective reliability programs to be executed throughout
the product life cycle. Chapter 4 covers methodologies and practical applications
of system reliability modeling and allocation. Confidence intervals for system
reliability are also addressed in this chapter. Chapter 5 is one of the most impor-
tant chapters in the book, presenting robust reliability design techniques aimed at
building reliability and robustness into products in the design and development
phase. In Chapter 6 we describe reliability tools used to detect, assess, and erad-
icate design mistakes. Chapter 7 covers accelerated life test methods, models,
plans, and data analysis techniques illustrated with many industrial examples. In
Chapter 8 we discuss degradation testing and data analysis methods that cover
both destructive and nondestructive inspections. In Chapter 9 we present relia-
bility techniques for design verification and process validation and in Chapter 10
address stress screening topics and describe more advanced methods for degra-
dation screening. The last chapter, Chapter 11, is dedicated to warranty analysis,
which is important for manufacturers in estimating field reliability and warranty
repairs and costs.
The book has the following distinct features:
The book is designed to serve engineers working in the field of reliability and
quality for the development of effective reliability programs and implementation
of programs throughout the product life cycle. It can be used as a textbook for
students in industrial engineering departments or reliability programs but will
also be useful for industry seminars or training courses in reliability planning,
design, testing, screening, and warranty analysis. In all cases, readers need to
know basic statistics.
I am indebted to a number of people who contributed to the book. Mr. Z.
Zaghati, Ford Motor Company, encouraged, stimulated, and assisted me in com-
pleting the book. I am most grateful to him for his continuing support.
I would like to specially acknowledge Dr. Wayne Nelson, a consultant and
teacher in reliability and statistics. He gave me detailed feedback on parts of this
book. In addition, Dr. Nelson generously shared with me some of his unpublished
thoughts and his effective and valuable book-writing skills.
A number of people provided helpful suggestions and comments on parts of
the book. In particular, I am pleased to acknowledge Prof. Thad Regulinski,
University of Arizona; Dr. Joel Nachlas, Virginia Polytechnic Institute and State
University; Dr. Ming-Wei Lu, DaimlerChrysler Corporation; Prof. Fabrice Guerin
and Prof. Abdessamad Kobi, University of Angers, France; and Dr. Loon-Ching
Tang, National University of Singapore. I am also grateful for contributions from
Dr. Vasiliy Krivtsov, Ford Motor Company. Over the years I have benefited from
numerous technical discussions with him.
I would also like to thank Prof. Hoang Pham, Rutgers University; Dr. Greg
Hobbs, Hobbs Engineering Corporation; and Prof. Dimitri Kececioglu, University
of Arizona, who all generously reviewed parts of the manuscript and offered
comments.
Finally, I would like to express my deep appreciation and gratitude to my
wife, Ling, and sons Benjamin and Laurence. Their support was essential to the
successful completion of the book.
GUANGBIN YANG
Dearborn, Michigan
May 2006
1
RELIABILITY ENGINEERING
AND PRODUCT LIFE CYCLE
Reliability has a broad meaning in our daily life. In technical terms, reliability is
defined as the probability that a product performs its intended function without
failure under specified conditions for a specified period of time. The definition,
which is elaborated on in Chapter 2, contains three important elements: intended
function, specified period of time, and specified conditions. As reliability is quan-
tified by probability, any attempts to measure it involve the use of probabilistic
and statistical methods. Hence, probability theory and statistics are important
mathematical tools for reliability engineering.
Reliability engineering is the discipline of ensuring that a product will be
reliable when operated in a specified manner. In other words, the function of
reliability engineering is to avoid failures. In reality, failures are inevitable; a
product will fail sooner or later. Reliability engineering is implemented by taking
structured and feasible actions that maximize reliability and minimize the effects
of failures. In general, three steps are necessary to accomplish this objective. The
first step is to build maximum reliability into a product during the design and
development stage. This step is most critical in that it determines the inherent reli-
ability. The second step is to minimize production process variation to assure that
the process does not appreciably degrade the inherent reliability. Once a product
is deployed, appropriate maintenance operations should be initiated to allevi-
ate performance degradation and prolong product life. The three steps employ a
large variety of reliability techniques, including, for example, reliability planning, robust design, accelerated testing, stress screening, and warranty analysis, which are described in subsequent chapters.
1.2 PRODUCT LIFE CYCLE
Product life cycle refers to sequential phases from product planning to disposal.
Generally, it comprises six main stages, as shown in Figure 1.1. The stages, from
product planning to production, take place during creation of the product and thus
are collectively called the product realization process. The tasks in each stage
are described briefly below.
Design and Development Phase This phase usually begins with preparation of
detailed product specifications on reliability, features, functionalities, economics,
ergonomics, and legality. The specifications must meet the requirements defined
in the product planning phase, ensure that the product will satisfy customer expec-
tations, comply with governmental regulations, and establish a strong competitive
position in the marketplace. The next step is to carry out the concept design. The
starting point of developing a concept is the design of a functional structure that
determines the flow of energy and information and the physical interactions. The
functions of subsystems within a product need to be clearly defined; the require-
ments regarding these functions arise from the product specifications. Functional
block diagrams are always useful in this step. Once the architecture is complete,
FIGURE 1.1 Product life cycle: product planning, design and development, verification and validation, and production (together forming the product realization process), followed by field deployment and disposal
the physical conception begins to determine how the functions of each subsystem
can be fulfilled. This step benefits from the use of advanced design techniques
such as TRIZ and axiomatic design (Suh, 2001; K. Yang and El-Haik, 2003)
and may result in innovations in technology. Concept design is a fundamental
stage that largely determines reliability, robustness, cost, and other competitive
potentials.
Concept design is followed by detailed design. This step begins with the
development of detailed design specifications which assure that the subsystem
requirements are satisfied. Then physical details are devised to fulfill the functions
of each subsystem within the product structure. The details may include physical
linkage, electrical connection, nominal values, and tolerances of the functional
parameters. Materials and components are also selected in this step. It is worth
noting that design and development is essentially an iterative task as a result of
Verification and Validation Phase This phase consists of two major steps:
design verification (DV) and process validation (PV). Once a design is com-
pleted successfully, a small number of prototypes are built for DV testing to
prove that the design achieves the functional, environmental, reliability, regula-
tory, and other requirements concerning the product as stipulated in the product
specifications. Prior to DV testing, a test plan must be developed that specifies
the test conditions, sample sizes, acceptance criteria, test operation procedures,
and other elements. The test conditions should reflect the real-world use that
the product will encounter when deployed in the field. A large sample size in
DV testing is often unaffordable; however, it should be large enough so that the
evidence to confirm the design achievement is statistically valid. If functional
nonconformance or failure occurs, the root causes must be identified for poten-
tial design changes. The redesign must undergo DV testing until all acceptance
criteria have been met completely.
Parallel to DV testing, production process planning may be initiated so that
pilot production can begin once the design is verified. Process planning involves
the determination of methods for manufacturing a product. In particular, we
choose the steps required to manufacture the product, tooling processes, pro-
cess checkpoints and control plans, machines, tools, and other requirements. A
computer simulation is helpful in creating a stable and productive production
process.
The next step is PV testing, whose purpose is to validate the capability of
the production process. The process must not degrade the inherent reliability
to an unacceptable level and must be capable of manufacturing products that
meet all specifications with minimum variation. By this time, the process has
been set up and is intended for production at full capacity. Thus, the test units
represent the products that customers will see in the marketplace. In other words,
the samples and the final products are not differentiable, because both use the
same materials, components, production processes, and process monitoring and
measuring techniques. The sample size may be larger than that for DV testing,
due to the need to evaluate process variation. The test conditions and acceptance
criteria are the same as those for DV testing.
Production Phase Once the design is verified and the process is validated,
full capacity production may begin. This phase includes a series of interrelated
activities such as materials handling, production of parts, assembly, and quality
control and management. The end products are subject to final test and then
shipped to customers.
Field Deployment Phase In this phase, products are sold to customers and real-
ize the values built in during the product realization process. This phase involves
marketing advertisement, sales service, technical support, field performance mon-
itoring, and continuous improvement.
Disposal This is the terminal phase of a product in the life cycle. A product
is discarded, scrapped, or recycled when it is unable to continue service or is not
cost-effective. A nonrepairable product is discarded once it fails; a repairable
product may be discarded because it is not worthy of repair. The service of some
repairable products is discontinued because their performance does not meet
customer demands. The manufacturer must provide technical support to dispose
of, dismantle, and recycle the product to minimize the associated costs and the
adverse impact on the environment.
A conventional product realization process is serial; that is, a step starts only after
the preceding step has been completed. In the sequential model, the information
flows in succession from phase to phase. Design engineers in the upstream part
of the process usually do not address the manufacturability, testability, and ser-
viceability in their design adequately because of a lack of knowledge. Once the
design is verified and the process fails to be validated due to inadequate manufac-
turability, design changes in this phase will increase cost substantially compared
to making the changes in the design and development phase. In general, the cost to fix a design increases by an order of magnitude with each subsequent phase (Levin
and Kalal, 2003).
The application of concurrent engineering to a product realization process is
the solution to problems associated with the sequential model. In the framework
of concurrent engineering, a cross-functional team is established representing
every aspect of the product, including design, manufacturing process, reliability
and quality planning, marketing and sales, purchasing, cost accounting, mate-
rial handling, material control, data management and communication, service,
testing, and others. The team relays information to design engineers concerning
all aspects of the product so that from the very beginning the engineers will
address any potential issues that would otherwise be ignored. The information
flow is multidirectional between all functional areas, as stated above, and contin-
ues throughout the entire product realization process. As a result, other phases in
addition to design and development also benefit from the concurrent involvement.
For example, test plans can be developed in the design and development phase,
with valuable input from design engineers and other team members. If a testabil-
ity problem is discovered in this phase, design engineers are more likely to make
design changes. Under concurrent engineering, most phases of the product real-
ization process can take place simultaneously. The resulting benefits are twofold:
maximizing the chance of doing things right the first time and reducing
the time to market. Ireson et al. (1996) and Usher et al. (1998) describe concur-
rent engineering in greater detail and present application examples in different
industries.
In the environment of concurrent engineering, a multidisciplinary reliability
team is required to perform effective reliability tasks. The team is an integral
part of the engineering team and participates in decision making so that reli-
ability objectives and constraints are considered. Because reliability tasks are
incorporated into engineering activities, a concurrent product realization process
requires multiple reliability tasks to be conducted simultaneously. The environment
allows reliability tasks to be implemented in the upfront phases of the process to
consider the potential influences that might be manifested in subsequent phases.
For example, the reliability allocation performed at the beginning of the process
should take into account the technological feasibility of achieving the reliability,
economics of demonstrating the reliability, and manufacturability of components.
Although being concurrent is always desirable, some reliability tasks must be per-
formed sequentially. For example, a process FMEA usually starts after a design
FMEA has been completed because the former utilizes the outputs of the lat-
ter. In these situations, we should understand the interrelationships between the
reliability tasks and sequence the tasks to maximize the temporal overlap.
PROBLEMS
1.1 Explain the concept of reliability and the function of reliability engineering.
1.2 Describe the key engineering tasks and reliability roles in each phase of a
product life cycle.
1.3 Explain the important differences between serial and concurrent product real-
ization processes.
1.4 How should a reliability program be organized in the environment of con-
current engineering?
2
RELIABILITY DEFINITION,
METRICS, AND PRODUCT LIFE
DISTRIBUTIONS
2.1 INTRODUCTION
This chapter delineates the reliability definition, reliability metrics, and commonly used product life distributions, including the exponential, Weibull, smallest extreme value, normal, and lognormal. These materials are the basis for the reliability design, testing, and analysis presented in subsequent chapters.
2.2 RELIABILITY DEFINITION
In our daily life, reliability has a broad meaning and often means dependabil-
ity. In technical terms, reliability is defined as the probability that a product
performs its intended function without failure under specified conditions for
a specified period of time. The definition contains three important elements:
intended function, specified period of time, and specified conditions.
The definition above indicates that reliability depends on specification of the
intended function or, complementarily, the failure criteria. For a binary-state
product, the intended function is usually objective and obvious. For example,
lighting is the intended function of a light bulb. A failure occurs when the light
bulb is blown out. For a multistate or degradation product, the definition of
an intended function is frequently subjective. For example, the remote key to
a car is required to command the operations successfully at a distance up to,
say, 30 meters. The specification of a threshold (30 meters in this example) is
somewhat arbitrary but largely determines the level of reliability. A quantitative
relationship between the life and threshold for certain products is described in
G. Yang and Yang (2002). If the product is a component to be installed in a
system, the intended function must be dictated by the system requirements, and
thus when used in different systems, the same components may have different
failure criteria. In the context of commercial products, the customer-expected
intended functions often differ from the technical intended functions. This is
especially true when products are in the warranty period, during which customers tend to make warranty claims against products that have degraded appreciably even though they have not technically failed.
Reliability is a function of time. In the reliability definition, the period of
time specified may be the warranty length, design life, mission time, or other
period of interest. The design life should reflect customers’ expectations and
be competitive in the marketplace. For example, in defining the reliability of a
Probability Density Function (pdf) The pdf, denoted f (t), indicates the failure
distribution over the entire time range and represents the absolute failure speed.
The larger the value of f (t), the more failures that occur in a small interval of
time around t. Although f (t) is rarely used to measure reliability, it is the basic
tool for deriving other metrics and for conducting in-depth analytical studies.
Cumulative Distribution Function (cdf) The cdf, denoted F (t), is the proba-
bility that a product will fail by a specified time t. It is the probability of failure,
often interpreted as the population fraction failing by time t. Mathematically, it
is defined as
F(t) = Pr(T ≤ t) = ∫_{−∞}^{t} f(t) dt.  (2.1)
The pdf can be obtained from the cdf by
f(t) = dF(t)/dt.  (2.2)
Reliability The reliability function, denoted R(t), also called the survival func-
tion, is often interpreted as the population fraction surviving time t. R(t) is the
probability of success, which is the complement of F (t). It can be written as
R(t) = Pr(T ≥ t) = 1 − F(t) = ∫_{t}^{∞} f(t) dt.  (2.5)
From (2.4) and (2.5), the reliability function of the exponential distribution is R(t) = exp(−λt).  (2.6)
Hazard Function The hazard function or hazard rate, denoted h(t) and often
called the failure rate, measures the rate of change in the probability that a
surviving product will fail in the next small interval of time. It can be written as
h(t) = lim_{Δt→0} Pr(t < T ≤ t + Δt | T > t)/Δt = −(1/R(t)) dR(t)/dt = f(t)/R(t).  (2.7)
From (2.3), (2.6), and (2.7), the hazard rate of the exponential distribution is
h(t) = λ. (2.8)
Equation (2.8) indicates that the hazard rate of the exponential distribution is
a constant.
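The relationships among f(t), F(t), R(t), and h(t) are easy to verify numerically. The following minimal Python sketch (scipy), with an assumed failure rate λ = 2 × 10^−6 failures per hour chosen only for illustration, evaluates the metrics of an exponential life distribution and confirms that the hazard rate equals λ at every time.

from scipy.stats import expon

lam = 2e-6                       # assumed failure rate, failures per hour
dist = expon(scale=1.0 / lam)    # scipy parameterizes the exponential by scale = 1/lambda

for t in (1e4, 1e5, 1e6):        # hours
    f = dist.pdf(t)              # pdf f(t)
    F = dist.cdf(t)              # probability of failure by t, F(t)
    R = dist.sf(t)               # reliability R(t) = 1 - F(t)
    h = f / R                    # hazard rate h(t) = f(t)/R(t), per (2.7)
    print(f"t = {t:.0e} h: F = {F:.4f}, R = {R:.4f}, h = {h:.2e} per hour")
# h(t) comes out equal to lam at each t, the constant hazard rate of (2.8)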
The unit of hazard rate is failures per unit time, such as failures per hour
or failures per mile. In high-reliability electronics applications, FIT (failures in
time) is the commonly used unit, where 1 FIT equals 10^−9 failures per hour. In
the automotive industry, the unit “failures per 1000 vehicles per month” is often
used.
In contrast to f (t), h(t) indicates the relative failure speed, the propensity
of a surviving product to fail in the coming small interval of time. In general,
there are three types of hazard rate in terms of its trend over time: decreasing
hazard rate (DFR), constant hazard rate (CFR), and increasing hazard rate (IFR).
Figure 2.1 shows the classical bathtub hazard rate function. The curve represents
the observation that the life span of a product population comprises three distinct periods: early failures, random failures, and wear-out failures.
FIGURE 2.1 Classical bathtub hazard rate curve: early failures before t1, random failures between t1 and t2, and wear-out failures after t2
Early failures are usually caused by major latent defects, which develop into
patent defects early in the service time. The latent defects may be induced by
manufacturing process variations, material flaws, and design errors; customer
misuse is another cause of early failures. In the automotive industry, the infant
mortality problem is significant. It is sometimes called the one-month effect,
meaning that early failures usually occur in the first month in service. Although
a decreasing hazard rate can result from infant mortality, early failures do not
necessarily lead to a decreasing hazard rate. A substandard subpopulation con-
taining latent defects may have an increasing hazard rate, depending on the life
distribution of the subpopulation. For example, if the life distribution of the sub-
standard products is Weibull with a shape parameter of less than 1, the hazard
rate decreases over time. If the shape parameter is greater than 1, the hazard rate
has an increasing trend.
In the random failure period, the hazard rate remains approximately constant.
During this period of time, failures do not follow a predictable pattern and occur at
random due to the unexpected changes in stresses. The stresses may be higher or
lower than the design specifications. Higher stresses cause overstressing, whereas
lower stresses result in understressing. Both over- and understressing may pro-
duce failures. For instance, an electromagnetic relay may fail due to a high or
low electric current. A high current melts the electric contacts; a low current
increases the contact resistance. In the constant hazard rate region, failures may
also result from minor defects that are built into products due to variations in
the material or the manufacturing process. Such defects take longer than major
defects to develop into failures.
In the wear-out region, the hazard rate increases with time as a result of
irreversible aging effects. The failures are attributable to degradation or wear out,
which accumulates and accelerates over time. As a product enters this period, a
failure is imminent. To minimize the failure effects, preventive maintenance or
scheduled replacement of products is often necessary.
Many products do not exhibit a complete bathtub curve. Instead, they have
one or two segments of the curve. For example, most mechanical parts are domi-
nated by the wear-out mechanism and thus have an increasing hazard rate. Some
components exhibit a decreasing hazard rate in the early period, followed by an
increasing hazard rate. Figure 2.2 shows the hazard rate of an automotive sub-
system in the mileage domain, where the scale on the y-axis is not given here
to protect proprietary information. The hazard rate decreases in the first 3000
miles, during which period the early failures took place. Then the hazard rate
stays approximately constant through 80,000 miles, after which failure data are
not available.
FIGURE 2.2 Hazard rate of an automotive subsystem in the mileage domain (0 to 90 × 10^3 miles)
The cumulative hazard function, denoted H(t), is defined as H(t) = ∫_0^t h(t) dt.  (2.9)
From (2.7) and (2.9), the relationship between H(t) and R(t) can be written as R(t) = exp[−H(t)].  (2.11)
If H (t) is very small, a Taylor series expansion results in the following approx-
imation:
R(t) ≈ 1 − H (t). (2.12)
H (t) is a nondecreasing function. Figure 2.3 depicts the H (t) associated with
DFR, CFR, and IFR. The shapes of H (t) for DFR, CFR, and IFR are convex,
flat, and concave, respectively. Figure 2.4 illustrates the H(t) corresponding to the bathtub curve hazard rate in Figure 2.1.
FIGURE 2.3 Cumulative hazard functions corresponding to DFR, CFR, and IFR
FIGURE 2.4 Cumulative hazard function corresponding to the bathtub curve hazard rate
100pth Percentile The 100pth percentile, denoted tp, is the time by which a proportion p of the population fails; it is obtained by inverting the cdf: tp = F^{−1}(p).  (2.13)
Mean Time to Failure (MTTF) MTTF is the expected life E(T ) of a nonre-
pairable product. It is defined as
MTTF = E(T) = ∫_{−∞}^{∞} t f(t) dt.  (2.15)
The f (t), F (t), R(t), h(t), H (t), E(T ), and Var(T ) of the exponential distribu-
tion are given in (2.3), (2.4), (2.6), (2.8), (2.10), (2.17), and (2.19), respectively.
In these equations, λ is called the hazard rate or failure rate. The f (t), F (t), R(t),
h(t), and H (t) are shown graphically in Figure 2.5, where θ = 1/λ is the mean
time. As shown in the figure, when t = θ , F (t) = 0.632 and R(t) = 0.368. The
first derivative of R(t) with respect to t evaluated at time 0 is R′(0) = −1/θ. This
implies that the tangent of R(t) at time 0 intersects the t axis at θ . The tangent is
shown in the R(t) plot in Figure 2.5. The slope of H (t) is 1/θ , illustrated in the
H (t) plot in Figure 2.5. These properties are useful for estimating the parameter
λ or θ using graphical approaches such as the probability plot (Chapter 7) or the
cumulative hazard plot (Chapter 11).
The exponential distribution possesses an important property: that the hazard
rate is a constant. The constant hazard rate indicates that the probability of a
surviving product failing in the next small interval of time is independent of time.
That is, the amount of time already exhausted for an exponential product has no
effects on the remaining life of the product. Therefore, this characteristic is also
called the memoryless property. Mathematically, the property can be expressed as Pr(T > t + s | T > t) = Pr(T > s) for any t ≥ 0 and s ≥ 0.
FIGURE 2.5 Exponential f(t), F(t), R(t), h(t), and H(t), with θ = 1/λ
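The memoryless property is easy to check numerically. A minimal Python sketch follows, with an assumed mean life θ (the value below is hypothetical, not taken from the text):

from scipy.stats import expon

theta = 10_000.0                  # assumed mean life, hours
dist = expon(scale=theta)

t, s = 8_000.0, 1_000.0           # time already survived and additional time
p_given_survival = dist.sf(t + s) / dist.sf(t)   # Pr(T > t + s | T > t)
p_new_unit = dist.sf(s)                          # Pr(T > s) for a new unit
print(p_given_survival, p_new_unit)
# Both values are identical: the time already exhausted has no effect on the
# remaining life, which is the memoryless property expressed above.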
Random failures occur when the applied stress is greater than the threshold strength. On the other hand, the argu-
ments imply that the exponential distribution is inappropriate for failures due to
degradation or wear out.
The exponential distribution is widely used and is especially popular in mod-
eling the life of some electronic components and systems. For example, Murphy
et al. (2002) indicate that the exponential distribution adequately fits the failure
data of a wide variety of systems, such as radar, aircraft and spacecraft electron-
ics, satellite constellations, communication equipment, and computer networks.
The exponential distribution is also deemed appropriate for modeling the life
of electron tubes, resistors, and capacitors (see, e.g., Kececioglu, 1991; Meeker
and Escobar, 1998). However, the author’s test data suggest that the exponential
distribution does not adequately fit the life of several types of capacitors and
resistors, such as electrolytic aluminum and tantalum capacitors and carbon film
resistors. The Weibull distribution is more suitable. The failure of these compo-
nents is driven primarily by performance degradation; for example, an electrolytic
aluminum capacitor usually fails because of exhaustion of the electrolyte.
The exponential distribution is often mistakenly used because of its mathe-
matical tractability. For example, the reliability prediction MIL-HDBK-217 series
assumes that the life of electronic and electromechanical components follows an
exponential distribution. Because of the exponential assumption and numerous
other deficiencies, this handbook has been heavily criticized and is no longer
actively maintained by the owner [U.S. Department of Defense (U.S. DoD)].
Another common misuse lies in redundant systems comprised of exponential
components. Such systems are nonexponentially distributed (see, e.g., Murphy
et al. 2002).
In the Weibull formulas above, β is the shape parameter and α is the characteristic life; both are positive. α, also called the scale parameter, equals the 63.2nd percentile (i.e., α = t0.632). It has the same unit as t: for example, hours,
miles, and cycles. The generic form of the Weibull distribution has an additional
parameter, called the location parameter. Kapur and Lamberson (1977), Nelson
(1982) and Lawless (2002), among others, present the three-parameter Weibull
distribution.
To illustrate the Weibull distribution graphically, Figure 2.6 plots f (t), F (t),
h(t), and H (t) for α = 1 and β = 0.5, 1, 1.5, and 2. As shown in the figure,
the shape parameter β determines the shape of the distribution. When β < 1
(β > 1), the Weibull distribution has a decreasing (increasing) hazard rate. When
β = 1, the Weibull distribution has a constant hazard rate and is reduced to the
exponential distribution. When β = 2, the hazard rate increases linearly with t,
as shown in the h(t) plot in Figure 2.6. In this case the Weibull distribution is
called the Rayleigh distribution, described in, for example, Elsayed (1996). The
linearly increasing hazard rate models the life of some mechanical and electrome-
chanical components, such as valves and electromagnetic relays, whose failure
is dominated by mechanical or electrical wear out.
It can be seen that the Weibull distribution is very flexible and capable of
modeling each region of a bathtub curve. It is the great flexibility that makes
the Weibull distribution widely applicable. Indeed, in many applications, it is the
best choice for modeling not only the life but also the product’s properties, such
as the performance characteristics.
FIGURE 2.6 Weibull f(t), F(t), h(t), and H(t) for α = 1 and β = 0.5, 1, 1.5, 2
This indicates that 2.4% of the component population will fail by the end of the warranty period. The reliability at 36,000 miles is R(36,000) = 1 − 0.024 = 0.976. That is, 97.6% of the component population will survive the warranty period.
From (2.23), the hazard rate at 36,000 miles is
h(36,000) = (1.3/(6.2 × 10^5)) × (36,000/(6.2 × 10^5))^{1.3−1} = 0.89 × 10^{−6} failures per mile.
Because β > 1, the hazard rate increases with mileage. From (2.25), t0.01, the mileage by which 1% of the population will fail, is t0.01 = 6.2 × 10^5 × [−ln(1 − 0.01)]^{1/1.3} ≈ 18,014 miles.
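These example calculations can be reproduced with a few lines of Python (scipy); the parameters below are the Weibull shape and characteristic life quoted above, and the percentile is the standard Weibull quantile t_p = α[−ln(1 − p)]^{1/β}.

from scipy.stats import weibull_min

beta, alpha = 1.3, 6.2e5                  # shape parameter and characteristic life (miles)
dist = weibull_min(c=beta, scale=alpha)

t = 36_000.0                              # warranty mileage limit
F = dist.cdf(t)                           # population fraction failing by t
R = dist.sf(t)                            # population fraction surviving t
h = dist.pdf(t) / dist.sf(t)              # hazard rate at t, failures per mile
t_01 = dist.ppf(0.01)                     # mileage by which 1% of the population fails
print(f"F = {F:.3f}, R = {R:.3f}, h = {h:.2e} per mile, t_0.01 = {t_01:,.0f} miles")
# Approximately: F = 0.024, R = 0.976, h = 8.9e-07 per mile, t_0.01 = 18,014 miles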
The life of a population consisting of two subpopulations can be modeled with a mixed Weibull distribution whose pdf is f(t) = ρ f1(t) + (1 − ρ) f2(t), where ρ is the fraction of the entire population accounted for by the first subpopulation and βi and αi (i = 1, 2) are the shape parameter and characteristic life of subpopulation i. The associated cdf is F(t) = ρ F1(t) + (1 − ρ) F2(t), where Fi(t) = 1 − exp[−(t/αi)^{βi}].
For the subpopulation parameters of this example, evaluating the cdf at 36,000 miles indicates that 20.2% of the component population produced in the first two months of production will fail by 36,000 miles.
[Figures: pdf f(t) (×10^−5) and cdf F(t) of the mixed Weibull distribution plotted against mileage (×10^3), from 0 to 120,000 miles]
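Curves such as those in the figures can be generated directly from the mixture formulas. The sketch below is a minimal Python illustration; the mixing fraction and the two sets of Weibull parameters are hypothetical values chosen only to show the form of the computation, not the values used in the example above.

import numpy as np
from scipy.stats import weibull_min

rho = 0.05                                  # assumed fraction in the defective subpopulation
sub1 = weibull_min(c=0.8, scale=4.0e4)      # assumed defective subpopulation (decreasing hazard)
sub2 = weibull_min(c=1.5, scale=3.0e5)      # assumed main subpopulation

miles = np.linspace(1.0, 120_000.0, 7)
f_mix = rho * sub1.pdf(miles) + (1 - rho) * sub2.pdf(miles)   # mixture pdf
F_mix = rho * sub1.cdf(miles) + (1 - rho) * sub2.cdf(miles)   # mixture cdf
for m, F in zip(miles, F_mix):
    print(f"{m:>9,.0f} miles: F(t) = {F:.3f}")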
The cdf is
F(t) = 1 − exp[−exp((t − µ)/σ)],  −∞ < t < ∞.  (2.30)
FIGURE 2.9 Smallest extreme value f(t), F(t), h(t), and H(t) for µ = 30 and σ = 3, 6, 9
The cumulative hazard function is simply the hazard function multiplied by σ. The mean, variance, and 100pth percentile are
E(T) = µ − γσ,  Var(T) = (π²/6)σ²,  tp = µ + σ ln[−ln(1 − p)],
respectively, where γ ≈ 0.5772 is Euler's constant.
For the normal distribution, the hazard function and cumulative hazard function can be obtained from (2.7) and (2.9), respectively. They cannot be simplified and thus are not given here.
When T has a normal distribution, it is usually indicated by T ∼ N (µ, σ 2 ).
µ is the location parameter and σ is the scale parameter. They are also the pop-
ulation mean and standard deviation as shown in (2.36) and have the same unit
as t, where −∞ < µ < ∞ and σ > 0. Figure 2.10 plots the normal distribution
for µ = 15 and various values of σ .
FIGURE 2.10 Normal f(t), F(t), h(t), and H(t) for µ = 15 and σ = 1, 1.5, 2
Φ(z) is tabulated in, for example, Lewis (1987) and Nelson (1990, 2004). Many commercial software packages such as Minitab and Microsoft Excel are capable of doing the calculation. With the convenience of Φ(z), (2.35) can be written as
F(t) = Φ((t − µ)/σ),  −∞ < t < ∞.  (2.39)
The sum of n independent normally distributed random variables is also normally distributed, with mean and variance
µ = Σ_{i=1}^{n} µi  and  σ² = Σ_{i=1}^{n} σi².
Example 2.3 An electronic circuit contains three resistors in series. The resis-
tances (say, R1 , R2 , R3 ) in ohms of the three resistors can be modeled with the
normal distributions R1 ∼ N (10, 0.32 ), R2 ∼ N (15, 0.52 ), and R3 ∼ N (50, 1.82 ).
Calculate the mean and standard deviation of the total resistance and the proba-
bility of the total resistance being within the tolerance range 75 ± 5%.
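A sketch of the calculation in Python (scipy): because the total resistance of resistors in series is the sum of independent normal random variables, its mean is the sum of the means and its variance is the sum of the variances.

from math import sqrt
from scipy.stats import norm

means = [10.0, 15.0, 50.0]       # resistor means (ohms)
sds = [0.3, 0.5, 1.8]            # resistor standard deviations (ohms)

mu = sum(means)                                  # mean of the total resistance
sigma = sqrt(sum(s ** 2 for s in sds))           # standard deviation of the total resistance
lo, hi = 75.0 * 0.95, 75.0 * 1.05                # tolerance range 75 +/- 5%
p = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
print(f"mu = {mu:.1f} ohms, sigma = {sigma:.2f} ohms, Pr(within tolerance) = {p:.3f}")
# Roughly 75.0 ohms, 1.89 ohms, and a probability of about 0.95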
The lognormal pdf and cdf are
f(t) = (1/(√(2π) σ t)) exp{−[ln(t) − µ]²/(2σ²)} = (1/(σt)) φ([ln(t) − µ]/σ),  t > 0,  (2.41)
F(t) = ∫_0^t (1/(√(2π) σ y)) exp{−[ln(y) − µ]²/(2σ²)} dy = Φ([ln(t) − µ]/σ),  t > 0,  (2.42)
where φ(·) and Φ(·) are the standard normal pdf and cdf. The 100pth percentile is
tp = exp(µ + zp σ ), (2.43)
FIGURE 2.11 Lognormal f(t), F(t), h(t), and H(t) for µ = 1 and σ = 0.5, 1, 1.5
The parameters µ and σ are the mean and standard deviation of ln(T), because ln(T) has a normal distribution when T is lognormal.
The lognormal distribution is plotted in Figure 2.11 for µ = 1 and various val-
ues of σ. As the h(t) plot in Figure 2.11 shows, the hazard rate is not monotone.
It increases and then decreases with time. The value of t at which the hazard
rate is the maximum is derived below.
The lognormal hazard rate is given by
h(t) = φ / [σ t (1 − Φ)],  (2.45)
where φ and Φ are the standard normal pdf and cdf evaluated at [ln(t) − µ]/σ. Taking the natural logarithm of both sides of (2.45), differentiating with respect to t, and setting the derivative equal to zero gives
[ln(t) − µ]/σ + σ − φ/(1 − Φ) = 0.  (2.47)
The solution of t, say t ∗ , to (2.47) is the value at which the hazard rate is
maximum. Before t ∗ , the hazard increases; after t ∗ , the hazard rate decreases.
Because (2.47) involves t only through [ln(t) − µ]/σ, its solution, say z∗, depends only on σ. Letting x = exp(σ z∗), the time at which the hazard rate is maximum is
t∗ = x exp(µ).  (2.48)
FIGURE 2.12 x versus σ
Figure 2.12 plots x versus σ . From the chart the value of x is read for a specific
σ value. Then it is substituted into (2.48) to calculate t ∗ . It can be seen that the
hazard rate decreases over practically the entire lifetime when σ > 2. The hazard rate follows an increasing trend for a long period of time when σ is small, especially in the neighborhood of 0.2. The time at which the hazard rate begins to decrease
frequently interests design engineers and warranty analysts. If this time occurs
after a very long time in service, the failure process during the useful life is
dominated by degradation or wear out.
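Rather than reading x from the chart, t∗ can be found by solving (2.47) numerically. A minimal Python sketch (scipy), using µ = 1 and σ = 0.5, the values plotted in Figure 2.11:

from math import exp
from scipy.optimize import brentq
from scipy.stats import norm

mu, sigma = 1.0, 0.5             # lognormal parameters of Figure 2.11

def g(z):
    # Left side of (2.47) written in terms of z = [ln(t) - mu]/sigma;
    # norm.sf(z) is the numerically stable form of 1 - Phi(z).
    return z + sigma - norm.pdf(z) / norm.sf(z)

z_star = brentq(g, -10.0, 10.0)          # root of (2.47)
t_star = exp(mu + sigma * z_star)        # time of maximum hazard rate
print(f"z* = {z_star:.3f}, t* = {t_star:.2f}")
# For sigma = 0.5 the hazard rate peaks near t* = 4.8 and decreases thereafter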
The lognormal distribution is useful in modeling the life of some electronic products and of metals failing from fatigue or cracking. The increasing hazard rate in early
time usually fits the life of the freak subpopulation, and the decreasing hazard rate
describes the main subpopulation. The lognormal distribution is also frequently
used to model the use of products. For example, as shown in Lawless et al. (1995),
M. Lu (1998), and Krivtsov and Frankstein (2004), the accumulated mileages of
an automobile population at a given time in service can be approximated using
a lognormal distribution. This is illustrated in the following example and studied
further in Chapter 11.
Example 2.4 The warranty plan of a car population covers 36 months in service
or 36,000 miles, whichever comes first. The accumulated mileage U of the car
population by a given month in service can be modeled with the lognormal
distribution with scale parameter 6.5 + ln(t) and shape parameter 0.68, where
t is the months in service of the vehicles. Calculate the population fraction
exceeding 36,000 miles by 36 months.
SOLUTION From (2.42) the probability that a car exceeds 36,000 miles by
36 months is
Pr(U ≥ 36,000) = 1 − Φ([ln(36,000) − 6.5 − ln(36)]/0.68) = 0.437.
That is, by the end of warranty time, 43.7% of the vehicles will be leaving the
warranty coverage by exceeding the warranty mileage limit.
PROBLEMS
2.1 Define the reliability of a product of your choice, and state the failure criteria
for the product.
2.2 Explain the concepts of hard failure and soft failure. Give three examples of
each.
2.3 Select the metrics to measure the reliability of the product chosen in Problem
2.1, and justify your selection.
2.4 Explain the causes for early failures, random failures, and wear-out failures.
Give examples of methods for reducing or eliminating early failures.
2.5 The life of an airborne electronic subsystem can be modeled using an expo-
nential distribution with a mean time of 32,000 hours.
(a) Calculate the hazard rate of the subsystem.
(b) Compute the standard deviation of life.
(c) What is the probability that the subsystem will fail in 16,000 hours?
(d) Calculate the 10th percentile.
(e) If the subsystem survives 800 hours, what is the probability that it will
fail in the following hour? If it survives 8000 hours, compute this prob-
ability. What can you conclude from these results?
2.6 The water pump of a car can be described by a Weibull distribution with
shape parameter 1.7 and characteristic life 265,000 miles.
(a) Calculate the population fraction failing by the end of the warranty
mileage limit (36,000 miles).
(b) Derive the hazard function.
(c) If a vehicle survives 36,000 miles, compute the probability of failure in
the following 1000 miles.
(d) Calculate B10 .
2.7 An electronic circuit has four capacitors connected in parallel. The nominal
capacitances of the four are 20, 80, 30, and 15 µF. The tolerance of each
capacitor is ± 10%. The capacitance can be approximated using a normal
distribution with mean equal to the nominal value and the standard deviation
equal to one-sixth of the two-sided tolerance. The total capacitance is the sum
of the four. Calculate the following:
(a) The mean and standard deviation of the total capacitance.
(b) The probability of the total capacitance being greater than 150 µF.
(c) The probability of the total capacitance being within the range 146 ±
10%.
2.8 The time to failure (in hours) of a light-emitting diode can be approximated
by a lognormal distribution with µ = 12.3 and σ = 1.2.
(a) Plot the hazard function.
(b) Determine the time at which the hazard rate begins to decrease.
(c) Compute the MTTF and standard deviation.
(d) Calculate the reliability at 15,000 hours.
(e) Estimate the population fraction failing in 50,000 hours.
(f) Compute the cumulative hazard rate up to 50,000 hours.
3
RELIABILITY PLANNING
AND SPECIFICATION
3.1 INTRODUCTION
FIGURE 3.1 Customer satisfaction versus the degree to which a want is satisfied, for basic, performance, and excitement wants
[Figure: structure of the house of quality, showing the technical characteristics (HOWs), their correlations, direction of improvement, technical importance, technical targets, and technical competitive benchmarking]
The customer wants component is often referred to as WHAT, meaning what is to be addressed. The technical axis describes the technical char-
acteristics that affect customer satisfaction directly for one or more customer
expectations. Also on the technical axis are the correlations, importance, and tar-
gets of the technical characteristics and technical competitive benchmarking. The
technical characteristics component is often referred to as HOW, meaning how
to address WHAT; then technical targets are accordingly called HOW MUCH.
The interrelationships between customer wants and technical characteristics are
evaluated in the relationship matrix.
The objective of QFD is to translate customer wants, including reliability
expectation, into operational design characteristics and production control vari-
ables. This can be done by deploying the houses of quality in increasing detail.
In particular, customer wants and reliability demands are converted to technical
characteristics through the first house of quality. Reliability is usually an impor-
tant customer need and receives a high importance rating. The important technical
characteristics, which are highly correlated to reliability demand, among others,
are cascaded to design parameters at the part level through the second house of
quality. The design parameters from this step of deployment should be closely
related to reliability and can be used in subsequent robust design (Chapter 5) and
performance degradation analysis (Chapter 8). Critical design parameters are then
deployed in process planning to determine process parameters through the third
house of quality. Control of the process parameters identified directly minimizes
unit-to-unit variation (an important noise factor in robust design) and reduces
infant mortality and variation in degradation rates. The fourth house of quality
is then used to translate the process parameters into production requirements.
The deployment process is illustrated in Figure 3.3. A complete QFD process
consists of four phases: (1) product planning, (2) part deployment, (3) process
deployment, and (4) production deployment. The four phases are described in
detail in the following subsections.
1. State what customers want in the WHAT entries of the first house of quality.
These customer expectations are usually nontechnical and fuzzy expressions. For
example, customers may state their reliability wants as “long life,” “never fail,”
and “very dependable.” Technical tools such as affinity and tree diagrams may be
used to group various customer requirements (Bossert, 1991). For an automobile
windshield wiper system, customer expectations include high reliability, minimal
operation noise, no residual water traces, no water film, and large wiping area,
among others. These wants are listed in the WHAT entries of the quality house,
as shown in Figure 3.4.
2. Determine customer desirability, which rates the desirability for each cus-
tomer want relative to every other want. Various scaling approaches are used
in practice, but none of them is theoretically sound. In the windshield wiper
example, we use the analytic hierarchy process approach (Armacost et al., 1994),
which rates importance levels on a scale of 1 to 9, where 9 is given to the most important want.
In the wiper system example, the desirability ratings are 9 for high reliability, 5 for minimal operation noise, 7 for no residual water traces, 9 for no water film, and 3 for large wiping area; they are entered in the customer desirability column of the house of quality in Figure 3.4.
FIGURE 3.4 Example QFD for automobile windshield wiper system planning, relating customer wants to technical characteristics such as motor load, torque, arm length, arm rotation angle, and blade-to-glass friction coefficient, with competitive performance ratings for competing products A, B, and C and the prior-generation product D
3. Evaluate the competitive performance of the product against competing products on the customer axis. The purpose of the evaluation is to assess the strengths and weaknesses of the product being
designed and to identify areas for improvement. Shown on the right-hand side of
Figure 3.4 are the competitive performance ratings for three competitors (denoted
A, B, and C) and the predecessor of this wiper system (denoted D).
4. List the technical characteristics that directly affect one or more customer
wants on the customer axis. These characteristics should be measurable and
controllable and define technically the performance of the product being designed.
The characteristics will be deployed selectively to the other three houses of quality
in subsequent phases of deployment. In this step, fault tree analysis, cause-and-
effect diagrams, and test data analysis of similar products are helpful because the
technical characteristics that strongly influence reliability may not be obvious. In
the wiper system example, we have identified the technical characteristics that
describe the motor, arm, blade, linkage, and other components. Some of them
are listed in Figure 3.4.
5. Identify the interrelationships between customer wants and technical char-
acteristics. The strength of relationship may be classified into three levels, where
a rating of 9 is assigned to a strong relation, 3 to a medium relation, and 1
to a weak relation. Each technical characteristic must be interrelated to at least
one customer want; each customer want must also be addressed by at least one technical characteristic. This ensures that all customer wants are considered in product planning and that all technical characteristics are established properly. The
ratings of the relation strength for the wiper system are entered in the relation-
ship matrix entries of the quality house, shown in Figure 3.4. It can be seen that
the motor load is one of the technical characteristics that strongly affect system
reliability.
6. Develop the correlations between technical characteristics and indicate
them in the roof of the house of quality. The technical characteristics can have
a positive correlation, meaning that the change of one technical characteristic in
a direction affects another characteristic in the same direction. A negative corre-
lation means otherwise. Four levels of correlation are used: a strongly positive
correlation, represented graphically by ++; positive by +; negative by −; and
strongly negative by −−. Correlations usually add complexity to product design
and would result in trade-off decisions in selecting technical targets if the corre-
lations are negative. Correlations among the technical characteristics of the wiper
system appear in the roof of the quality house, as shown in Figure 3.4.
7. Determine the direction of improvement for each technical characteristic.
There are three types of characteristics: larger-the-better, nominal-the-best, and
smaller-the-better (Chapter 5), which are represented graphically by ↑, ◦, and ↓,
respectively, in a house of quality. The direction is to maximize, set to target,
or minimize the technical characteristic, depending on its type. The technical
characteristics listed in Figure 3.4 are all nominal-the-best type.
8. Calculate ratings of technical importance. For a given technical character-
istic, the values of the customer desirability index are multiplied by the corre-
sponding strength ratings. The sum of the products is the importance rating of
the technical characteristic. The importance ratings allow the technical character-
istics to be prioritized and thus indicate the significant characteristics that should
be selected for further deployment. Characteristics with low values of rating may
not need deployment to subsequent QFD phases. In the wiper system example,
the importance rating of the motor load is 9 × 9 + 5 × 3 + 7 × 3 + 9 × 3 = 144.
Ratings of the listed technical characteristics are in the technical importance row,
shown in Figure 3.4. The ratings indicate that the motor load is an important
characteristic and should be deployed to lower levels. (A small computational sketch of this rating calculation follows the list of steps.)
9. Perform technical competitive benchmarking. Determine the measurement
of each technical characteristic of the predecessor of the product as well as the
competing products evaluated on the customer axis. The measurements should
correlate strongly with the competitive performance ratings. Lack of correlation
signifies inadequacy of the technical characteristics in addressing customer expec-
tations. This benchmarking allows evaluation of the position of the predecessor
relative to competitors from a technical perspective and assists in the develop-
ment of technical targets. In the wiper system example, the measurements and
units of the technical characteristics of the products under comparison are shown
in Figure 3.4.
10. Determine a measurable target for each technical characteristic with inputs
from the technical competitive benchmarking. The targets are established so that
identified customer wants are fulfilled and the product being planned will be
highly competitive in the marketplace. The targets of the technical characteristics
listed for the wiper system are shown in Figure 3.4.
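As an illustration of step 8, the following minimal Python sketch computes the technical importance rating of the motor load from the customer desirability values and relationship strengths quoted above (the large wiping area is taken to have no relation to the motor load, consistent with the four-term sum in the text):

desirability = {"high reliability": 9, "minimal noise": 5,
                "no residual water traces": 7, "no water film": 9,
                "large wiping area": 3}
# Relationship strengths between each want and the motor load (9 strong, 3 medium,
# 1 weak, 0 none), following the example computation in step 8.
strength_motor_load = {"high reliability": 9, "minimal noise": 3,
                       "no residual water traces": 3, "no water film": 3,
                       "large wiping area": 0}
rating = sum(desirability[w] * strength_motor_load[w] for w in desirability)
print(rating)   # 9*9 + 5*3 + 7*3 + 9*3 = 144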
In the process deployment phase, the WHATs are the critical design parameters carried over from the HOWs of the second house of quality, and the new HOWs are the process parameters to
produce the WHATs at the target values. Outputs from this phase include critical
process parameters and their target values, which should be deployed to the next
phase for developing control plans. Deployment is critical in this phase, not only
because this step materializes customer wants in production, but also because the
process parameters and target values determined in this step have strong impacts
on productivity, yield, cost, quality, and reliability.
For most products, reliability is a performance need for which customers are
willing to pay more. Meeting this expectation increases customer satisfaction linearly; failing to meet it decreases satisfaction linearly, as depicted in Figure 3.1.
To win the war of sustaining and expanding market share, it is vital to estab-
lish competitive reliability requirements, which serve as the minimum goals and
must be satisfied or exceeded through design and production. In this section we
describe three methods of setting reliability requirements, which are driven by
customer satisfaction, warranty cost objectives, and total cost minimization. Lu
and Rudy (2000) describe a method for deriving a reliability requirement from
warranty repair objectives.
Reliability requirements must define what constitutes a failure (i.e., the failure
criteria). The definition may be obvious for a hard-failure product whose failure is
the complete termination of function. For a soft-failure product, failure is defined
in terms of performance characteristics crossing specified thresholds. As pointed
out in Chapter 2, the thresholds are more or less subjective and often arguable. It
is thus important to have all relevant parties involved in the specification process
and concurring as to the thresholds. In a customer-driven market, the thresholds
should closely reflect customer expectations. For example, a refrigerator may be
said to have failed if it generates, say, 50 dB of audible noise, at which level
90% of customers are dissatisfied.
As addressed earlier, life can be measured in calendar time, usage, or other
scales. The most appropriate life scale should be dictated by the underlying failure
mechanism that governs the product failure process. For example, mechanical
wear out is the dominant failure mechanism of a bearing, and the number of
revolutions is the most suitable life measure because wear out develops only by
rotation. The period of time specified should be stated on such a life scale. As
discussed before, reliability is a function of time (e.g., calendar age and usage).
Reliability requirements should define the time at which the reliability level is
specified. For many commercial products, the specified time is the design life.
Manufacturers may also stipulate other times of interest, such as warranty lengths
and mission times.
Reliability is influenced largely by the use environment. For example, a resis-
tor would fail much sooner at a high temperature than at ambient temperature.
Reliability requirements should include the operating conditions under which
the product must achieve the reliability specified. The conditions specified for
a product should represent the customer use environment, which is known as
the real-world usage profile. In designing subsystems within a product, this pro-
file is translated into the local operating conditions, which in turn become the
environmental requirements for the subsystems. Verification and validation tests
intended to demonstrate reliability must correlate the test environments to the
use conditions specified; otherwise, the test results will be unrealistic.
[Figure: a performance characteristic Yi plotted against time; the customer is satisfied while Yi remains below the threshold Di and dissatisfied once Yi crosses Di]
Suppose that customer want i (i = 1, 2, . . . , n) is affected by mi independent performance characteristics Yj with thresholds Dj. The probability Si that customer want i is satisfied is
Si = ∏_{j=1}^{mi} Pr(Yj ≤ Dj),  i = 1, 2, . . . , n.  (3.1)
If Si∗ is the satisfaction target specified for customer want i, the requirements can be written as
∏_{j=1}^{mi} Pr(Yj ≤ Dj) = Si∗,  i = 1, 2, . . . , n.  (3.2)
When the number of important customer wants equals the number of crit-
ical performance characteristics (i.e., n = m), (3.2) is a system containing m
equations with m unknowns. Solving the equation system gives unique solu-
tions of the probabilities, denoted pi (i = 1, 2, . . . , m). If the two numbers are
unequal, unique solutions may be obtained by adopting or dropping less important
customer wants, which have lower values of customer desirability.
Because the product is said to have failed if one of the m independent perfor-
mance characteristics crosses the threshold, the reliability target R ∗ of the product
can be written as
R∗ = ∏_{i=1}^{m} pi.  (3.3)
It is worth noting that meeting the minimum reliability level is a necessary and
not a sufficient condition to achieve all specified customer satisfactions simultane-
ously, because the reliability depends only on the product of pi (i = 1, 2, . . . , m).
This is illustrated in the following example. To fulfill all customer satisfactions,
it is important to ensure that Pr(Yi ≤ Di ) ≥ pi (i = 1, 2, . . . , m) for each perfor-
mance characteristic in product design.
Note that meeting this overall reliability target does not guarantee all customer
satisfactions. For instance, R ∗ = 0.857 may result in p1 = 0.98, p2 = 0.92, and
p3 = 0.95. Then E3 is not satisfied.
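A small sketch of the computation behind these figures, in Python:

# Overall reliability target as the product of the per-characteristic
# probabilities pi, per (3.3); the values are those quoted in the text.
p = [0.98, 0.92, 0.95]
r_target = 1.0
for pi in p:
    r_target *= pi
print(round(r_target, 3))   # about 0.857
# Meeting R* alone is not sufficient: each pi must also meet its own minimum,
# or some of the specified customer satisfactions will not be achieved.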
Suppose that n units are sold, the warranty period is t0, and the average cost per repair is c0. The expected total warranty cost is
Cw = n c0 W(t0),  (3.4)
where W(t0) is the expected number of repairs per unit by t0. If the repair is a
minimal repair (i.e., the failure rate of the product immediately after repair equals
that right before failure), W (t0 ) can be written as
W(t0) = ln[1/R(t0)].  (3.5)
Substituting (3.5) into (3.4) gives
Cw = n c0 ln[1/R(t0)].  (3.6)
Because the total warranty cost must not be greater than Cw∗ , from (3.6) the
reliability target is
R∗ = exp[−Cw∗/(c0 n)].  (3.7)
For a complicated product, the costs per repair and failure rates of subsystems
may be substantially different. In this situation, (3.4) does not provide a good
approximation to the total warranty cost. Suppose that the product has m sub-
systems connected in series and the life of the subsystems can be modeled with
the exponential distribution. Let c0i and λi denote the cost per repair and failure
rate of subsystem i, respectively. The expected warranty cost is
C_w = n t_0 \sum_{i=1}^{m} c_{0i} \lambda_i.   (3.8)
If the failure rate of each subsystem is roughly proportional to its production cost, that is,

\lambda_i = \frac{\lambda C_i}{C},   (3.9)

where \lambda is the failure rate of the product, C_i is the production cost of subsystem i, and C = \sum_{i=1}^{m} C_i is the total production cost, then (3.8) becomes

C_w = \frac{\lambda n t_0}{C} \sum_{i=1}^{m} c_{0i} C_i.   (3.10)
Because the total warranty cost must not exceed Cw∗ , the maximum allowable
failure rate of the product can be written as
\lambda^* = \frac{C_w^* C}{n t_0 \sum_{i=1}^{m} c_{0i} C_i}.   (3.11)
Example 3.2 A product consists of five subsystems. The production cost and
cost per repair of each subsystem are shown in Table 3.1. The manufacturer plans
to produce 150,000 units of such a product and requires that the total warranty
cost be less than $1.2 million in the warranty period of one year. Determine the
maximum allowable failure rate of the product.
TABLE 3.1  Cost per Repair and Production Cost of the Subsystems

Subsystem i      1     2     3     4     5
c_{0i} ($)      25    41    68    35    22
C_i ($)         38    55   103    63    42
SOLUTION The total production cost is C = \sum_{i=1}^{5} C_i = $301. From (3.11), the maximum allowable failure rate of the product is

\lambda^* = \frac{1{,}200{,}000 \times 301}{150{,}000 \times 8760 \times (25 \times 38 + \cdots + 22 \times 42)} = 2.06 \times 10^{-5} \text{ failures per hour.}
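The calculation in Example 3.2 is easy to script. The following Python sketch simply evaluates (3.11) with the data of Table 3.1; it is an added illustration, not part of the original solution.

```python
# Sketch of the Example 3.2 calculation using (3.11); the subsystem data
# come from Table 3.1 and the warranty requirements stated in the example.
c0 = [25, 41, 68, 35, 22]        # cost per repair of each subsystem ($)
C = [38, 55, 103, 63, 42]        # production cost of each subsystem ($)
Cw_max = 1_200_000               # maximum allowable total warranty cost ($)
n_units = 150_000                # number of units to be produced
t0 = 8760                        # one-year warranty period (hours)

C_total = sum(C)                                          # total production cost
weighted = sum(c * Ci for c, Ci in zip(c0, C))            # sum of c0i * Ci
lam_max = Cw_max * C_total / (n_units * t0 * weighted)    # equation (3.11)
print(f"Maximum allowable failure rate: {lam_max:.2e} failures per hour")
# prints approximately 2.06e-05, in agreement with the example
```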
\lambda^* = \frac{C_w^* C}{n \int_0^{t_0} \Pr[U(t) \le u_0]\, dt \; \sum_{i=1}^{m} c_{0i} C_i}.   (3.13)
Example 3.3 Refer to Example 3.2. Suppose that the product is warranted with
a two-dimensional plan covering one year or 12,000 cycles, whichever comes
first. The use is accumulated linearly at a constant rate (cycles per month) for
a particular customer. The rate varies from customer to customer and can be
modeled using the lognormal distribution with scale parameter 6.5 and shape
parameter 0.8. Calculate the maximum allowable failure rate of the product.
SOLUTION The probability that a product accumulates less than 12,000 cycles
by t months is
\Pr[U(t) \le 12{,}000] = \Phi\left[\frac{\ln(12{,}000) - 6.5 - \ln(t)}{0.8}\right] = \Phi\left[\frac{2.893 - \ln(t)}{0.8}\right].
From (3.13), the maximum allowable failure rate is

\lambda^* = \frac{1{,}200{,}000 \times 301}{150{,}000 \int_0^{12} \Phi\{[2.893 - \ln(t)]/0.8\}\, dt \; (25 \times 38 + \cdots + 22 \times 42)} = 0.0169 \text{ failures per month} = 2.34 \times 10^{-5} \text{ failures per hour.}
Comparing this result with that from Example 3.2, we note that the two-
dimensional warranty plan yields a less stringent reliability requirement, which
favors the manufacturer.
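For the two-dimensional warranty plan, the integral in (3.13) must be evaluated numerically. The Python sketch below reproduces the Example 3.3 calculation, using scipy for the standard normal cdf and the integration; it is an added illustration, and the per-month result (about 0.0169) matches the example.

```python
# Illustrative sketch of the Example 3.3 calculation based on (3.13).
# The usage accumulation rate is lognormal with scale (log-mean) 6.5 and
# shape (log-sd) 0.8, so Pr[U(t) <= 12,000] = Phi[(ln 12,000 - 6.5 - ln t)/0.8].
from math import log
from scipy.stats import norm
from scipy.integrate import quad

c0 = [25, 41, 68, 35, 22]
C = [38, 55, 103, 63, 42]
Cw_max = 1_200_000
n_units = 150_000
t0 = 12  # warranty limit in months

prob_within_cycles = lambda t: norm.cdf((log(12_000) - 6.5 - log(t)) / 0.8)
integral, _ = quad(prob_within_cycles, 0, t0)

lam_max = Cw_max * sum(C) / (n_units * integral * sum(c * Ci for c, Ci in zip(c0, C)))
print(f"lambda* = {lam_max:.4f} failures per month")   # about 0.0169
```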
FIGURE 3.7 Costs and savings associated with a proactive reliability program (reliability program cost, failure cost, and total cost plotted against reliability; savings in design, verification, and production shift the optimum toward higher reliability)
testing and were essentially reactive. Such programs do not add much value at the
beginning of the design cycle. Nowadays, reliability design techniques such as
robust design are being integrated into the design process to build reliability into
products. The proactive methods break the design–test–fix loop, and thus greatly
reduce the time to market and cost. In almost every project, reliability investment
is returned with substantial savings in design, verification, and production costs.
Figure 3.7 illustrates the costs and savings. As a result of the savings, the total
cost is reduced, especially when the required reliability is high. If the costs and
savings can be quantified, the optimal reliability level is the one that minimizes
the total cost. Clearly, the optimal reliability is considerably larger than the one
given by the conventional total cost model.
As we know, modeling failure cost is a relatively easy task. However, estimat-
ing the costs and savings associated with a reliability program is difficult, if not
impossible. Thus, in most applications the quantitative reliability requirements
cannot be obtained by minimizing the total cost. Nevertheless, the principle of total cost optimization is universally applicable and is indeed useful in justifying a high reliability target and the necessity of implementing a proactive reliability program to achieve that target.
they are well orchestrated and integrated into a reliability program. In this section
we describe a generic reliability program, considerations for developing product-
specific programs, and management of reliability programs.
In the design and development stage and before prototypes are created, reli-
ability tasks are to build reliability and robustness into products and to prevent
FIGURE 3.8 Reliability tasks throughout the product life cycle (e.g., reliability history analysis, reliability planning and specification, robust design, stress derating, concept and design FMEA, reliability estimation, design verification analysis, accelerated life testing, accelerated testing, stress screening, acceptance sampling, failure analysis, warranty data analysis, warranty cost analysis, and customer feedback analysis)
potential failure modes from occurrence. The reliability tasks in this stage are
described below.
In the product verification and process validation stage, reliability tasks are
intended to verify that the design achieves the reliability target, to validate that the
production process is capable of manufacturing products that meet the reliability
requirements, and to analyze the failure modes and mechanisms of the units
that fail in verification and validation tests. As presented in Chapter 1, process
planning is performed in this phase to determine the methods of manufacturing
the product. Thus, also needed are reliability tasks that assure process capability.
The tasks that may be executed in this phase are explained below.
In the production stage, the objective of reliability tasks is to assure that the
production process has minimum detrimental impact on the design reliability.
The tasks that may be implemented in this phase are described as follows.
should be aligned with those of the design, verification, and production activi-
ties. For example, QFD is a tool for translating customer wants into engineering
requirements and supports product planning. Thus, it should be performed in the
product planning stage. Another example is that of design controls, which are
conducted only after the design schematics are completed and before a design is
released for production. In situations where multiple reliability tasks are devel-
oped to support an engineering design, verification, or production activity, the
sequence of these tasks should be orchestrated carefully to reduce the associated
cost and time. Doing so requires a full understanding of the interrelationships among
the reliability tasks. If a task generates outputs that serve as the inputs of another
task, it should be completed earlier. For instance, thermal analysis as a method of
design control for a printed circuit board yields a temperature distribution which
can be an input to reliability prediction. Thus, the reliability prediction may begin
after the thermal analysis has been completed.
The time line for a reliability program should accommodate effects due to
changes in design, verification, and production plans. Whenever changes take
place, some reliability tasks need to be revised accordingly. For example, design
changes must trigger the modification of design FMEA and FTA and the rep-
etition of design control tasks to verify the revised design. In practice, some
reliability tasks have to be performed in early stages of the life cycle with lim-
ited information. As the life cycle proceeds, the tasks may be repeated with more
data and more specific product configurations. A typical example is reliability
prediction, which is accomplished in the early design stage based on part count
to provide inputs for comparison of design alternatives and is redone later to pre-
dict field reliability with specific product configuration, component information,
stress levels, and prototype test data.
Once the reliability program and time lines are established, implementation
strategies should be developed to assure and improve the effectiveness of the
program. This lies in the field of reliability program management and is discussed
in the next subsection.
individual disciplines that it utilizes but in the synergy that all the different
methodologies and tools provide in the pursuit of improvement. Currently, there
are two types of six-sigma approaches: six sigma and DFSS. The six-sigma
approach is aimed at resolving the problems of existing products or processes
through use of the DMAIC process, introduced in Section 3.4.1. Because of its
reactive nature, it is essentially a firefighting method, and thus the value it can add
is limited. In contrast, DFSS is a proactive approach deployed at the beginning
of a design cycle to avoid building potential failure modes into a product. Thus,
DFSS is capable of preventing failure modes from occurring. DFSS is useful in
the design of a new product or the redesign of an existing product.
DFSS is implemented by following the ICOV process, where I stands for
identify, C for characterize, O for optimize, and V for validate. The main activities
in each phase of the ICOV process are described below.
defined in the I phase are the reliability team and the roles and responsibilities
of the team members.
The second phase of reliability design in the framework of the ICOV process
is “characterize design”, which occurs in the early design and development stage.
In this phase the technical characteristics identified in the I phase are translated
further into product functional characteristics to be used in the O phase. The
translation is done by expanding the first house of quality to the second house.
Reliability modeling, allocation, prediction, FMEA, and FTA may be performed
to help develop design alternatives. For example, a concept FMEA may rule out
potential design alternatives that have a high failure risk. Once detailed design
alternatives are generated, they are evaluated with respect to reliability by apply-
ing reliability techniques such as reliability prediction, FMEA, FTA, and design
control methods. Outputs from this phase are the important product characteristics
and the best design alternative that has high reliability.
The next phase in reliability design by the ICOV process is “optimize design”,
which is implemented in the late design and development stage. The concept
design has been finalized by this stage and detailed design is being performed.
The purpose of reliability design in this phase is to obtain the optimal setting
of design parameters that maximizes reliability and makes product performance
insensitive to use condition and process variation. The main reliability tasks
applied for this purpose include robust reliability design, accelerated life testing,
accelerated degradation testing, and reliability estimation. Functional and relia-
bility performance may be predicted at the optimal setting of design parameters.
Design control methods such as thermal analysis and mechanical stress analysis
should be implemented after the design optimization is completed to verify that
the optimal design is free of critical potential failure modes.
The last phase of reliability design by the ICOV process is to validate the
optimal design in the verification and validation stage. In this phase, samples are
built and tested to verify that the design has achieved the reliability target. The
test conditions should reflect real-world usage. For this purpose, a P-diagram (a
tool for robust design, described in Chapter 5) may be employed to determine
the noise factors that the product will encounter in the field. Accelerated tests
may be conducted to shorten the test time; however, they should be correlated with
real-world use. Failure analysis must be performed to reveal the causes of failure.
This may be followed by a recommendation for design change.
In summary, the DFSS approach provides a lean and nimble process by which
reliability design can be performed in a more efficient way. Although this process
improves the effectiveness of reliability design, the success of reliability design
relies heavily on each reliability task. Therefore, it is vital to develop suitable
reliability tasks that are capable of preventing and detecting potential failure
modes in the design and development stage. It is worth noting that the DFSS is
a part of a reliability program. Completion of the DFSS process is not the end
of the program; rather, the reliability tasks for production and field deployment
should begin or continue.
PROBLEMS
3.1 Define the three types of customer expectations, and give an example of each
type. Explain how customer expectation for reliability influences customer
satisfaction.
3.2 Describe the QFD process and the inputs and outputs of each house of
quality. Explain the roles of QFD in reliability planning and specification.
3.3 Perform a QFD analysis for a product of your choice: for example, a lawn
mower, an electrical stove, or a refrigerator.
3.4 A QFD analysis indicates that the customer expectations for a product include
E1 , E2 , E3 , and E4 , which have customer desirability values of 9, 9, 8, and
3, respectively. The QFD strongly links E1 to performance characteristics
Y1 and Y3 , E2 to Y1 and Y2 , and both E3 and E4 to Y2 and Y3 . The required
customer satisfaction for E1 , E2 , E3 , and E4 is 90%, 95%, 93%, and 90%,
respectively. Calculate the reliability target.
3.5 A manufacturer is planning to produce 135,000 units of a product which
are warranted for 12 months in service. The manufacturer sets the maximum
allowable warranty cost to $150,000 and expects the average cost per repair to
be $28. Determine the reliability target at 12 months to achieve the warranty
objective.
3.6 Refer to Example 3.3. Suppose that the customers accumulate usage at higher
rates, which can be modeled with the lognormal distribution with scale
parameter 7.0 and shape parameter 0.8. Determine the minimum reliability
requirement. Compare the result with that from Example 3.3, and comment
on the difference.
3.7 Describe the roles of reliability tasks in each phase of the product life cycle
and the principles for developing an effective reliability program.
3.8 Explain the process of six sigma and design for six sigma (DFSS). What are
the benefits of performing reliability design through the DFSS approach?
4
SYSTEM RELIABILITY EVALUATION
AND ALLOCATION
4.1 INTRODUCTION
overall reliability target for a car should be allocated to the powertrain, chassis,
body, and electrical subsystem. The reliability allocated to the powertrain is fur-
ther apportioned to the engine, transmission, and axle. The allocation process is
continued until the assembly level is reached. Then the auto suppliers are obli-
gated to achieve the reliability of the assemblies they are contracted to deliver.
In this chapter we present various reliability allocation methods.
A comprehensive reliability program usually requires evaluation of system
(product) reliability in the design and development stage for various purposes,
including, for example, selection of materials and components, comparison of
design alternatives, and reliability prediction and improvement. Once a system
or subsystem design is completed, the reliability must be evaluated and com-
pared with the reliability target that has been specified or allocated. If the target
is not met, the design must be revised, which necessitates a reevaluation of reli-
ability. This process continues until the desired reliability level is attained. In
the car example, the reliability of the car should be calculated after the system
configuration is completed and assembly reliabilities are available. The process
typically is repeated several times and may even invoke reliability reallocation
if the targets of some subsystems are unattainable.
In this chapter we describe methods for evaluating the reliability of systems
with different configurations, including series, parallel, series–parallel, and k-out-
of-n voting. Methods of calculating confidence intervals for system reliability are
delineated. We also present measures of component importance. Because system
configuration knowledge is a prerequisite to reliability allocation, it is presented
first in the chapter.
FIGURE 4.1 Hierarchical configuration of a typical automobile (second-level subsystems: exterior, interior, engine, transmission, axle, power supply, customer features, vehicle control, brakes, suspension)
FIGURE 4.2 Reliability block diagram with blocks representing first-level subsystems (blocks 1 to 4 in series)

FIGURE 4.3 Reliability block diagram with blocks representing second-level subsystems (blocks 1 to 10 in series)
the reliability block diagram of the automobile, in which the blocks represent
the first-level subsystems, assuming that their reliabilities are known. Figure 4.3
is a diagram illustrating second-level subsystems. Comparing Figure 4.2 with
Figure 4.3, we see that the complexity of a reliability block diagram increases
with the level of subsystem that blocks represent. The reliability block diagram
of a typical automobile contains over 12,000 blocks if each block represents a
component or part.
R = \Pr(E) = \Pr(E_1 \cdot E_2 \cdots E_n) = \prod_{i=1}^{n} R_i.   (4.1)

When the n components are identical with a common reliability R_0, (4.1) becomes

R = R_0^n.   (4.2)
Equation (4.1) indicates that the system reliability is the product of reliabilities
of components. This result is unfortunate, in that the system reliability is less than
the reliability of any component. Furthermore, the system reliability decreases
rapidly as the number of components in a system increases. The observations
support the principle of minimizing the complexity of an engineering design.
Let’s consider a simple case where the times to failure of n components in a
system are modeled with the exponential distribution. The exponential reliability
function for component i is Ri (t) = exp(−λi t), where λi is the failure rate of
component i. Then from (4.1), the system reliability can be written as
R(t) = \exp\left(-t \sum_{i=1}^{n} \lambda_i\right) = \exp(-\lambda t),   (4.3)

where

\lambda = \sum_{i=1}^{n} \lambda_i   (4.4)

is the failure rate of the system.
Example 4.2 Refer to Figure 4.2. Suppose that the lifetimes of the body, pow-
ertrain, and electrical and chassis subsystems are exponentially distributed with
λ1 = 5.1 × 10−4 , λ2 = 6.3 × 10−4 , λ3 = 5.5 × 10−5 , and λ4 = 4.8 × 10−4 fail-
ures per 1000 miles, respectively. Calculate the reliability of the vehicle at 36,000
miles and the mean mileage to failure.
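The solution follows directly from (4.3): the subsystem failure rates add, and the mean mileage to failure of an exponential system is the reciprocal of the total failure rate. The short Python sketch below illustrates the computation; it is an added illustration rather than the book's worked solution.

```python
# A quick numerical sketch of Example 4.2 using (4.3):
# for a series system of exponential components, the failure rates add,
# and for the exponential distribution the MTTF is 1/lambda.
from math import exp

lams = [5.1e-4, 6.3e-4, 5.5e-5, 4.8e-4]   # failures per 1000 miles
lam = sum(lams)                           # system failure rate, per 1000 miles

mileage = 36          # 36,000 miles expressed in units of 1000 miles
R = exp(-lam * mileage)
mttf = 1 / lam        # mean mileage to failure, in 1000 miles

print(f"R(36,000 miles) = {R:.4f}")                     # about 0.94
print(f"Mean mileage to failure = {mttf*1000:,.0f} miles")
```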
R_i(t) = \exp\left[-\left(\frac{t}{\alpha_i}\right)^{\beta_i}\right],
where βi and αi are, respectively, the shape parameter and the characteristic life
of component i. From (4.1) the system reliability is
R(t) = \exp\left[-\sum_{i=1}^{n} \left(\frac{t}{\alpha_i}\right)^{\beta_i}\right].   (4.6)
The corresponding failure rate of the system is

h(t) = \sum_{i=1}^{n} \frac{\beta_i}{\alpha_i} \left(\frac{t}{\alpha_i}\right)^{\beta_i - 1}.   (4.7)
Equation (4.7) indicates that like the exponential case, the failure rate of the
system is the sum of all individual failure rates. When βi = 1, (4.7) reduces
to (4.4), where λi = 1/αi .
If the n components have a common shape parameter β, the mean time to
failure of the system is given by
\mathrm{MTTF} = \int_0^\infty R(t)\, dt = \frac{\Gamma[(1/\beta) + 1]}{\left[\sum_{i=1}^{n} (1/\alpha_i)^{\beta}\right]^{1/\beta}},   (4.8)
[Figure: circuit consisting of an AC power supply, a capacitor, an inductor, and a resistor; the corresponding series reliability block diagram (components 1 to 4) and Weibull parameters are given in Figure 4.5]
parameters shown in Figure 4.5. Calculate the reliability and failure rate of the circuit at 5 × 10^4 hours.
SOLUTION Substituting the values of the Weibull parameters into (4.6) gives

R(5 \times 10^4) = \exp\left[-\left(\frac{5 \times 10^4}{3.3 \times 10^5}\right)^{1.3} - \left(\frac{5 \times 10^4}{1.5 \times 10^6}\right)^{1.8} - \left(\frac{5 \times 10^4}{4.7 \times 10^6}\right)^{1.6} - \left(\frac{5 \times 10^4}{7.3 \times 10^5}\right)^{2.3}\right] = 0.913.
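The same computation, together with the system failure rate from (4.7), can be scripted as follows. This is an added illustrative sketch; the component data are those given in the example.

```python
# Sketch of the Weibull series-system computation in the example above,
# using (4.6) for reliability and (4.7) for the system failure rate.
from math import exp

betas  = [1.3, 1.8, 1.6, 2.3]                        # shape parameters
alphas = [3.3e5, 1.5e6, 4.7e6, 7.3e5]                # characteristic lives (hours)
t = 5e4

R = exp(-sum((t / a) ** b for a, b in zip(alphas, betas)))            # equation (4.6)
h = sum((b / a) * (t / a) ** (b - 1) for a, b in zip(alphas, betas))  # equation (4.7)

print(f"R(5e4 h) = {R:.3f}")           # about 0.913
print(f"h(5e4 h) = {h:.2e} per hour")  # system failure rate at 5e4 hours
```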
A system is said to be a parallel system if and only if the failure of all components
within the system results in the failure of the entire system. In other words, a
parallel system succeeds if one or more components are operational. For example,
the lighting system that consists of three bulbs in a room is a parallel system,
because room blackout occurs only when all three bulbs break. The reliability
block diagram of the lighting system is shown in Figure 4.6. The reliability of a
general parallel system is calculated as follows.
Suppose that a parallel system consists of n mutually independent components.
We use the following notation: Ei is the event that component i is operational;
FIGURE 4.6 Reliability block diagram of the lighting system (bulbs 1, 2, and 3 in parallel)
R = 1 - \prod_{i=1}^{n} (1 - R_i).   (4.10)

When the n components are identical with a common reliability R_0, (4.10) becomes

R = 1 - (1 - R_0)^n,   (4.11)

and the minimum number of components required to attain a system reliability R is

n = \frac{\ln(1 - R)}{\ln(1 - R_0)}.   (4.12)
Example 4.4 Refer to Figure 4.6. Suppose that the lighting system uses three
identical bulbs and that other components within the system are 100% reliable.
The times to failure of the bulbs are Weibull with shape parameter β = 1.35 and characteristic life α = 35,800 hours. Calculate the reliability of the system after 8760 hours of
use. If the system reliability target is 99.99% at this time, how many bulbs
should be connected in parallel?
SOLUTION Since the life of the bulbs is modeled with the Weibull distribution,
the reliability of a single bulb after 8760 hours of use is
R_0 = \exp\left[-\left(\frac{8760}{35{,}800}\right)^{1.35}\right] = 0.8611.
Substituting the value of R0 into (4.11) gives the system reliability at 8760 hours
as R = 1 − (1 − 0.8611)^3 = 0.9973. From (4.12), the minimum number of bulbs
required to achieve 99.99% reliability is
n = \frac{\ln(1 - 0.9999)}{\ln(1 - 0.8611)} = 4.67,

which is rounded up to n = 5.
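The following added Python sketch reproduces the two calculations of Example 4.4 from (4.11) and (4.12).

```python
# Sketch of Example 4.4: reliability of the three-bulb parallel system via
# (4.11) and the minimum number of bulbs needed via (4.12).
from math import exp, log, ceil

beta, alpha = 1.35, 35_800          # Weibull shape and characteristic life (h)
t = 8760
R0 = exp(-(t / alpha) ** beta)      # single-bulb reliability, about 0.8611

R_system = 1 - (1 - R0) ** 3                   # equation (4.11) with n = 3
n_min = ceil(log(1 - 0.9999) / log(1 - R0))    # equation (4.12), rounded up

print(f"R0 = {R0:.4f}, system reliability = {R_system:.4f}, bulbs needed = {n_min}")
```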
There are situations in which series and parallel configurations are mixed in a
system design to achieve functional or reliability requirements. The combinations
form series–parallel and parallel–series configurations. In this section we discuss
the reliability of these two types of systems.
[Figure: series–parallel configuration — n subsystems in series, where subsystem i consists of m_i components in parallel]
design. To calculate the system reliability, we first reduce each parallel subsystem
to an equivalent reliability block. From (4.10), the reliability Ri of block i is
R_i = 1 - \prod_{j=1}^{m_i} (1 - R_{ij}),   (4.15)
When all components in the series–parallel system are identical and the num-
ber of components in each subsystem is equal, (4.16) simplifies to
R = [1 - (1 - R_0)^m]^n,   (4.17)
R_i = \prod_{j=1}^{n_i} R_{ij}, \qquad i = 1, 2, \ldots, m,   (4.18)
[Figure: parallel–series configuration — m subsystems in parallel, where subsystem i consists of n_i components in series]
If all components in the parallel–series system are identical and the number of
components in each subsystem is equal, the system reliability can be written as
R = 1 - (1 - R_0^n)^m,   (4.20)
Example 4.5 Suppose that an engineer is given four identical components, each
having 90% reliability at the design life. The engineer wants to choose the system
design that has a higher reliability from between the series–parallel and paral-
lel–series configurations. The two configurations are shown in Figures 4.11 and
4.12. Which design should the engineer select from the reliability perspective?
[Figures 4.11 and 4.12: series–parallel and parallel–series configurations of the four components]
SOLUTION For the series–parallel design (Figure 4.11), (4.17) gives R = [1 - (1 - 0.9)^2]^2 = 0.9801. For the parallel–series design (Figure 4.12), (4.20) gives R = 1 - (1 - 0.9^2)^2 = 0.9639. From the reliability perspective, the engineer should therefore select the series–parallel design.
[Figure: system reliability R versus component reliability R_0 for series–parallel (S–P) and parallel–series (P–S) configurations with (n, m) = (3, 3), (2, 3), and (3, 2)]
where λ is the component failure rate. The mean time to failure of the system is
\mathrm{MTTF} = \int_0^\infty R(t)\, dt = \frac{1}{\lambda} \sum_{i=k}^{n} \frac{1}{i}.   (4.24)
Example 4.6 A web host has five independent and identical servers connected
in parallel. At least three of them must operate successfully for the web service
not to be interrupted. The server life is modeled with the exponential distribution
with λ = 2.7 × 10−5 failures per hour. Calculate the mean time between failures
(MTBF) and the reliability of the web host after one year of continuous service.
SOLUTION From (4.24), the mean time between failures is

\mathrm{MTBF} = \frac{1}{2.7 \times 10^{-5}} \sum_{i=3}^{5} \frac{1}{i} = 2.9 \times 10^4 \text{ hours.}
Substituting the given data into (4.23) yields the reliability of the web host at
8760 hours (one year) as
R(8760) = \sum_{i=3}^{5} C_5^i e^{-2.7 \times 10^{-5} \times 8760 i} (1 - e^{-2.7 \times 10^{-5} \times 8760})^{5-i} = 0.9336.
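The k-out-of-n:G computation is easily scripted. The added sketch below reproduces the Example 4.6 results from (4.23) and (4.24).

```python
# Sketch of Example 4.6: a 3-out-of-5:G system of exponential servers.
from math import comb, exp

lam, n, k, t = 2.7e-5, 5, 3, 8760
p = exp(-lam * t)                                    # single-server reliability

# k-out-of-n:G reliability, as in (4.23)
R = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Mean time between failures, as in (4.24)
mtbf = sum(1 / i for i in range(k, n + 1)) / lam

print(f"R(1 year) = {R:.4f}")       # about 0.9336
print(f"MTBF = {mtbf:,.0f} hours")  # about 2.9e4 hours
```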
F = \Pr(\bar{E}_1 \cdot \bar{E}_2 + \bar{E}_1 \cdot \bar{E}_3 + \bar{E}_2 \cdot \bar{E}_3).   (4.25)
[Figure: equivalent structure of the 2-out-of-3:G system — three parallel branches, each containing two components in series: (1, 2), (1, 3), and (2, 3)]
The equation indicates that the system fails if any of the three events \bar{E}_1 \cdot \bar{E}_2, \bar{E}_1 \cdot \bar{E}_3, or \bar{E}_2 \cdot \bar{E}_3 occurs. Each such event is called a minimal cut set. The definition and application of the minimal cut set is presented in Section 4.8 and discussed further in Chapter 6. As shown in (4.25), a 2-out-of-3:G system has three minimal cut sets, and each contains two elements. In general, a k-out-of-n:G system contains C_n^{n-k+1} minimal cut sets, and each consists of exactly n − k + 1 elements.
Let’s continue the computation of the probability of failure. Equation (4.25)
can be expanded to
R =1−F
= 1 − (1 − R1 )(1 − R2 ) − (1 − R1 )(1 − R3 ) − (1 − R2 )(1 − R3 )
+ 2(1 − R1 )(1 − R2 )(1 − R3 ). (4.26)
The reliability is the same as that obtained from (4.22). Note that unlike (4.22),
(4.26) does not require the components to be identical in order to calculate the
system reliability. Hence, transformation of a k-out-of-n:G system to an equiva-
lent parallel system provides a method for calculating the system reliability for
cases where component reliabilities are unequal.
the function when the primary unit fails. Failure of the system occurs only when
some or all of standby units fail. Hence, redundancy is a system design technique
that can increase system reliability. Such a technique is used widely in critical
systems. A simple example is an automobile equipped with a spare tire. Whenever
a tire fails, it is replaced with the spare tire so that the vehicle is still drivable.
A more complicated example is described in W. Wang and Loman (2002). A
power plant designed by General Electric consists of n active and one or more
standby generators. Normally, each of the n generators runs at 100(n − 1)/n
percent of its full load and together supplies 100% load to end users, where
n − 1 generators can fully cover the load. When any one of the active generators
fails, the remaining n − 1 generators will make up the power loss such that the
output is still 100%. Meanwhile, the standby generator is activated and ramps
to 100(n − 1)/n percent, while the other n − 1 generators ramp back down to
100(n − 1)/n percent.
If a redundant unit is fully energized when the system is in use, the redundancy
is called active or hot standby. Parallel and k-out-of-n:G systems described in the
preceding sections are typical examples of active standby systems. If a redun-
dant unit is fully energized only when the primary unit fails, the redundancy is
known as passive standby. When the primary unit is successfully operational, the
redundant unit may be kept in reserve. Such a unit is said to be in cold standby.
A cold standby system needs a sensing mechanism to detect failure of the pri-
mary unit and a switching actuator to activate the redundant unit when a failure
occurs. In the following discussion we use the term switching system to include
both the sensing mechanism and the switching actuator. On the other hand, if the
redundant unit is partially loaded in the waiting period, the redundancy is a warm
standby. A warm standby unit usually is subjected to a reduced level of stress and
may fail before it is fully activated. According to the classification scheme above,
the spare tire and redundant generators described earlier are in cold standby. In
the remainder of this section we consider cold standby systems with a perfect or
imperfect switching system. Figure 4.15 shows a cold standby system consist-
ing of n components and a switching system; in this figure, component 1 is the
primary component and S represents the switching system.
[Figure 4.15: cold standby system with primary component 1, standby components 2, ..., n, and switching system S]
With a perfect switching system, the life T of the system is the sum of the lives T_i of the n components:

T = \sum_{i=1}^{n} T_i.   (4.27)
If the n components are identical and exponentially distributed with failure rate \lambda, T has a gamma distribution with pdf

f(t) = \frac{\lambda^n}{\Gamma(n)} t^{n-1} e^{-\lambda t},   (4.28)
where Γ(·) is the gamma function, defined in Section 2.5. The system reliability is

R(t) = \int_t^\infty \frac{\lambda^n}{\Gamma(n)} \tau^{n-1} e^{-\lambda \tau}\, d\tau = \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} e^{-\lambda t}.   (4.29)
The mean time to failure of the system is given by the gamma distribution as
\mathrm{MTTF} = \frac{n}{\lambda}.   (4.30)
Alternatively, (4.30) can also be derived from (4.27). Specifically,
\mathrm{MTTF} = E(T) = \sum_{i=1}^{n} E(T_i) = \sum_{i=1}^{n} \frac{1}{\lambda} = \frac{n}{\lambda}.
Example 4.7 A small power plant is equipped with two identical generators,
one active and the other in cold standby. Whenever the active generator fails, the
redundant generator is switched to working condition without interruption. The
life of the two generators can be modeled with the exponential distribution with
λ = 3.6 × 10−5 failures per hour. Calculate the power plant reliability at 5000
hours and the mean time to failure.
SOLUTION From (4.29) with n = 2, the reliability at 5000 hours is R(5000) = (1 + \lambda t) e^{-\lambda t} = (1 + 0.18) e^{-0.18} = 0.9856. From (4.30), the mean time to failure is

\mathrm{MTTF} = \frac{2}{3.6 \times 10^{-5}} = 5.56 \times 10^4 \text{ hours.}
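The added sketch below evaluates (4.29) and (4.30) for this two-unit cold standby system.

```python
# Sketch of Example 4.7: two-unit cold standby with perfect switching.
# With n identical exponential units, the system life is Erlang(n, lambda),
# so R(t) follows (4.29) and MTTF = n/lambda follows (4.30).
from math import exp, factorial

lam, n, t = 3.6e-5, 2, 5000
R = sum((lam * t) ** i / factorial(i) for i in range(n)) * exp(-lam * t)
mttf = n / lam

print(f"R(5000 h) = {R:.4f}")       # about 0.986
print(f"MTTF = {mttf:,.0f} hours")  # about 5.56e4 hours
```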
• The primary component (whose life is T_1) does not fail in time t; that is, T_1 ≥ t.
• If the primary component fails at time τ (τ < t), the cold standby component (whose life is T_2) continues the function and does not fail in the remaining time (t − τ). Probabilistically, the event is described by (T_1 < t) · (T_2 ≥ t − τ).
Since the above two events are mutually exclusive, the system reliability is

R(t) = R_1(t) + \int_0^t f_1(\tau) R_2(t - \tau)\, d\tau,   (4.32)

where R_i and f_i are, respectively, the reliability and pdf of component i. In most situations, evaluation of (4.32) requires a numerical method. As a special case, when the two components are identically and exponentially distributed, (4.32) reduces to (4.31).
For some switching systems, such as human operators, the reliability may not
change over time. In these situations, R0 (τ ) is static or independent of time. Let
R0 (τ ) = p0 . Then (4.33) can be written as
R(t) = e^{-\lambda t} + \int_0^t p_0 \lambda e^{-\lambda \tau} e^{-\lambda (t - \tau)}\, d\tau = (1 + p_0 \lambda t) e^{-\lambda t}.   (4.34)
Note the similarity and difference between (4.31) for a perfect switching system
and (4.34) for an imperfect one. Equation (4.34) reduces to (4.31) when p0 = 1.
The mean time to failure of the system is
\mathrm{MTTF} = \int_0^\infty R(t)\, dt = \frac{1 + p_0}{\lambda}.   (4.35)
Figure 4.16 plots r0 and r1 for various values of δ. It can be seen that the
unreliability of the switching system has stronger effects on MTTF than on the
[Figure 4.16: r_0 and r_1 plotted against δ]
reliability of the entire system. Both quantities are largely reduced when λ0 is
greater than 10% of λ. The effects are alleviated by the decrease in λ0 , and
become nearly negligible when λ0 is less than 1% of λ.
Example 4.8 Refer to Example 4.7. Suppose that the switching system is sub-
ject to failure following the exponential distribution with λ0 = 2.8 × 10−5 failures
per hour. Calculate the power plant reliability at 5000 hours and the mean time
to failure.
SOLUTION The mean time to failure is

\mathrm{MTTF} = \frac{1}{3.6 \times 10^{-5}} + \frac{1}{2.8 \times 10^{-5}} - \frac{3.6 \times 10^{-5}}{2.8 \times 10^{-5} (3.6 \times 10^{-5} + 2.8 \times 10^{-5})} = 4.34 \times 10^4 \text{ hours.}
Comparing these results with those in Example 4.7, we note the adverse effects
of the imperfect switching system.
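The added Python sketch below illustrates one common model for an exponential switching system, in which the standby unit is activated only if the switching system has survived to the instant of primary failure. This model is consistent with the MTTF expression above, but the closed-form reliability expression in the code is an assumption of the sketch rather than a formula quoted from the text.

```python
# Illustrative sketch of Example 4.8: two-unit cold standby with an
# exponential switching system (rate lam0). The standby is assumed to be
# activated only if the switch has not failed when the primary unit fails.
from math import exp

lam, lam0, t = 3.6e-5, 2.8e-5, 5000

# R(t) = e^{-lam t} + (lam/lam0) e^{-lam t} (1 - e^{-lam0 t}) under this model
R = exp(-lam * t) * (1 + (lam / lam0) * (1 - exp(-lam0 * t)))

# Mean time to failure, matching the expression in the example
mttf = 1 / lam + 1 / lam0 - lam / (lam0 * (lam + lam0))

print(f"R(5000 h) = {R:.4f}")
print(f"MTTF = {mttf:,.0f} hours")   # about 4.34e4 hours
```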
SOLUTION The steps for calculating the system reliability are as follows:
1. Decompose the system into blocks A, B, C, and D, which represent a
parallel–series, parallel, series, and cold standby subsystem, respectively,
as shown in Figure 4.17.
2. Calculate the reliabilities of blocks A, B, C, and D. From (4.19), the reli-
ability of block A is
R_A = 1 - (1 - R_1 R_2)(1 - R_3 R_4) = 1 - \left[1 - e^{-(1.2 + 2.3) \times 10^{-4} \times 600}\right]\left[1 - e^{-(0.9 + 1.6) \times 10^{-4} \times 600}\right] = 0.9736.
Now that the original system has been reduced to a single unit, as shown in
Figure 4.20, the reduction process is exhausted. Then the system reliability is
R = RG = 0.9367.
[Figure: reliability block diagram containing keystone component A and components B through F]
R = \Pr(\text{system good} \mid A) \Pr(A) + \Pr(\text{system good} \mid \bar{A}) \Pr(\bar{A}),

where A is the event that keystone component A is 100% reliable, \bar{A} the event that keystone component A has failed, \Pr(\text{system good} \mid A) the probability that the system is functionally successful given that component A never fails, and \Pr(\text{system good} \mid \bar{A}) the probability that the system is functionally successful given that component A has failed. The efficiency of the method depends on the selection of the keystone component. An appropriate choice of the component leads to an efficient calculation of the conditional probabilities.
Example 4.10 Consider the bridge system in Figure 4.21. Suppose that the reli-
ability of component i is Ri , i = 1, 2, . . . , 5. Calculate the system reliability.
[Figure 4.21: bridge system of components 1 to 5, with component 5 bridging the paths 1–2 and 3–4; Figures 4.22 and 4.23: reduced structures given that component 5 works or has failed, respectively]
Then the system is reduced as shown in Figure 4.22. The reduced system is a series–parallel structure, and the conditional reliability is

\Pr(\text{system good} \mid 5) = [1 - (1 - R_1)(1 - R_3)][1 - (1 - R_2)(1 - R_4)].

The next step is to assume that component 5 has failed and is removed from the system structure. Figure 4.23 shows the new configuration, which is a parallel–series system. The conditional reliability is

\Pr(\text{system good} \mid \bar{5}) = 1 - (1 - R_1 R_2)(1 - R_3 R_4).

The system reliability then follows from the total probability formula: R = R_5 \Pr(\text{system good} \mid 5) + (1 - R_5) \Pr(\text{system good} \mid \bar{5}).
Equation (4.43) has four terms. In general, for binary components, if m key-
stone components are selected simultaneously, the reliability equation contains
2m terms. Each term is the product of the reliability of one of the decomposed
subsystems and that of the condition on which the subsystem is formed.
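The decomposition of the bridge system is compact enough to express directly in code. The added sketch below implements the total probability formula with component 5 as the keystone; the numerical component reliabilities used in the call are hypothetical values chosen for illustration.

```python
# Sketch of the decomposition (keystone) method for the bridge system of
# Figure 4.21, with component 5 as the keystone.
def bridge_reliability(R1, R2, R3, R4, R5):
    good_given_5_works = (1 - (1 - R1) * (1 - R3)) * (1 - (1 - R2) * (1 - R4))
    good_given_5_fails = 1 - (1 - R1 * R2) * (1 - R3 * R4)
    # Total probability formula over the keystone component
    return R5 * good_given_5_works + (1 - R5) * good_given_5_fails

print(bridge_reliability(0.9, 0.9, 0.9, 0.9, 0.9))   # 0.97848 for identical 0.9 components
```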
failure. If any component is removed from the set, the remaining components
collectively are no longer a cut set. The definitions of cut set and minimal cut
set are similar to those defined in Chapter 6 for fault tree analysis.
Since every minimal cut set causes the system to fail, the event that the system
breaks is the union of all minimal cut sets. Then the system reliability can be
written as
R = 1 − Pr(C1 + C2 + · · · + Cn ), (4.44)
where Ci (i = 1, 2, . . . , n) represents the event that components in minimal
cut set i are all in a failure state and n is the total number of minimal cut
sets. Equation (4.44) can be evaluated by applying the inclusion–exclusion rule,
which is
\Pr(C_1 + C_2 + \cdots + C_n) = \sum_{i=1}^{n} \Pr(C_i) - \sum_{i<j} \Pr(C_i \cdot C_j) + \sum_{i<j<k} \Pr(C_i \cdot C_j \cdot C_k) - \cdots + (-1)^{n-1} \Pr(C_1 \cdot C_2 \cdots C_n).   (4.45)
Example 4.11 Refer to Example 4.10. If the five components are identical and
have a common reliability R0 , calculate the reliability of the bridge system shown
in Figure 4.21 using the minimal cut set method.
SOLUTION The minimal cut sets of the bridge system are {1, 3}, {2, 4}, {1, 4, 5},
and {2, 3, 5}. Let Ai denote the event that component i has failed, i = 1, 2, . . . , 5.
Then the events described by the minimal cut sets can be written as C1 = A1 · A3 ,
C2 = A2 · A4 , C3 = A1 · A4 · A5 , and C4 = A2 · A3 · A5 . From (4.44) and (4.45)
and using the rules of Boolean algebra (Chapter 6), the system reliability can be
written as
R = 1 - \left[\sum_{i=1}^{4} \Pr(C_i) - \sum_{i<j} \Pr(C_i \cdot C_j) + \sum_{i<j<k} \Pr(C_i \cdot C_j \cdot C_k) - \Pr(C_1 \cdot C_2 \cdot C_3 \cdot C_4)\right]
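The inclusion–exclusion computation over the minimal cut sets can also be scripted. In the added sketch below, the probability of each intersection of cut sets is F_0 raised to the number of distinct components in the union of those cut sets; the result agrees with the decomposition method of Example 4.10.

```python
# Sketch of the minimal cut set calculation of Example 4.11 via the
# inclusion-exclusion rule (4.45), for identical components with reliability R0.
from itertools import combinations

cut_sets = [frozenset({1, 3}), frozenset({2, 4}),
            frozenset({1, 4, 5}), frozenset({2, 3, 5})]

def bridge_reliability_cut_sets(R0):
    F0 = 1 - R0
    F = 0.0
    for r in range(1, len(cut_sets) + 1):
        for combo in combinations(cut_sets, r):
            union = frozenset().union(*combo)        # components failed in all chosen cut sets
            F += (-1) ** (r + 1) * F0 ** len(union)  # Pr of intersection = F0^|union|
    return 1 - F

print(bridge_reliability_cut_sets(0.9))   # 0.97848, matching the decomposition method
```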
R = h(R1 , R2 , . . . , Rn ), (4.46)
Using (4.48) and (4.49), we can easily derive the variance of R̂ for a series–para-
llel or parallel–series system.
For a complex system, the variance of R̂ can be approximated by using a
Taylor series expansion of (4.46). Then we have
\operatorname{Var}(\hat{R}) \approx \sum_{i=1}^{n} \left(\frac{\partial R}{\partial R_i}\right)^2 \operatorname{Var}(\hat{R}_i).   (4.50)
Note that the covariance terms in the Taylor series expansion are zero since
the n components are assumed to be mutually independent. Coefficient ∂R/∂Ri
in (4.50) measures the sensitivity of the system reliability variance to the varia-
tion of individual component reliability. As we will see in the next section, the
coefficient is also Birnbaum’s component importance measure.
Substituting estimates of component reliabilities and component-reliability
variances into (4.48), (4.49), or (4.50), we can obtain an estimate of Var(R̂), de-
noted V̂ar(R̂). We often approximate the distribution of R̂ with a normal dis-
tribution. Then the two-sided 100(1 − α)% confidence interval for the system
reliability is
\hat{R} \pm z_{1-\alpha/2} \sqrt{\widehat{\operatorname{Var}}(\hat{R})},   (4.51)
where z1−α/2 is the 100(1 − α/2) percentile of the standard normal distribution.
The one-sided lower 100(1 − α)% confidence bound is
\hat{R} - z_{1-\alpha} \sqrt{\widehat{\operatorname{Var}}(\hat{R})}.   (4.52)
Note that (4.51) and (4.52) can yield a negative lower confidence bound.
To ensure that the lower confidence bound is always nonnegative, we use the
transformation
p = \ln \frac{R}{1 - R}.   (4.53)
Z_{\hat{p}} = \frac{\hat{p} - p}{\sqrt{\widehat{\operatorname{Var}}(\hat{p})}}.
The distribution of Zp̂ leads to the two-sided 100(1 − α)% confidence interval as
\left[\frac{\hat{R}}{\hat{R} + (1 - \hat{R}) w}, \; \frac{\hat{R}}{\hat{R} + (1 - \hat{R})/w}\right],   (4.54)

where

w = \exp\left[\frac{z_{1-\alpha/2} \sqrt{\widehat{\operatorname{Var}}(\hat{R})}}{\hat{R}(1 - \hat{R})}\right].
Example 4.12 Refer to Example 4.10. Suppose that the estimates of component
reliabilities and component-reliability variances at the mission time of 1000 hours
have been calculated from life test data, as shown in Table 4.1. Estimate the two-
and one-sided lower 95% confidence bound(s) on the system reliability.
TABLE 4.1  Estimates of Component Reliabilities and Variances at 1000 Hours

Component i      1        2        3        4        5
R̂_i          0.9677   0.9358   0.9762   0.8765   0.9126
V̂ar(R̂_i)     0.0245   0.0173   0.0412   0.0332   0.0141
Substituting the values of w and R̂ into (4.54), we get the two-sided 95% con-
fidence interval for the system reliability as [0.9806, 0.9957]. Now we calculate
the value of w for the one-sided 95% confidence bound by using z_{1-\alpha} = 1.645 in place of z_{1-\alpha/2}. Substituting this value of w and \hat{R} into the lower endpoint of (4.54) gives 0.9828,
which is the one-sided lower 95% confidence bound on the system reliability.
1. Partition the system into blocks where each block is comprised of compo-
nents in pure series or parallel configurations.
2. Calculate the reliability estimates and variances for series blocks using
(4.48) and for parallel blocks using (4.49).
3. Collapse each block by replacing it in the system reliability block diagram
with an equivalent hypothetic component with the reliability and variance
estimates that were calculated for it.
4. Repeat steps 1 to 3 until the system reliability block diagram is represented
by a single component. The variance for this component approximates the
variance of the original system reliability estimate.
Once the variance is calculated, we estimate the confidence intervals for system
reliability. The estimation is based on the assumption that the system reliability
(unreliability) estimate has a lognormal distribution. This assumption is reason-
able for a relatively large-scale system. For a series configuration of independent
subsystems, the system reliability is the product of the subsystem reliability val-
ues, as formulated in (4.1). Then the logarithm of the system reliability is the
sum of the logarithms of the subsystem reliabilities. According to the central
limit theorem, the logarithm of system reliability approximately follows a normal
distribution if there are enough subsystems regardless of their time-to-failure dis-
tributions. Therefore, the system reliability is lognormal. An analogous argument
can be made for a parallel system, where the system unreliability is approxi-
mately lognormal. Experimental results from simulation reported in Coit (1997)
indicate that the approximation is accurate for any system that can be partitioned
into at least eight subsystems in series or parallel.
For a series system, the estimate of system reliability has a lognormal distri-
bution with parameters µ and σ . The mean and variance are, respectively,
E(\hat{R}) = \exp\left(\mu + \tfrac{1}{2}\sigma^2\right),

\operatorname{Var}(\hat{R}) = \exp(2\mu + \sigma^2)[\exp(\sigma^2) - 1] = [E(\hat{R})]^2 [\exp(\sigma^2) - 1].
The mean value of the system reliability estimate is the true system reliability:
namely, E(R̂) = R. Hence, the variance of the log estimate of system reliability
can be written as
\sigma^2 = \ln\left[1 + \frac{\operatorname{Var}(\hat{R})}{R^2}\right].   (4.55)
where

\hat{F} = 1 - \hat{R} \quad \text{and} \quad \hat{\sigma}^2 = \ln\left[1 + \frac{\widehat{\operatorname{Var}}(\hat{R})}{\hat{F}^2}\right].
Note that Var(F̂ ) = Var(R̂). The lower and upper bounds on system reliability
equal 1 minus the upper and lower bounds on system unreliability from (4.57),
respectively. The one-sided lower 100(1 − α)% confidence bound on system reli-
ability is
1 - \hat{F} \exp\left(\tfrac{1}{2}\hat{\sigma}^2 + z_{1-\alpha} \hat{\sigma}\right).   (4.58)
Coit (1997) restricts the confidence intervals above to systems that can be
partitioned into series or parallel blocks in order to calculate the variances of block
reliability estimates from (4.48) or (4.49). If the variances for complex blocks are
computed from (4.50), the restriction may be relaxed and the confidence intervals
are applicable to any large-scale systems that consist of subsystems (containing
blocks in various configurations) in series or parallel.
SOLUTION The system can be decomposed into eight subsystems in series. Thus,
the lognormal approximation may apply to estimate the confidence bound. First the
system is partitioned into parallel and series blocks, as shown in Figure 4.25. The
estimates of block reliabilities are
The variance estimates for series blocks, A, C, and E, are calculated from (4.48).
Then we have
[Figure 4.24: reliability block diagram of the system, consisting of components 1 through 9]

Estimates of the component reliabilities and variances at 500 hours:

Component i      1       2       3       4       5       6       7       8       9
R̂_i          0.9856  0.9687  0.9355  0.9566  0.9651  0.9862  0.9421  0.9622  0.9935
V̂ar(R̂_i)     0.0372  0.0213  0.0135  0.046   0.0185  0.0378  0.0411  0.0123  0.0158
[Figure 4.25: system partitioned into series blocks A, C, E and parallel blocks B, D]
Similarly,
V̂ar(R̂C ) = 0.00344 and V̂ar(R̂E ) = 0.00038.
The variance estimates for parallel blocks, B and D, are calculated from (4.49).
Then we have
After estimating the block reliabilities and variances, we replace each block
with a hypothetical component in the system reliability block diagram. The dia-
gram is then further partitioned into a series block G and a parallel block H, as
shown in Figure 4.26. The reliability estimates for the blocks are
R̂G = R̂A R̂B R̂C R̂D = 0.8628 and R̂H = 1 − (1 − R̂9 )(1 − R̂E ) = 0.9997.
[Figures 4.26 and 4.27: block diagram reduced to a series block G (blocks A, B, C, D) and a parallel block H (block E and component 9)]
block, denoted I, as shown in Figure 4.28. The estimates of the system reliability
and variance at 500 hours equal those of block I: namely,
[Figure: configuration of the computing system consisting of computers 1 to 4]
Similarly, given that computer 2 has failed, the conditional probability that the system is functional is

\Pr(\text{system good} \mid \bar{A}) = R_3 R_4.
IB (1|t) = R2 (1 − R3 ), IB (2|t) = R1 + R3 − R1 R3 − R3 R4 ,
IB (3|t) = R2 + R4 − R1 R2 − R2 R4 , IB (4|t) = R3 (1 − R2 ).
Since the times to failure of the computers are exponential, we have R_i(t) = e^{-\lambda_i t}, i = 1, 2, 3, 4.
The reliabilities of individual computers at 4000 hours are
According to the importance measures, the priority of the computers is, in des-
cending order, computers 3, 4, 2, and 1.
Similarly, the reliabilities of individual computers at 8000 hours are
FIGURE 4.30 Birnbaum's importance measures of individual computers at different times
Comparison of the priorities at 4000 and 8000 hours shows that computer
3 is most important and computer 1 is least important, at both points of time.
Computer 4 is more important than 2 at 4000 hours; however, the order is reversed
at 8000 hours. The importance measures at different times (in hours) are plotted
in Figure 4.30. It indicates that the short-term system reliability is more sensitive
to computer 4, and computer 2 contributes more to the long-term reliability.
Therefore, the importance measures should be evaluated, and the priority be
made, at the time of interest (e.g., the design life).
Example 4.15 Refer to Example 4.14. Determine the criticality importance measures for the individual computers at 4000 and 8000 hours.
SOLUTION By using (4.61) and the results of IB (i|t) from Example 4.14, we
obtain the criticality importance measures for the four computers as
The priority order at 8000 hours is the same as that at 4000 hours. Figure 4.31
plots the criticality importance measures at different times (in hours). It is seen
that the priority order of the computers is consistent over time. In addition,
computers 2 and 3 are far more important than computers 1 and 4, because they
are considerably less reliable.
[Figure 4.31: criticality importance measures of individual computers at different times]
I_{FV}(i|t) = \frac{\Pr(C_1 + C_2 + \cdots + C_{n_i})}{F(t)},   (4.62)

where C_j is the event that the components in the minimal cut set containing
component i are all failed; j = 1, 2, . . . , ni , and ni is the total number of the
minimal cut sets containing component i; F (t) is the probability of failure of
the system at time t. In (4.62), the probability, Pr(C1 + C2 + · · · + Cni ), can be
calculated by using the inclusion–exclusion rule expressed in (4.45). If compo-
nent reliabilities are high, terms with second and higher order in (4.45) may be
omitted. As a result, (4.62) can be approximated by
I_{FV}(i|t) \approx \frac{1}{F(t)} \sum_{j=1}^{n_i} \Pr(C_j).   (4.63)
SOLUTION The minimal cut sets of the computing system are {1, 3}, {2, 4},
and {2, 3}. Let Ai denote the failure of computer i, where i = 1, 2, 3, 4. Then
we have C1 = A1 · A3 , C2 = A2 · A4 , and C3 = A2 · A3 . Since the reliabilities
of individual computers are not high, (4.63) is not applicable. The importance
measures are calculated from (4.62) as
I_{FV}(1|t) = \frac{\Pr(C_1)}{F(t)} = \frac{\Pr(A_1 \cdot A_3)}{F(t)} = \frac{F_1 F_3}{F},

I_{FV}(2|t) = \frac{\Pr(C_2 + C_3)}{F(t)} = \frac{\Pr(A_2 \cdot A_4) + \Pr(A_2 \cdot A_3) - \Pr(A_2 \cdot A_3 \cdot A_4)}{F(t)} = \frac{F_2 (F_4 + F_3 - F_3 F_4)}{F},

I_{FV}(3|t) = \frac{\Pr(C_1 + C_3)}{F(t)} = \frac{F_3 (F_1 + F_2 - F_1 F_2)}{F},

I_{FV}(4|t) = \frac{\Pr(C_2)}{F(t)} = \frac{F_2 F_4}{F},

where F_i = 1 - R_i is the unreliability of computer i and F = F(t).
The priority order at 8000 hours is the same as that at 4000 hours. Figure 4.32
plots the importance measures of the four individual computers at different times
(in hours). It is seen that computers 2 and 3 are considerably more important
than the other two at different points of time, and the relative importance order
does not change with time.
Examples 4.14 through 4.16 illustrate application of the three importance mea-
sures to the same problem. We have seen that the measures of criticality impor-
tance and Fessell–Vesely’s importance yield the same priority order, which does
not vary with time. The two measures are similar and should be used if we
are concerned with the probability of the components being the cause of sys-
tem failure. The magnitude of these measures increases with the unreliability
of component (Meng, 1996), and thus a component of low reliability receives a
[Figure 4.32: Fussell–Vesely importance measures of individual computers at different times]
high importance rating. The two measures are especially appropriate for systems
that have a wide range of component reliabilities. In contrast, Birnbaum’s mea-
sure of importance yields a different priority order in the example, which varies
with time. The inconsistency at different times imposes difficulty in selecting the
weakest components for improvement if more than one point of time is of interest.
In addition, the measure does not depend on the unreliability of the component
in question (Meng, 2000). Unlike the other two, Birnbaum’s importance does
not put more weight on less reliable components. Nevertheless, it is a valuable
measure for identifying the fastest path to improving system reliability. When
using this measure, keep in mind that improving the candidate components may not be economically or technically feasible if their reliabilities are already high.
To maximize the benefits, it is suggested that Birnbaum’s importance be used at
the same time with one of the other two. In the examples, if resources allow only
two computers for improvement, concurrent use of the measures would identify
computers 3 and 2 as the candidates, because both have large effects on system
reliability and at the same time have a high likelihood of causing the system to
fail. Although Birnbaum’s measure suggests that computer 4 is the second-most
important at 4000 hours, it is not selected because the other two measures indi-
cate that it is far less important than computer 2. Clearly, the resulting priority
order from the three measures is computers 3, 2, 4, and 1.
The equal allocation technique treats equally all criteria including those described
in Section 4.11.1 for all components within a system and assigns a common
reliability target to all components to achieve the overall system reliability target.
Although naive, this method is the simplest one and is especially useful in the
early design phase when no detailed information is available. For a series system,
the system reliability is the product of the reliabilities of individual components.
where λ∗i and λ∗ are the maximum allowable failure rates of component i and
system, respectively. Then the maximum allowable failure rate of a component is
\lambda_i^* = \frac{\lambda^*}{n}, \qquad i = 1, 2, \ldots, n.   (4.68)
Example 4.17 An automobile consists of a body, a powertrain, an electrical
subsystem, and a chassis connected in series, as shown in Figure 4.2. The lifetimes of all subsystems are exponentially distributed, and the subsystems are equally important. If
the vehicle reliability target at 36 months in service is 0.98, determine the reli-
ability requirement at this time and the maximum allowable failure rate of each
subsystem.
SOLUTION The maximum allowable failure rate of the vehicle in accordance with the overall reliability target is

\lambda^* = -\frac{\ln[R^*(36)]}{36} = -\frac{\ln(0.98)}{36} = 5.612 \times 10^{-4} \text{ failures per month.}

From (4.68), the maximum allowable failure rate of each subsystem is

\lambda_i^* = \frac{5.612 \times 10^{-4}}{4} = 1.403 \times 10^{-4} \text{ failures per month}, \qquad i = 1, 2, 3, 4,

and the corresponding reliability requirement at 36 months is R_i^*(36) = \exp(-1.403 \times 10^{-4} \times 36) = 0.995.
allocation becomes the task of choosing the failure rates of individual components
λ∗i such that (4.67) is satisfied. The determination of λ∗i takes into account the
likelihood of component failure (one of the criteria described earlier) by using
the following weighting factors:
w_i = \frac{\lambda_i}{\sum_{i=1}^{n} \lambda_i}, \qquad i = 1, 2, \ldots, n,   (4.69)
where λi is the failure rate of component i obtained from historical data or
prediction. The factors reflect the relative likelihood of failure. The larger the
value of wi , the more likely the component is to fail. Thus, the failure rate target
allocated to a component should be proportional to the value of the weight:
namely,
\lambda_i^* = w_i \lambda_0, \qquad i = 1, 2, \ldots, n,   (4.70)

where \lambda_0 is a constant. Because \sum_{i=1}^{n} w_i = 1, if the equality holds in (4.67), inserting (4.70) into (4.67) yields \lambda_0 = \lambda^*. Therefore, (4.70) can be written as

\lambda_i^* = w_i \lambda^*, \qquad i = 1, 2, \ldots, n.   (4.71)
This gives the maximum allowable failure rate of a component. The correspond-
ing reliability target is readily calculated as
Example 4.18 Refer to Example 4.17. The warranty data for similar subsystems
of an earlier model year have generated the failure rate estimates of the body,
powertrain, electrical subsystem, and chassis as λ1 = 1.5 × 10−5 , λ2 = 1.8 ×
10−4 , λ3 = 2.3 × 10−5 , and λ4 = 5.6 × 10−5 failures per month, respectively.
Determine the reliability requirement at 36 months in service and the maximum
allowable failure rate of each subsystem in order to achieve the overall reliability
target of 0.98.
SOLUTION From (4.69), the weighting factors are

w_1 = \frac{1.5 \times 10^{-5}}{1.5 \times 10^{-5} + 1.8 \times 10^{-4} + 2.3 \times 10^{-5} + 5.6 \times 10^{-5}} = \frac{1.5 \times 10^{-5}}{27.4 \times 10^{-5}} = 0.0547,

w_2 = \frac{1.8 \times 10^{-4}}{27.4 \times 10^{-5}} = 0.6569, \qquad w_3 = \frac{2.3 \times 10^{-5}}{27.4 \times 10^{-5}} = 0.0839, \qquad w_4 = \frac{5.6 \times 10^{-5}}{27.4 \times 10^{-5}} = 0.2044.
Substituting the values of λ∗ and the weighting factors into (4.71) gives the
maximum allowable failure rates of the four subsystems. Then we have
R1∗ (36) × R2∗ (36) × R3∗ (36) × R4∗ (36) = 0.9989 × 0.9868 × 0.9983 × 0.9959
= 0.98.
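The added sketch below reproduces the Example 4.18 allocation from (4.69) and (4.71).

```python
# Sketch of Example 4.18: failure-rate-weighted allocation via (4.69)-(4.71).
from math import exp, log

R_target, t = 0.98, 36                                # vehicle target at 36 months
lam_hist = [1.5e-5, 1.8e-4, 2.3e-5, 5.6e-5]           # historical failure rates (per month)

lam_sys_max = -log(R_target) / t                      # maximum allowable vehicle failure rate
weights = [l / sum(lam_hist) for l in lam_hist]       # equation (4.69)
lam_alloc = [w * lam_sys_max for w in weights]        # equation (4.71)
R_alloc = [exp(-l * t) for l in lam_alloc]            # subsystem reliability targets

print([round(w, 4) for w in weights])                 # 0.0547, 0.6569, 0.0839, 0.2044
print([round(R, 4) for R in R_alloc])                 # 0.9989, 0.9868, 0.9983, 0.9959
prod = 1.0
for R in R_alloc:
    prod *= R
print(round(prod, 4))                                 # 0.98, the overall target
```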
where R ∗ (t) is the system reliability target at time t, Ri∗ (ti ) the reliability target
allocated for subsystem i at time ti (ti ≤ t), wi the importance of subsystem i,
and n the number of subsystems. It can be seen that the allocation method allows
the mission time of a subsystem to be less than that of the system.
Since the times to failure of subsystems are distributed exponentially and
we have the approximation exp(−x) ≈ 1 − x for a very small x, (4.72) can be
written as
\sum_{i=1}^{n} \lambda_i^* w_i t_i = -\ln[R^*(t)],
where λ∗i is the failure rate allocated to subsystem i. Taking the complexity into
account, λ∗i can be written as
\lambda_i^* = -\frac{m_i \ln[R^*(t)]}{m w_i t_i}, \qquad i = 1, 2, \ldots, n,   (4.73)
where m_i is the number of modules in subsystem i, m is the total number of modules in the system and equals \sum_{i=1}^{n} m_i, and w_i is the importance of subsystem i.
Considering the approximations exp(−x)≈1 − x for small x and ln(y) ≈ y−1
for y close to 1, the reliability target allocated to subsystem i can be written as
R_i^*(t_i) = 1 - \frac{1 - [R^*(t)]^{m_i/m}}{w_i}.   (4.74)
If w_i is equal or close to 1, (4.74) simplifies to

R_i^*(t_i) = [R^*(t)]^{m_i/m}.   (4.75)
It can be seen that (4.73) and (4.74) would result in a very low reliability
target for a subsystem of little importance. A very small value of wi distortedly
outweighs the effect of complexity and leads to an unreasonable allocation. The
method works well only when the importance of each subsystem is close to 1.
TABLE 4.3  Data for the Subsystems

Subsystem i   Subsystem     Number of Modules m_i   Importance w_i   Operating Time t_i (h)
1             Sensing       12                      1                12
2             Diagnosis     38                      1                12
3             Indication     6                      0.85              6
diagnosis subsystems are essential for the system to fulfill the intended functions.
Failure of the indication subsystem causes the system to fail at an estimated
probability of 0.85. In the case of indicator failure, it is possible that a component
failure is detected by the driver due to poor drivability. Determine the reliability
targets for the subsystems in order to satisfy the system reliability target of 0.99
in a driving cycle of 12 hours. Table 4.3 gives the data necessary to solve the
problem.
SOLUTION From Table 4.3, the total number of modules in the system is m =
12 + 38 + 6 = 56. Substituting the given data into (4.73) yields the maximum
allowable failure rates (in failures per hour) of the three subsystems as
\lambda_1^* = -\frac{12 \times \ln(0.99)}{56 \times 1 \times 12} = 1.795 \times 10^{-4}, \qquad \lambda_2^* = -\frac{38 \times \ln(0.99)}{56 \times 1 \times 12} = 5.683 \times 10^{-4},

\lambda_3^* = -\frac{6 \times \ln(0.99)}{56 \times 0.85 \times 6} = 2.111 \times 10^{-4}.
From (4.74), the corresponding reliability targets are
R_1^*(12) = 1 - \frac{1 - (0.99)^{12/56}}{1} = 0.9978, \qquad R_2^*(12) = 1 - \frac{1 - (0.99)^{38/56}}{1} = 0.9932,

R_3^*(6) = 1 - \frac{1 - (0.99)^{6/56}}{0.85} = 0.9987.
0.85
Now we substitute the allocated reliabilities into (4.72) to check the system reli-
ability, which is
[1 − 1 × (1 − 0.9978)] × [1 − 1 × (1 − 0.9932)] × [1 − 0.85 × (1 − 0.9987)]
= 0.9899.
This approximately equals the system reliability target of 0.99.
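The added sketch below reproduces this allocation from (4.73) and (4.74) and repeats the system-level check.

```python
# Sketch of the allocation in the example above, using (4.73) and (4.74).
from math import log

R_sys = 0.99
modules    = [12, 38, 6]          # m_i
importance = [1.0, 1.0, 0.85]     # w_i
op_time    = [12, 12, 6]          # t_i (hours per driving cycle)
m_total = sum(modules)

lam_alloc = [-(m_i * log(R_sys)) / (m_total * w_i * t_i)            # equation (4.73)
             for m_i, w_i, t_i in zip(modules, importance, op_time)]
R_alloc = [1 - (1 - R_sys ** (m_i / m_total)) / w_i                 # equation (4.74)
           for m_i, w_i in zip(modules, importance)]

print([f"{l:.3e}" for l in lam_alloc])   # about 1.795e-4, 5.683e-4, 2.111e-4 per hour
print([round(R, 4) for R in R_alloc])    # about 0.9978, 0.9932, 0.9987

# Check the system reliability implied by the allocation, as in (4.72)
R_check = 1.0
for w_i, R_i in zip(importance, R_alloc):
    R_check *= 1 - w_i * (1 - R_i)
print(round(R_check, 4))                 # essentially the 0.99 target
```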
R_i^* = \prod_{j=J_0+1}^{J_1} p(x_j), \qquad i = 1, 2, \ldots, n,   (4.79)

where J_1 = \sum_{j=1}^{i} n_j, J_0 = \sum_{j=1}^{i-1} n_j for i \ge 2, and J_0 = 0 for i = 1.
Equation (4.79) gives the minimum reliability requirement that is correlated to
the minimum customer satisfaction. It is worth noting that the reliability target is a
function of time because the performance characteristics change with time. Thus,
we should specify the time of particular interest at which minimum reliability
must be achieved.
R_1^* = \prod_{i=1}^{2} p(x_i) = 0.9644 \times 0.983 = 0.948, \qquad R_2^* = \prod_{i=3}^{4} p(x_i) = 0.9355, \qquad R_3^* = \prod_{i=5}^{6} p(x_i) = 0.9663.
As a check, the minimum system reliability is R1∗ × R2∗ × R3∗ = 0.948 × 0.9355
× 0.9663 = 0.857. This is equal to the product reliability target R ∗ . It should
be pointed out that meeting the subsystem reliability targets does not guarantee
all customer satisfactions. To ensure all customer satisfactions, p(xj ) for each
subsystem characteristic must not be less than the assigned values.
with savings in engineering design, verification, and production costs. The sav-
ings subtracted from the investment cost yield the net cost, which is the quantity of concern in reliability allocation. In general, the cost is a nondecreasing function of required
reliability. The more stringent the reliability target, the higher the cost. As the
reliability required approaches 1, the cost incurred by meeting the target increases
rapidly. Cost behaviors vary from subsystem to subsystem. In other words, the
cost required to attain the same increment in reliability is dependent on sub-
system. As such, it is economically beneficial to assign higher-reliability goals
to the subsystems that demand lower costs to meet the targets. The discussion
above indicates that reliability allocation heavily influences cost. A good allo-
cation method should achieve the overall reliability requirement and low cost
simultaneously.
Let Ci (Ri ) denote the cost of subsystem i with reliability Ri . The cost of the
entire system is the total of all subsystem costs: namely,
C = \sum_{i=1}^{n} C_i(R_i),   (4.80)
where C is the cost of the entire system and n is the number of subsystems. In the
literature, various models for Ci (Ri ) have been proposed. Examples include Misra
(1992), Aggarwal (1993), Mettas (2000), Kuo et al. (2001), and Kuo and Zuo
(2002). In practice, it is important to develop or select the cost functions that
are suitable for the specific subsystems. Unfortunately, modeling a cost function
is an arduous task because it is difficult, if not impossible, to estimate the costs
associated with attaining different reliability levels of a subsystem. The modeling
process is further complicated by the fact that subsystems within a system often
have different cost models. Given the constraints, we often employ a reasonable
approximation to a cost function.
If the cost function Ci (Ri ) for subsystem i (i = 1, 2, . . . , n) is available, the
task of reliability allocation may be transformed into an optimization problem.
In some applications, cost is a critical criterion in reliability allocation. Then the
reliability targets for subsystems should be optimized by minimizing the cost,
while the constraint on overall system reliability is satisfied. This optimization
problem can be formulated as
\min \sum_{i=1}^{n} C_i(R_i^*),   (4.81)
Let Ei (Ri , Ri∗ ) denote the effort function, which describes the dollar amount of
effort required to increase the reliability of subsystem i from the current reliability
level Ri to a higher reliability level Ri∗ . The larger the difference between Ri and
Ri∗ , the more the effort, and vice versa. Thus, the effort function is nonincreasing
in Ri for a fixed value of Ri∗ and nondecreasing in Ri∗ for a fixed value of Ri .
Using the effort function, (4.81) can be written as
\min \sum_{i=1}^{n} E_i(R_i, R_i^*),   (4.82)
[Figure 4.33: system consisting of component 1 in series with the parallel combination of components 2 and 3, followed by component 4]
subject to R_1^* R_4^* (R_2^* + R_3^* - R_2^* R_3^*) \ge 0.92. The optimization model can be solved easily using a numerical algorithm. Here we employ Newton's method and the Solver in Microsoft Excel, and obtain R_1^* = 0.9752, R_2^* = 0.9392, R_3^* = 0.9167, and R_4^* = 0.9482. The costs (in dollars) associated with the subsystems are C_1(R_1^*) = 58.81, C_2(R_2^*) = 18.31, C_3(R_3^*) = 21.42, and C_4(R_4^*) = 83.94. The minimum total cost of the system is C = \sum_{i=1}^{4} C_i(R_i^*) = $182.48.
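Because the example's cost functions are not reproduced here, the added sketch below only illustrates the formulation of (4.81) as a constrained optimization. The cost functions, coefficients, starting point, and bounds in the code are hypothetical placeholders, so its output differs from the targets quoted above; scipy's SLSQP solver plays the role of the Excel Solver.

```python
# A sketch of cost-based allocation as a constrained optimization, in the
# spirit of (4.81). The cost functions below are hypothetical placeholders,
# not the cost models used in the example above.
from scipy.optimize import minimize

# Hypothetical subsystem cost functions: cost grows rapidly as R -> 1.
a = [10.0, 4.0, 5.0, 15.0]
def total_cost(R):
    return sum(ai / (1.0 - Ri) for ai, Ri in zip(a, R))

# System structure of Figure 4.33: 1 in series with (2 parallel 3) and 4.
def system_reliability(R):
    R1, R2, R3, R4 = R
    return R1 * R4 * (R2 + R3 - R2 * R3)

constraint = {"type": "ineq", "fun": lambda R: system_reliability(R) - 0.92}
result = minimize(total_cost, x0=[0.96] * 4, bounds=[(0.90, 0.999)] * 4,
                  constraints=[constraint], method="SLSQP")

print(result.x, system_reliability(result.x), total_cost(result.x))
```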
PROBLEMS
4.1 An automotive V6 engine consists of six identical cylinders. For the engine
to perform its intended functions, all six cylinders must be operationally
successful. Suppose that the mileage to failure of a cylinder can be modeled
with the Weibull distribution with shape parameter 1.5 and characteristic
life 3.5 × 106 miles. Calculate the reliability and failure rate of the engine
at 36,000 miles.
4.2 A special sprinkler system is comprised of three identical humidity sensors,
a digital controller, and a pump, of which the reliabilities are 0.916, 0.965,
and 0.983, respectively. The system configuration is shown in Figure 4.34.
Calculate the reliability of the sprinkler system.
[Figure 4.34: Configuration of the sprinkler system for Problem 4.2 (three sensors, a controller, and a pump).]
[Figure 4.35: System configuration for Problem 4.3 (components 1 through 8).]
4.3 Calculate the reliability of the system in Figure 4.35, where component i
has reliability Ri (i = 1, 2, . . . , 8).
4.4 A power-generating plant is installed with five identical generators running
simultaneously. For the plant to generate sufficient power for the end users,
at least three of the five generators must operate successfully. If the time
to failure of a generator can be modeled with the exponential distribution
with λ = 3.7 × 10−5 failures per hour, calculate the reliability of the plant
at 8760 hours.
4.5 A critical building has three power sources from separate stations. Nor-
mally, one source provides the power and the other two are in standby.
Whenever the active source fails, a power supply grid switches to a standby
source immediately. Suppose that the three sources are distributed identi-
cally and exponentially with λ = 1.8 × 10−5 failures per hour. Calculate
the reliability of the power system at 3500 hours for the following cases:
(a) The switching system is perfect and thus never fails.
(b) The switching system is subject to failure according to the exponential
distribution with a failure rate of 8.6 × 10−6 failures per hour.
4.6 A computing system consists of five individual computers, as shown in
Figure 4.36. Ri (i = 1, 2, . . . , 5) is the reliability of computer i at a given
time. Compute the system reliability at the time.
4.7 Refer to Problem 4.6. Suppose that the computer manufacturers have pro-
vided estimates of reliability and variance of each computer at 10,000 hours,
as shown in Table 4.5. Calculate the one-sided lower 95% confidence bound
on the system reliability at 10,000 hours.
4.8 Calculate the Birnbaum, criticality, and Fussell–Vesely measures of impor-
tance for the computing system in Problem 4.7. What observation can you
make from the values of the three importance measures?
[Figure 4.36: Configuration of the computing system for Problem 4.6 (computers 1 through 5).]
TABLE 4.5  Reliability Estimates of the Computers at 10,000 Hours

Computer i       1        2        3        4        5
R̂_i            0.995    0.983    0.988    0.979    0.953
V̂ar(R̂_i)      0.0172   0.0225   0.0378   0.0432   0.0161
1 33 1 500
2 18 1 500
3 26 0.93 405
4 52 1 500
[Figure: diagram for a chapter problem showing elements E1–E4, responses Y1–Y4, and variables x1–x6.]
5.1 INTRODUCTION
development and presents three winning cases in particular. Menon et al. (2002)
delineate a case study on the robust design of a spindle motor. Tu et al. (2006)
document the robust design of a manufacturing process. Chen (2001) describes the
robust design of a very large-scale integration (VLSI) process and device. Taguchi
(2000) and Taguchi et al. (2005) present a large number of successful projects
conducted in a wide spectrum of companies.
Numerous publications, most of which come with case studies, demonstrate
that robust design is also an effective methodology for improving reliability. K.
Yang and Yang (1998) propose a design and test method for achieving robust
reliability by making products and processes insensitive to the environmental
stresses. The approach is illustrated with a case study on the reliability
improvement of the integrated-circuit interconnections. Chiao and Hamada
(2001) describe a method for analyzing degradation data from robust design
experiments. A case study on reliability enhancement of light-emitting diodes is
presented. Tseng et al. (1995) report on increasing the reliability of fluorescent
lamps using degradation data. C. F. Wu and Hamada (2000) present the design of
experiments and dedicate a chapter to methods of reliability improvement through
robust parameter design. Condra (2001) includes Taguchi’s method and basic
reliability knowledge in one book and discusses several case studies. Phadke and
Smith (2004) apply the robust design method to increase the reliability of engine
control software.
In this chapter we describe the concepts of reliability and robustness and dis-
cuss their relationships. The robust design methods and processes for improving
reliability are presented and illustrated with several industrial examples. Some
advanced topics on robust design are described at the end of the chapter; these
materials are intended for readers who want to pursue further study.
[Figure 5.1: Robust product sensitive to aging — performance y versus time t under use conditions S1 and S2, with failure threshold G.]
[Figure 5.2: Reliable product sensitive to use conditions — y versus t under S1 and S2.]
[Figure 5.3: Product insensitive to aging and use conditions — y versus t under S1 and S2, with threshold G.]
where L(y) is the quality loss, y a quality characteristic, my the target value of
y, and K the quality loss coefficient. Quality characteristics can be categorized
into three types: (1) nominal-the-best, (2) smaller-the-better, and (3) larger-the-
better.
[Figure 5.4: Quadratic quality loss for a nominal-the-best characteristic — L(y) and the pdf f(y) versus y, with target m_y, tolerance limits m_y ± Δ0, and loss L0 at the limits.]
loss of this type of characteristics. The quality loss is illustrated in Figure 5.4. If
the quality loss is L0 when the tolerance is just breached, (5.1) can be written as
$$L(y) = \frac{L_0}{\Delta_0^2}(y - m_y)^2. \qquad (5.2)$$
The quality characteristic y is a random variable due to the unit-to-unit varia-
tion and can be modeled with a probabilistic distribution. The probability density
function (pdf) of y, denoted f (y), is depicted in Figure 5.4. If y has a mean µy
and standard deviation σy, the expected quality loss is

$$E[L(y)] = K\left[\sigma_y^2 + (\mu_y - m_y)^2\right]. \qquad (5.3)$$

Equation (5.3) indicates that to achieve the minimum expected quality loss, we
have to minimize the variance of y and set the mean µy to the target my .
The quality loss function is depicted in Figure 5.5. If the quality loss is L0 when
y just breaches the upper limit Δ0, the quality loss at y can be written as

$$L(y) = \frac{L_0}{\Delta_0^2}\, y^2. \qquad (5.5)$$
The expected quality loss is

$$E[L(y)] = K\left(\mu_y^2 + \sigma_y^2\right). \qquad (5.6)$$
[Figure 5.5: Quadratic quality loss for a smaller-the-better characteristic — L(y) and f(y) versus y, with loss L0 at the limit Δ0.]
[Figure 5.6: Quadratic quality loss for a larger-the-better characteristic — L(y) and f(y) versus y, with loss L0 at the limit Δ0.]
Equation (5.9) indicates that increasing the mean or decreasing the variance
reduces the quality loss.
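The expected-loss expressions are simple enough to evaluate directly. The Python sketch below does so with invented values of L0, Δ0, the mean, and the standard deviation, and it assumes the conventional quality loss coefficients K = L0/Δ0² for the nominal-the-best and smaller-the-better cases and K = L0Δ0² for the larger-the-better case.

```python
# Expected quadratic quality loss for the three characteristic types
# (all numerical inputs are illustrative).

def loss_nominal(L0, delta0, mu, sigma, target):
    """Nominal-the-best: E[L] = K*(sigma^2 + (mu - target)^2), K = L0/delta0^2."""
    return (L0 / delta0**2) * (sigma**2 + (mu - target)**2)

def loss_smaller(L0, delta0, mu, sigma):
    """Smaller-the-better: E[L] = K*(mu^2 + sigma^2), K = L0/delta0^2."""
    return (L0 / delta0**2) * (mu**2 + sigma**2)

def loss_larger(L0, delta0, mu, sigma):
    """Larger-the-better: E[L] ~ (K/mu^2)*(1 + 3*sigma^2/mu^2), K = L0*delta0^2."""
    return (L0 * delta0**2) / mu**2 * (1.0 + 3.0 * sigma**2 / mu**2)

print(loss_nominal(L0=20.0, delta0=0.5, mu=10.1, sigma=0.2, target=10.0))
print(loss_smaller(L0=20.0, delta0=0.5, mu=0.2, sigma=0.05))
print(loss_larger(L0=20.0, delta0=5.0, mu=12.0, sigma=1.0))
```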
functions presented in Section 5.3.1 are applicable to these two types of failure
modes.
Hard Failure For a hard-failure product, there is generally no indication of per-
formance degradation before the failure occurs. Customers perceive the product
life as the key quality characteristic. Obviously, life is a larger-the-better char-
acteristic. The quality loss function for life can be modeled by (5.7). Then (5.8)
is used to calculate the quality loss, where Δ0 is often the design life and L0
is the loss due to the product failure at the design life. Here the design life is
deemed as the required life span because customers usually expect a product to
work without failure during its design life. L0 may be determined by the life
cycle cost. Dhillon (1999) describes methods for calculation of the cost.
The expected quality loss for life is described by (5.9). To minimize or reduce
the loss due to failure, we have to increase the mean life and reduce the life
variation. In robust reliability design, this can be accomplished by selecting the
optimal levels of the design parameters.
Soft Failure For a soft-failure product, failure is defined in terms of a per-
formance characteristic crossing a prespecified threshold. Such a performance
characteristic usually belongs to the smaller-the-better or larger-the-better type.
Few are nominal-the-best type. Regardless of the type, the performance charac-
teristic that defines failure is the quality characteristic that incurs the quality loss.
The characteristic is often the one that most concerns customers.
The quality loss functions presented in Section 5.3.1 describe the initial loss
due to the spreading performance characteristic caused by material and process
variations. After the product is placed in service, the performance characteristic
degrades over time. As a result, the reliability decreases and the quality loss
increases with time. It is clear that the quality loss is nondecreasing in time.
Taking the time effect into account, the expected loss for a smaller-the-better
characteristic can be written as
$$E\{L[y(t)]\} = K[\mu_y^2(t) + \sigma_y^2(t)]. \qquad (5.10)$$
Similarly, the expected loss for a larger-the-better characteristic is
$$E\{L[y(t)]\} \approx \frac{K}{\mu_y^2(t)}\left[1 + \frac{3\sigma_y^2(t)}{\mu_y^2(t)}\right]. \qquad (5.11)$$
[Figure 5.7: Relation of performance degradation and probability of failure — degradation paths y(t) with distributions f[y(t1)], f[y(t2)], f[y(t3)] approaching the threshold Δ0.]
[Figure 5.8: Quadratic quality loss as a function of time — L[y(t)] with f[y(t1)], f[y(t2)], f[y(t3)] relative to the limit Δ0.]
As (5.10) and (5.11) indicate, the quality loss may be minimized by alleviating the performance
degradation, which can be accomplished by robust design (to be discussed later).
In Section 5.3 we related reliability degradation to quality loss and concluded that
reliability can be improved through robust design. In this section we describe the
process of robust design.
the axiomatic design and TRIZ (an innovative problem-solving method). Dieter
(2000), Suh (2001), Rantanen and Domb (2002), and K. Yang and El-Haik
(2003) describe the system design in detail. Nevertheless, there are few systematic
approaches in the literature, largely because system design is skill intensive.
Parameter design aims at minimizing the sensitivity of the performance of a
product or process to noise factors by setting its design parameters at the optimal
levels. In this step, designed experiments are usually conducted to investigate the
relationships between the design parameters and performance characteristics of
the product or process. Using such relationships, one can determine the optimal
setting of the design parameters. In this book, parameter design is synonymous
with robust design in a narrow sense; in a broad sense, the former is a subset of
the latter.
Tolerance design chooses the tolerances of important components to reduce
the sensitivity of performance to noise factors under cost constraints. Tolerance
design may be conducted after the parameter design is completed. If the parame-
ter design cannot achieve sufficient robustness, tolerance design is necessary. In
this step, the important components whose variability has the largest effects on
the product sensitivity are identified through experimentation. Then the tolerance
of these components is tightened by using higher-grade components based on the
trade-off between the increased cost and the reduction in performance variabil-
ity. Jeang (1995), P. Ross (1996), Creveling (1997), C. C. Wu and Tang (1998),
C. Lee (2000), and Vlahinos (2002) describe the theory and application of the
tolerance design.
response, (b) determine the optimal setting of the significant control fac-
tors, and (c) predict the response under the optimal setting. Graphical
response analysis or analysis of variance (ANOVA) is usually performed
in this step.
9. Run a confirmation test. The optimality of the control factor levels is
confirmed by running a test on the samples at the optimal setting.
10. Recommend actions. The optimal setting should be implemented in design
and production. To sustain the improvement, follow-up actions such as
executing a statistical process control are recommended. Montgomery
(2001a) and Stamatis (2004), for example, describe quality control appro-
aches, including statistical process control.
[Figure: sources of noise — customer (use frequency, load, etc.) and manufacturing (process variation, manufacturability).]
5.6 P-DIAGRAM
[Figure: P-diagram structure — noise factors and control factors act on the system, which produces intended functions and failure modes.]
factor is inevitable, but can be reduced through tolerance design and process
control. In the braking system example, variation in the thickness of drums
and pads is this type of noise factor.
In the automotive industry, noise factors are further detailed. For example, at
Ford Motor Company, noise factors are divided into five categories: (1) unit-to-
unit variation; (2) change in dimension, performance, or strength over time or
mileage; (3) customer usage and duty cycle; (4) external environment, including
climate and road conditions; and (5) internal environment created by stresses from
neighboring components. Although the last three types belong to the external
noise described above, this further itemization is instrumental in brainstorming
all relevant noise factors. Strutt and Hall (2003) describe the five types of noise
factors in greater detail.
Control factors are the design parameters whose levels are specified by design-
ers. The purpose of a robust design is to choose optimal levels of the parameters.
In practice, a system may have a large number of design parameters, which are
not of the same importance in terms of the contribution to robustness. Often, only
the key parameters are included in a robust design. These factors are identified
by using engineering judgment, analytical study, a preliminary test, or historical
data analysis.
Intended functions are the functionalities that a system is intended to perform.
The functions depend on signals, noise factors, and control factors. Noise factors
and control factors influence both the average value and variability of functional
responses, whereas the signals determine the average value and do not affect the
variability.
Failure modes represent the manner in which a system fails to perform its
intended functions. As explained earlier, failure modes can be classified into two
types: hard failure and soft failure. In the braking system example, excessive
stopping distance is a soft failure, whereas complete loss of hydraulic power in
the braking system is a hard failure.
The creation of a P-diagram leads to the identification of all noise factors that
disturb the functional responses and generate failure modes. The noise factors
and failure modes have cause-and-effect relations, which usually are complicated.
One noise factor may result in more than one failure mode; one failure mode
may be the consequence of several noise factors. For example, in Figure 5.11,
the failure mode "fail to detect a failure (beta error)" is caused by multiple noise
factors, including sensor transfer function variation, wire connector corrosion,
supply voltage variation, and others. On the other hand, the variation in sensor
transfer function can cause both alpha and beta errors. Since a product usually
involves numerous noise factors, it is important to identify the critical factors that
cause the most troublesome failure modes, those with high risk priority numbers.

[Figure 5.11: Example P-diagram content. Noise factors — unit-to-unit variation: sensor transfer function variation, gasket orifice size variation, spring load variation, tube insertion depth variation, etc.; wear out/degradation (internal noise): gasket bead compression, orifice erosion, wire connector corrosion, sensor degradation, regulator spring drift, etc.; customer usage (external noise): extreme driving conditions, etc.; external environment (external noise): temperature, barometer, humidity, etc. Failure modes: fail to detect a failure (beta error), etc. Control factors: calibration parameters, etc.]
The effects of critical noise factors must be addressed in robust design. System
design, parameter design, and tolerance design are the fundamental methods for
reducing noise effects. In system design, the following common techniques are
often employed to eliminate or mitigate the adverse effects:
In earlier sections we defined the scope of robust design, identified the critical
control factors and noise factors and their levels, and determined the key quality
characteristic. The next step in robust design is to design the experiment.
Design of experiment is a statistical technique for studying the effects of
multiple factors on the experimental response simultaneously and economically.
The factors are laid out in a structured array in which each row represents a
combination of levels of factors. Then experiments with each combination are
conducted and response data are collected. Through experimental data analysis,
we can choose the optimal levels of control factors that minimize the sensitivity
of the response to noise factors.
Various types of structured arrays or experimental designs, such as full factorial
designs and a variety of fractional factorial designs are described in the literature
(e.g., C. F. Wu and Hamada, 2000; Montgomery, 2001b). In a full factorial
design, the number of runs equals the number of levels to the power of the
number of factors. For example, a two-level full factorial design with eight factors
requires 2^8 = 256 runs. If the number of factors is large, the experiments will be
unaffordable in terms of time and cost. In these situations, a fractional factorial
design is often employed. A fractional factorial design is a subset of a full
factorial design, chosen according to certain criteria. The commonly used classical
fractional factorial designs are the 2^(k−p) and 3^(k−p) designs, where 2 (or 3) is the number of levels, k
the number of factors, and 2^(−p) (or 3^(−p)) the fraction. For example, a two-level
half-fraction factorial with eight factors needs only 2^(8−1) = 128 runs.
The classical fractional factorial designs require that all factors have an equal
number of levels. For example, a 2k−p design can accommodate only two-level
factors. In practice, however, some factors are frequently required to take a differ-
ent number of levels. In such situations, the classical fractional factorial designs
are unable to meet the demand. A more flexible design is that of orthogonal
arrays, which have been used widely in robust design. As will be shown later,
the classical fractional factorial designs are special cases of orthogonal arrays. In
this section we present experimental design using orthogonal arrays.
• All possible combinations of any two columns of the matrix occur an equal
number of times within the two columns. The two columns are also said to
be orthogonal.
• Each level of a specific factor within a column has an equal number of
occurrences within the column.
For example, Table 5.1 shows the orthogonal array L8(2^7). The orthogonal
array has seven columns. Each column may accommodate one factor with two
levels, where the low and high levels are denoted by 0 and 1, respectively. From
Table 5.1 we see that any two columns, for example, columns 1 and 2, have
level combinations (0,0), (0,1), (1,0), and (1,1). Each combination occurs twice
within the two columns. Therefore, any two columns are said to be orthogonal.
In addition, levels 0 and 1 in any column repeat four times. The array contains
eight rows, each representing a run.

TABLE 5.1  Orthogonal Array L8(2^7)

Run   1  2  3  4  5  6  7
 1    0  0  0  0  0  0  0
 2    0  0  0  1  1  1  1
 3    0  1  1  0  0  1  1
 4    0  1  1  1  1  0  0
 5    1  0  1  0  1  0  1
 6    1  0  1  1  0  1  0
 7    1  1  0  0  1  1  0
 8    1  1  0  1  0  0  1

A full factorial design with seven factors and two levels each would require
2^7 = 128 runs. Thus, this orthogonal array is a 1/16 fractional factorial design.
In general, because of the reduction in run size,
an orthogonal array usually saves a considerable amount of test resources. In exchange
for the improved test efficiency, an orthogonal array may confound
main effects (factors) with interactions. To avoid or minimize such confounding,
we should identify any interactions before designing the experiment and lay out the
experiment appropriately. This is discussed further in subsequent sections.
In general, an orthogonal array is denoted by L_N(I^P × J^Q), where N is the
number of experimental runs, P the number of I-level columns, and Q the
number of J-level columns. For example, L18(2^1 × 3^7) identifies an array as
having 18 runs, one two-level column, and seven three-level columns. The most
commonly used orthogonal arrays have the same number of levels in all columns,
and then L_N(I^P × J^Q) simplifies to L_N(I^P). For instance, L8(2^7) indicates that the
orthogonal array has seven columns, each with two levels. The array requires
eight runs, as shown in Table 5.1.
Because of the orthogonality, some columns in L_N(I^P) are fundamental (independent)
columns, and all other columns are generated from two or more of the
fundamental columns. With few exceptions, a generated column is obtained by adding
the corresponding entries of the fundamental columns and reducing the sums modulo I.
Example 5.2 In L8(2^7) as shown in Table 5.1, columns 1, 2, and 4 are the fundamental
columns, all other columns being generated from these three columns.
For instance, column 3 is generated from columns 1 and 2, and the other columns follow in the same way:

column 3 = (column 1 + column 2) mod 2,
column 5 = (column 1 + column 4) mod 2,
column 6 = (column 2 + column 4) mod 2,                    (5.12)
column 7 = (column 1 + column 2 + column 4) mod 2.

The fundamental columns are mutually independent, and the generated
columns are the interaction columns. For example, in
Table 5.1, the interaction between columns 1 and 2 goes to column 3. In an
experimental layout, if factor A is assigned to column 1 and factor B to column
2, column 3 should be allocated to the interaction A × B if it exists. Assigning an
independent factor to column 3 can lead to an incorrect data analysis and faulty
conclusion because the effect of the independent factor is confounded with the
interaction effect. Such experimental design errors can be prevented by using a
linear graph.
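The column-generation rule and the orthogonality property are easy to verify numerically. The Python sketch below rebuilds L8(2^7) of Table 5.1 from its three fundamental columns by modulo-2 addition and checks that every pair of columns contains each level combination exactly twice; it is an illustration only.

```python
# Generate L8(2^7) from fundamental columns 1, 2, 4 and verify orthogonality.
from itertools import product, combinations
from collections import Counter

base = list(product([0, 1], repeat=3))          # levels of columns 1, 2, 4 over 8 runs
cols = {1: [r[0] for r in base], 2: [r[1] for r in base], 4: [r[2] for r in base]}
cols[3] = [(a + b) % 2 for a, b in zip(cols[1], cols[2])]
cols[5] = [(a + b) % 2 for a, b in zip(cols[1], cols[4])]
cols[6] = [(a + b) % 2 for a, b in zip(cols[2], cols[4])]
cols[7] = [(a + b + c) % 2 for a, b, c in zip(cols[1], cols[2], cols[4])]

for i, j in combinations(sorted(cols), 2):
    counts = Counter(zip(cols[i], cols[j]))
    # Orthogonality: each of (0,0), (0,1), (1,0), (1,1) occurs exactly twice.
    assert all(counts[c] == 2 for c in product([0, 1], repeat=2))

for run in zip(*(cols[k] for k in sorted(cols))):
    print(run)   # reproduces the rows of Table 5.1
```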
A linear graph, a pictorial representation of the interaction information, is
made up of dots and lines. Each dot indicates a column to which a factor (main
effect) can be assigned. The line connecting two dots represents the interaction
between the two factors represented by the dots at each end of the line segment.
The number assigned to a dot or a line segment indicates the column within the
array. In experimental design, a factor is assigned to a dot, and an interaction is
assigned to a line. If the interaction represented by a line is negligible, a factor
may be assigned to the line.
Figure 5.12 shows two linear graphs of L8(2^7). Figure 5.12a indicates that
columns 1, 2, 4, and 7 can be used to accommodate factors. The interaction
between columns 1 and 2 goes into column 3, the interaction between 2 and 4
goes into column 6, and the interaction between columns 1 and 4 goes into column
5. From (5.12) we can see that column 7 represents a three-way interaction among
columns 1, 2, and 4. The linear graph assumes that three-way or higher-order
interactions are negligible. Therefore, column 7 is assignable to a factor. It should
be noted that all linear graphs are based on this assumption, although it may be
questionable in some applications.
Most orthogonal arrays have two or more linear graphs. The number and
complexity of linear graphs increase with the size of orthogonal array. A multitude
of linear graphs provide great flexibility for assigning factors and interactions.
[Figure 5.12: Two linear graphs of L8(2^7). In (a), dots 1, 2, 4, and 7 accommodate factors, and lines 3, 5, and 6 represent the interactions between columns 1 and 2, 1 and 4, and 2 and 4, respectively; (b) is an alternative standard linear graph.]
The most commonly used orthogonal arrays and their linear graphs are listed in
the Appendix.
Orthogonal array L4(2^3):

Run   1  2  3
 1    0  0  0
 2    0  1  1
 3    1  0  1
 4    1  1  0

[Figure 5.13: Linear graph for L4(2^3) — dots 1 and 2 accommodate factors; line 3 carries their interaction.]
Orthogonal array L9(3^4):

Run   1  2  3  4
 1    0  0  0  0
 2    0  1  1  1
 3    0  2  2  2
 4    1  0  1  2
 5    1  1  2  0
 6    1  2  0  1
 7    2  0  2  1
 8    2  1  0  2
 9    2  2  1  0

[Figure 5.14: Linear graph for L9(3^4) — dots 1 and 2 accommodate factors; the line labeled 3, 4 carries their interaction.]
[Figure 5.15: Selection of three columns to form a new column.]
Merging columns 1, 2, and 3 of L8(2^7) into one four-level column gives the modified array below (the remaining two-level columns are 4 to 7):

Run   Columns 1, 2, 3   Combination of 1, 2   New level   4  5  6  7
 1       0  0  0               00                 0        0  0  0  0
 2       0  0  0               00                 0        1  1  1  1
 3       0  1  1               01                 1        0  0  1  1
 4       0  1  1               01                 1        1  1  0  0
 5       1  0  1               10                 2        0  1  0  1
 6       1  0  1               10                 2        1  0  1  0
 7       1  1  0               11                 3        0  1  1  0
 8       1  1  0               11                 3        1  0  0  1
[Figure 5.16: Selection of seven columns to form a new column.]
Merging columns 1 through 7 of L16(2^15) into one eight-level column gives the array below (the three digits shown are the levels of the independent columns 1, 2, and 4; the remaining two-level columns are 8 to 15):

Run   1  2  4   Combination   New level   8  9  10  11  12  13  14  15
 1    0  0  0      000            0       0  0   0   0   0   0   0   0
 2    0  0  0      000            0       1  1   1   1   1   1   1   1
 3    0  0  1      001            1       0  0   0   0   1   1   1   1
 4    0  0  1      001            1       1  1   1   1   0   0   0   0
 5    0  1  0      010            2       0  0   1   1   0   0   1   1
 6    0  1  0      010            2       1  1   0   0   1   1   0   0
 7    0  1  1      011            3       0  0   1   1   1   1   0   0
 8    0  1  1      011            3       1  1   0   0   0   0   1   1
 9    1  0  0      100            4       0  1   0   1   0   1   0   1
10    1  0  0      100            4       1  0   1   0   1   0   1   0
11    1  0  1      101            5       0  1   0   1   1   0   1   0
12    1  0  1      101            5       1  0   1   0   0   1   0   1
13    1  1  0      110            6       0  1   1   0   0   1   1   0
14    1  1  0      110            6       1  0   0   1   1   0   0   1
15    1  1  1      111            7       0  1   1   0   1   0   0   1
16    1  1  1      111            7       1  0   0   1   0   1   1   0
1. Calculate the total number of degrees of freedom needed to study the factors
(main effects) and interactions of interest. This is the degrees of freedom
required.
2. Select the smallest orthogonal array with at least as many degrees of free-
dom as required.
3. If necessary, modify the orthogonal array by merging columns or using
other techniques to accommodate the factor levels.
4. Construct a required linear graph to represent the factors and interactions.
The dots represent the factors, and the connecting lines indicate the inter-
actions between the factors represented by the dots.
5. Choose the standard linear graph that most resembles the linear graph
required.
6. Modify the required graph so that it is a subset of the standard linear
graph.
7. Assign factors and interactions to the columns according to the linear graph.
The unoccupied columns are error columns.
Example 5.4 The rear spade in an automobile can fracture at the ends of the
structure due to fatigue under road conditions. An experiment was designed
to improve the fatigue life of the structure. The fatigue life may be affected
by the setting of the design parameters as well as the manufacturing process.
The microfractures generated during forging grow while the spade is in use.
Therefore, the control factors in this study include the design and production
process parameters. The main control factors are as follows:
[Figure 5.17: (a) Required linear graph for the rear spade experiment, showing factors A, B, C, and D and the interactions B×D and C×D; (b) standard linear graph of L8(2^7).]
[Figure 5.18: Column assignment for the rear spade experiment — column 1: D; column 2: B; column 3: B×D; column 4: C; column 5: C×D; column 6: error column; column 7: A.]
TABLE 5.6  Cross Array with Inner Array L8(2^7) and Outer Array L4(2^3)

       Inner array                    Outer array
       (factors and interactions)    z1:  0  0  1  1
                                     z2:  0  1  0  1
Run    1  2  3  4  5  6  7           z3:  0  1  1  0
 1     0  0  0  0  0  0  0
 2     0  0  0  1  1  1  1
 3     0  1  1  0  0  1  1
 4     0  1  1  1  1  0  0          (experimental data)
 5     1  0  1  0  1  0  1
 6     1  0  1  1  0  1  0
 7     1  1  0  0  1  1  0
 8     1  1  0  1  0  0  1
and an outer array forms a cross array. Table 5.6 shows a cross array in which
the inner array is L8(2^7) and the outer array is L4(2^3). The outer array can
accommodate three noise factors (z1 , z2 , and z3 ).
The run size of a cross array is N × l, where N is the run size of the inner array
and l is the run size of the outer array. For the cross array in Table 5.6, the total
number of runs is 8 × 4 = 32. If more noise factors or levels are to be included in
the experiment, the size of the outer array will be larger. As a result, the total run
size will increase proportionally, and the experiment will become too expensive.
This difficulty may be resolved by using the noise-compounding strategy.
The aim of using an outer array is to integrate into the experiment the noise
conditions under which the product will operate in the field. The test units at a
setting of control factors are subjected to a noise condition. The quality char-
acteristic of a unit usually takes extreme values at extreme noise conditions. If
the quality characteristic is robust against extreme conditions, it would be robust
in any condition between extremes. Therefore, it is legitimate in an outer array
to use only extreme noise conditions: the least and most severe conditions. The
least severe condition often is a combination of the lower bounds of the ranges
of the noise factors, whereas the most severe is formed by the upper bounds.
In the context of reliability testing, experiments using the levels of noise
factors within the use range frequently generate few failures or little degradation.
Considering this, elevated levels of carefully selected noise factors are sometimes
applied to yield a shorter life or more degradation. Such noise factors must not
TABLE 5.7  Cross Array for the Rear Spade Experiment

       D  B  D×B  C  D×C  e  A       Outer array
Run    1  2   3   4   5   6  7       z11    z12
 1     0  0   0   0   0   0  0
 2     0  0   0   1   1   1  1
 3     0  1   1   0   0   1  1
 4     0  1   1   1   1   0  0      (fatigue life data)
 5     1  0   1   0   1   0  1
 6     1  0   1   1   0   1  0
 7     1  1   0   0   1   1  0
 8     1  1   0   1   0   0  1
interact with the control factors. Otherwise, the accelerated test data may lead to
a falsely optimal setting of the control factors (Section 5.14.4).
Example 5.5 Refer to Example 5.4. The noise factors for the rear spade are as
follows:
• Noise factor M: stroke frequency; M0 = 0.5 Hz, M1 = 3 Hz
• Noise factor S: stroke amplitude; S0 = 15 mm, S1 = 25 mm
There are four combinations of noise levels, but running the experiments at the
least and most severe combinations would yield sufficient information. The least
severe combination is M0 and S0 , and the most severe combination is M1 and S1 .
The two combinations are denoted by z11 and z12 . Then the outer array needs to
include only these two noise conditions. Using the linear graph in Figure 5.18,
we developed the cross array for the experiment of the rear spade as given in
Table 5.7.
product. In this section we describe methods for analyzing experimental life data.
If a product loses its function gradually, it is possible to monitor and measure a
performance characteristic during testing. The performance characteristic is the
key quality characteristic used in subsequent design optimization. In Section 5.10
we discuss experimental degradation data analysis.
$$\mu = \beta_0 + \beta_1 z_1 + \cdots + \beta_p z_p = \mathbf{z}^T \boldsymbol{\beta}. \qquad (5.13)$$

When interaction and quadratic effects are included, the model can be written as

$$\mu = \beta_0 + \mathbf{z}^T \mathbf{b} + \mathbf{z}^T \mathbf{B}\, \mathbf{z}, \qquad (5.14)$$

where

$$\mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_p \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} \beta_{11} & \beta_{12}/2 & \cdots & \beta_{1p}/2 \\ \beta_{21}/2 & \beta_{22} & \cdots & \beta_{2p}/2 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p1}/2 & \beta_{p2}/2 & \cdots & \beta_{pp} \end{bmatrix}.$$

For example, a model containing two variables and their interaction is

$$\mu = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \beta_{12} z_1 z_2, \qquad (5.15)$$
Lognormal Distribution The log likelihood for a log exact failure time y is
where Φ(·) is the cumulative distribution function (cdf) of the standard normal
distribution.
The log likelihood for an observation left censored at log time y is

$$LL = \ln\{\Phi[z(y)]\}. \qquad (5.18)$$
Weibull Distribution The log likelihood for a log exact failure time y is
where EXT, INT, LFT, and RHT denote the sets of exact, interval, left-censored,
and right-censored data in a run, respectively. In practice, there usually are only
one or two types of data. Thus, the form of the log likelihood function is much
simpler than it appears in (5.24).
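As an illustration of how such likelihood contributions combine in practice, the Python sketch below evaluates the log likelihood of lognormally distributed lifetimes with exact and right-censored observations; the data, parameter values, and function name are invented for the example.

```python
# Log likelihood of lognormal life data with exact failures and right censoring.
import numpy as np
from scipy.stats import norm

def lognormal_loglik(mu, sigma, exact=(), right_censored=()):
    """mu, sigma: location and scale of log life; times are in original units."""
    ll = 0.0
    for t in exact:
        z = (np.log(t) - mu) / sigma
        ll += norm.logpdf(z) - np.log(sigma) - np.log(t)   # log density of t
    for t in right_censored:
        z = (np.log(t) - mu) / sigma
        ll += norm.logsf(z)                                # log survival probability
    return ll

print(lognormal_loglik(mu=6.0, sigma=0.5, exact=[300, 450, 620], right_censored=[800, 800]))
```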
[Table 5.8: Cross array — inner array L8(2^7), columns 1 to 7, crossed with the outer array noise conditions defined by z1, z2, and z3.]
[Figure 5.19: Projecting y(t1) to a normal censoring time t0 — degradation path y(t) versus t.]
[Figure: degradation path y(t) versus t.]
y = g(t; β1 , β2 , . . . , βp ) + e, (5.25)
The life derived from (5.26) is actually a pseudolife; it contains model error as
well as residual error. The pseudolife data and lifetimes observed are combined
to serve as observations of the quality characteristic in the subsequent design
optimization.
[Figure 5.21: Destructive degradation measurements of a larger-the-better characteristic y(t) at inspection times t1, ..., tk.]
Testing such products requires a relatively large sample size. A few samples at a
time are inspected destructively. Then the statistical distribution of the degrada-
tion at the time can be estimated. As testing proceeds to the next inspection time,
another group of samples is destructed for measurement. The destructive measure-
ments are taken at each inspection time. The last inspection exhausts all remaining
samples. Figure 5.21 plots the destructive measurements of a larger-the-better
characteristic at various times. The bullets represent the samples destructed. The
degradation paths in this plot are shown for illustration purposes and cannot be
observed in reality.
Once measurements have been taken at each inspection time, a statistical
distribution is fitted to these measurement data. In many applications, the mea-
surements may be modeled with a location-scale distribution (e.g., normal, log-
normal, or Weibull). For example, K. Yang and Yang (1998) report that the shear
strength of copper bonds is approximately normal, and Nelson (1990, 2004) mod-
els the dielectric breakdown strength of insulators with the lognormal distribution.
Let µy (ti ) and σy (ti ) be, respectively, the location and scale parameters of the
sample distribution at time ti , where i = 1, 2, . . . , k, and k is the number of
inspections. The distributions are shown in Figure 5.21. The estimates µ̂y (ti )
and σ̂y (ti ) can be obtained with graphical analysis or by the maximum like-
lihood method (Chapter 7). Through linear or nonlinear regression analysis on
µ̂y (ti ) and σ̂y (ti ) (i = 1, 2, . . . , k), we can build the regression models µy (t; β̂)
and σy (t; θ̂ ), where β̂ and θ̂ are the estimated vectors of the regression model
parameters. Then the reliability estimate at the time of interest, say τ , is
$$\hat{R}(\tau) = \Pr[y(\tau) \le G] = F\!\left[\frac{G - \mu_y(\tau; \hat{\beta})}{\sigma_y(\tau; \hat{\theta})}\right], \qquad (5.27)$$
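The estimation procedure just outlined is straightforward to script. The sketch below uses made-up inspection data, assumes normally distributed measurements and simple linear trends for the location and scale parameters, and evaluates (5.27) at a time of interest; every number in it is illustrative.

```python
# Reliability estimate from destructive degradation data, per (5.27).
import numpy as np
from scipy.stats import norm

t = np.array([100.0, 200.0, 300.0, 400.0])        # inspection times (hypothetical)
mu_hat = np.array([12.0, 18.0, 25.0, 33.0])       # estimated location at each time
sigma_hat = np.array([2.0, 2.5, 3.2, 4.0])        # estimated scale at each time

mu_fit = np.polyfit(t, mu_hat, 1)                 # linear model for mu_y(t)
sigma_fit = np.polyfit(t, sigma_hat, 1)           # linear model for sigma_y(t)

def reliability(tau, G):
    """R(tau) = Pr[y(tau) <= G], with G the failure threshold, as in (5.27)."""
    mu = np.polyval(mu_fit, tau)
    sigma = np.polyval(sigma_fit, tau)
    return norm.cdf((G - mu) / sigma)

print(reliability(tau=600.0, G=60.0))
```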
1. Dispersion factor: a control factor that has a strong effect on the dispersion
of quality characteristic (Figure 5.22a). In the figure, z is the noise factor
and A is the control factor; the subscript 0 represents the low level and
1 is the high level. This figure shows that the variation of noise factor
is transformed into the variability of quality characteristic. The quality
characteristic at A1 spreads more widely than at A0 . Therefore, A0 is a
better choice. Figure 5.22a also indicates that the dispersion factor interacts
with the noise factor. It is this interaction that provides an opportunity for
robustness improvement. In general, the level of a dispersion factor should
be chosen to minimize the dispersion of the quality characteristic.
2. Mean adjustment factor: a control factor that has a significant influence
on the mean and does not affect the dispersion of the quality characteristic
(Figure 5.22b). The response line at A0 over the noise range parallels that
at A1 , indicating that the mean adjustment factor does not interact with the
noise factor. In general, the level of a mean adjustment factor is selected
to bring the mean of the quality characteristic on target.
3. Dispersion and mean adjustment factor: a control factor that influences
both the dispersion and mean significantly (Figure 5.22c). This factor inter-
acts with the noise factor and should be treated as the dispersion factor. In
general, the level is set to minimize dispersion.
4. Insignificant factor: a control factor that affects significantly neither the
dispersion nor the mean (Figure 5.22d). The response at A0 over the noise
[Figure 5.22: Response y versus noise level (z0, z1) at control factor levels A0 and A1 — (a) dispersion factor; (b) mean adjustment factor; (c) dispersion and mean adjustment factor; (d) insignificant factor.]
where µy and σy2 are the mean and variance of the quality characteristic for a
setting of control factors. The larger the value of η, the more robust the product.
where log(·) denotes the common (base 10) logarithm. η is measured in deci-
bels (dB).
In application, the mean and variance in (5.30) are unknown. They can be esti-
mated from observations of the quality characteristic. For row i in Table 5.9, let
$$k_i = \sum_{j=1}^{l} n_{ij}, \qquad \bar{y}_i = \frac{1}{k_i}\sum_{j=1}^{l}\sum_{k=1}^{n_{ij}} y_{ijk}, \qquad D_i = k_i \bar{y}_i^2, \qquad i = 1, 2, \ldots, N.$$
[Table 5.9: General cross array — rows i = 1, ..., N of the inner array (factors and interactions) crossed with noise conditions j = 1, ..., l of the outer array; cell (i, j) holds the n_ij observations y_ijk.]
If ki is large, σ̂yi2 /ki becomes negligible. Then the signal-to-noise ratio can be
written as
$$\hat{\eta}_i \approx 10 \log \frac{\bar{y}_i^2}{\hat{\sigma}_{yi}^2} = 20 \log \frac{\bar{y}_i}{\hat{\sigma}_{yi}}, \qquad i = 1, 2, \ldots, N. \qquad (5.32)$$
Note that y i /σ̂yi is the reciprocal of the coefficient of variation, which measures
the dispersion of the quality characteristic. Therefore, maximizing the signal-to-
noise ratio minimizes the characteristic dispersion.
Smaller-the-Better Quality Characteristics For this type of quality characteris-
tics, the target is zero, and the estimates of the mean would be zero or negative.
As a result, we cannot use the log transformation as in (5.30). Rather, the signal-
to-noise ratio is defined as
η = −10 log(MSD), (5.33)
where MSD is the mean-squared deviation from the target value of the quality
characteristic. Because the target of the smaller-the-better type is zero, the MSD
for row i is given by
$$\mathrm{MSD}_i = \frac{1}{k_i}\sum_{j=1}^{l}\sum_{k=1}^{n_{ij}} y_{ijk}^2, \qquad i = 1, 2, \ldots, N.$$
where Rij is the reliability estimate at the cross combination of row i and column
j . Then the signal-to-noise ratio is
$$\hat{\eta}_i = -10 \log\!\left[\frac{1}{l}\sum_{j=1}^{l}\left(\frac{1}{R_{ij}} - 1\right)^2\right], \qquad i = 1, 2, \ldots, N. \qquad (5.36)$$
Example 5.6 Refer to Table 5.8. Suppose that the reliability estimates in the
first row are 0.92, 0.96, 0.8, and 0.87. Calculate the signal-to-noise ratio for
this row.
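These signal-to-noise calculations are one-liners; the Python sketch below evaluates (5.36) for the reliability estimates quoted in Example 5.6 and, for comparison, the smaller-the-better and larger-the-better forms used later in Examples 5.7 and 5.9 (the function names are mine).

```python
# Signal-to-noise ratios for one inner-array row.
import math

def sn_reliability(R):
    """Equation (5.36): S/N from reliability estimates R_ij of one row."""
    return -10.0 * math.log10(sum((1.0 / r - 1.0) ** 2 for r in R) / len(R))

def sn_smaller_the_better(y):
    """S/N = -10 log10(mean of y^2)."""
    return -10.0 * math.log10(sum(v * v for v in y) / len(y))

def sn_larger_the_better(y):
    """S/N = -10 log10(mean of 1/y^2)."""
    return -10.0 * math.log10(sum(1.0 / (v * v) for v in y) / len(y))

print(round(sn_reliability([0.92, 0.96, 0.80, 0.87]), 2))   # about 16.28 dB for Example 5.6
print(round(sn_smaller_the_better([15.0, 19.0]), 1))        # -24.7 dB, run 1 of Example 5.7
print(round(sn_larger_the_better([7.6, 8.2, 6.2, 6.9]), 2)) # 17.03 dB, run 1 of Example 5.9
```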
For each row of the inner array, a signal-to-noise ratio is calculated using
(5.32), (5.34), (5.35), or (5.36), depending on the type of quality characteris-
tic. Then a further analysis using the graphical response method or analysis of
variance is performed to determine the optimal setting of control factors.
2. Identify the control factors that significantly affect the signal-to-noise ratio
through graphical response analysis (Section 5.11.4) or analysis of variance
(Section 5.11.5).
3. Determine the optimal setting of the significant control factors by maxi-
mizing the signal-to-noise ratio.
4. Determine the levels of insignificant control factors in light of material
cost, manufacturability, operability, and simplicity.
5. Predict the signal-to-noise ratio at the optimal setting.
6. Conduct a confirmation test using the optimal setting to verify that the
optimal setting chosen yields the robustness predicted.
Nominal-the-Best Characteristics
1. Calculate the signal-to-noise ratio and the mean response (y i in Table 5.9)
for each row of the inner array.
2. Identify the significant control factors and categorize them into dispersion
factors, mean adjustment factors, or dispersion and mean adjustment factors
(treated as dispersion factors).
3. Select the setting of the dispersion factors to maximize the signal-to-
noise ratio.
4. Select the setting of the mean adjustment factors such that the estimated
quality characteristic is closest to the target.
5. Determine the levels of insignificant control factors based on consideration
of material cost, manufacturability, operability, and simplicity.
6. Predict the signal-to-noise ratio and the mean response at the optimal
setting.
7. Conduct a confirmation test using the optimal setting to check if the optimal
setting produces the signal-to-noise ratio and mean response predicted.
Example 5.7 The attaching clips in automobiles cause audible noise while
vehicles are operating. The two variables that may affect the audible noise level
are length of clip and type of material. Interaction between these two variables is
possible. The noise factors that influence the audible noise level include vehicle
speed and temperature. The levels of the control factors and noise factors are as
follows:
TABLE 5.10  Experimental Layout, Audible Noise Data, and Signal-to-Noise Ratios

       A    B    A×B    Audible noise data     η̂
Run    1    2     3
 1     0    0     0        15    19          −24.7
 2     0    1     1        47    49          −33.6
 3     1    0     1        28    36          −30.2
 4     1    1     0        26    29          −28.8

Then the average responses at levels 0 and 1 of factors A and B are computed:

Level      A        B
  0      −29.1    −27.4
  1      −29.5    −31.2

For example,

$$\bar{B}_0 = \frac{-24.7 - 30.2}{2} = -27.4.$$

Next, a two-way table is constructed for the average response of the interaction
between factors A and B:

          A0        A1
B0      −24.7     −30.2
B1      −33.6     −28.8
Having calculated the average response at each level of factors and interac-
tions, we need to determine significant factors and interactions and then select
optimal levels of the factors. The work can be accomplished using graphical
response analysis.
Graphical response analysis plots the average responses at the levels of each factor and
interaction and determines the significant factors and their optimal levels
from the graphs. The average response at a level of a factor is the sum of the
observations corresponding to that level divided by the number of observations
at the level. Example 5.7 shows the calculation of the average response at B0. Plot the
average responses on a chart in which the x-axis is the level of a factor and
y-axis is the response. Then connect the dots on the chart. This graph is known
as a main effect plot. Figure 5.23a shows the main effect plots for factors A and
B of Example 5.7. The average response of an interaction between two factors
is usually obtained using a two-way table in which a cross entry is the average
response, corresponding to the combination of the levels of the two factors (see
the two-way table for Example 5.7). Plot the tabulated average responses on a
chart where the x-axis is the level of a factor. The chart has more than one line
segment, each representing a level of the other factor, and is called an interaction
plot. The interaction plot for Example 5.7 is shown in Figure 5.23b.
The significance of factors and interactions can be assessed by viewing the
graphs. A steep line segment in the main effect plot indicates a strong effect of the
factor. The factor is insignificant if the line segment is flat. In an interaction plot,
parallel line segments indicate no interaction between the two factors. Otherwise
an interaction is existent. Let’s revisit Example 5.7. Figure 5.23a indicates that
factor B has a strong effect on the response because the line segment has a steep
slope. Factor A has little influence on the response because the corresponding
line segment is practically horizontal. The interaction plot in Figure 5.23b indi-
cates the lack of parallelism of the two line segments. Therefore, the interaction
between factors A and B is significant.
Once the significant control factors have been identified, the optimal setting
of these factors should be determined. If interaction is important, the optimal
levels of the factors involved are selected on the basis of the factor-level com-
bination that results in the most desirable response. For factors not involved in
an interaction, the optimal setting is the combination of factor levels at which
the most desirable average response is achieved. When the interaction between
two factors is strong, the main effects of the factors involved do not have much
meaning. The levels determined through interaction analysis should override the
levels selected from main effect plots. In Example 5.7, the interaction plot shows
that interaction between factors A and B is important. The levels of A and B
should be dictated by the interaction plot. From Figure 5.23b it is seen that A0 B0
164 RELIABILITY IMPROVEMENT THROUGH ROBUST DESIGN
[Figure 5.23: (a) Main effect plots of η̂ for factors A and B (levels 0 and 1); (b) interaction plot of η̂ for A × B, with one line for A0 and one for A1 plotted against B0 and B1.]
produces the largest signal-to-noise ratio. This level combination must be used
in design, although the main effect plot indicates that factor A is an insignificant
variable whose level may be chosen for other considerations (e.g., using a shorter
clip to save material cost).
If the experimental response is a nominal-the-best characteristic, we should
generate the main effect and interaction plots for both signal-to-noise ratio and
mean response. If a factor is identified to be both a dispersion factor and a
mean adjustment factor, it is treated as a dispersion factor. Its level is selected
to maximize the signal-to-noise ratio by using the strategy described above. To
determine the optimal levels of the mean adjustment factors, we enumerate the
average response at each combination of mean adjustment factor levels. The
average response at a level combination is usually obtained by the prediction
method described below. Then the combination is chosen to bring the average
response on target.
Once the optimal levels of factors have been selected, the mean response at the
optimal setting should be predicted for the following reasons. First, the prediction
DESIGN OPTIMIZATION 165
indicates how much improvement the robust design will potentially make. If the
gain is not sufficient, additional improvement using other techniques, such as the
tolerance design, may be required. Second, a subsequent confirmation test should
be conducted and the result compared against the predicted value to verify the
optimality of the design. The prediction is made based on an estimation of the
effects of significant factors and interactions. For convenience, we denote by T
and T the total of responses and the average of responses, respectively. Then
we have
N
T
T = yi , T = ,
i=1
N
where MET is a set of significant main effects, INT a set of significant inter-
actions, F i the average response of factor Fi at the optimal level, and F ij the
average response of the interaction between factors Fi and Fj at the optimal
levels. Because the effect of an interaction includes the main effects of the fac-
tors involved, the main effects should be subtracted from the interaction effect
as shown in the second term of (5.37). If the response is a nominal-the-best
characteristic, (5.37) should include all significant dispersion factors and mean
adjustment factors and interactions. Then apply the equation to estimate the
signal-to-noise ratio and the mean response.
In Example 5.7, B and A × B are significant and A0 B0 is the optimal setting.
The grand average response is T = −29.3. The signal-to-noise ratio predicted at
the optimal setting is obtained from (5.37) as
$$\hat{\eta} = \bar{T} + (\bar{B}_0 - \bar{T}) + [(\overline{A_0 B_0} - \bar{T}) - (\bar{A}_0 - \bar{T}) - (\bar{B}_0 - \bar{T})] = \overline{A_0 B_0} - \bar{A}_0 + \bar{T} = -24.7 + 29.1 - 29.3 = -24.9,$$
which is close to −24.7, the signal-to-noise ratio calculated from the experimental
data at A0 B0 and shown in Table 5.10.
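The arithmetic of Example 5.7 can be checked with a few lines of Python. The sketch below recomputes the row signal-to-noise ratios from the audible noise data in Table 5.10, the level averages, and the prediction at A0B0; exact values differ slightly from the rounded figures above.

```python
# Recompute Example 5.7: per-run S/N, level averages, and the prediction at A0 B0.
import math

runs = [
    (0, 0, [15.0, 19.0]),    # (level of A, level of B, audible noise observations)
    (0, 1, [47.0, 49.0]),
    (1, 0, [28.0, 36.0]),
    (1, 1, [26.0, 29.0]),
]

def sn_smaller(y):
    return -10.0 * math.log10(sum(v * v for v in y) / len(y))

eta = [sn_smaller(y) for _, _, y in runs]
T_bar = sum(eta) / len(eta)

def level_avg(which, level):        # which = 0 for factor A, 1 for factor B
    vals = [e for (a, b, _), e in zip(runs, eta) if (a, b)[which] == level]
    return sum(vals) / len(vals)

A0, B0 = level_avg(0, 0), level_avg(1, 0)
A0B0 = eta[0]                        # the only run at the combination A0, B0
prediction = A0B0 - A0 + T_bar       # simplified form of (5.37) for this example
print(round(A0, 1), round(B0, 1), round(T_bar, 1), round(prediction, 1))
# -> -29.1, -27.4, -29.3, -24.8 (the rounded hand calculation above gives -24.9)
```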
In general, a confirmation experiment should be conducted before implemen-
tation of the optimal setting in production. The optimality of the setting is verified
if the confirmation result is close to the value predicted. A statistical hypothesis
test may be needed to arrive at a statistically valid conclusion.
$$y_{..} = \sum_{i=1}^{p}\sum_{j=1}^{n} y_{ij}, \qquad \bar{y}_{..} = \frac{y_{..}}{N},$$

and the total corrected sum of squares can be partitioned as

$$SS_T = \sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{..})^2 = n\sum_{i=1}^{p} (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.})^2 = SS_A + SS_E, \qquad (5.38)$$

where

$$SS_A = n\sum_{i=1}^{p} (\bar{y}_{i.} - \bar{y}_{..})^2 \qquad \text{and} \qquad SS_E = \sum_{i=1}^{p}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.})^2.$$
SSA is called the sum of squares of the factor, and SSE is the sum of squares of
the error. Equation (5.38) indicates that the total corrected sum of squares can
be partitioned into these two portions.
Factor A has p levels; thus, SSA has p − 1 degrees of freedom. There are N
observations in the experiment, so SST has N − 1 degrees of freedom. Because
there are n observations in each of p levels providing n − 1 degrees of freedom
for estimating the experimental error, SSE has p(n − 1) = N − p degrees of
freedom. Note that the degrees of freedom for SST equals the sum of degrees of
freedom for SSA and SSE . Dividing the sum of squares by its respective degrees
of freedom gives the mean square MS: namely,
$$MS_x = \frac{SS_x}{df_x}, \qquad (5.39)$$
where x denotes A or E and dfx is the number of degrees of freedom for SSx .
The F statistic for testing the hypothesis that the mean responses for all levels
are equal is
$$F_0 = \frac{MS_A}{MS_E}, \qquad (5.40)$$
which has an F distribution with p − 1 and N − p degrees of freedom. We
conclude that factor A has a statistically significant effect at 100α% significance
level if
F0 > Fα,p−1,N−p .
$$SS_A = \sum_{i=1}^{p} \frac{y_{i.}^2}{n} - \frac{y_{..}^2}{N}, \qquad (5.42)$$

$$SS_E = SS_T - SS_A. \qquad (5.43)$$
Example 5.8 An experiment was designed to investigate the effects of the air-
to-fuel (A/F) ratio on the temperature of the exhaust valve in an automobile
engine. The experiment was replicated with four samples at each A/F ratio. The
experimental data are summarized in Table 5.13. Determine whether A/F ratio
has a strong influence at the 5% significance level.
SOLUTION The total and average of the temperature observations are com-
puted and summarized in Table 5.13. The grand total and grand average are
$$y_{..} = \sum_{i=1}^{3}\sum_{j=1}^{4} y_{ij} = 701 + 713 + \cdots + 768 = 8946, \qquad \bar{y}_{..} = \frac{8946}{12} = 745.5.$$

$$SS_A = \sum_{i=1}^{3} \frac{y_{i.}^2}{4} - \frac{y_{..}^2}{12} = \frac{2852^2 + 2995^2 + 3099^2}{4} - \frac{8946^2}{12} = 7689.5,$$

$$SS_E = SS_T - SS_A = 8311 - 7689.5 = 621.5.$$
The calculation of mean squares and F0 is straightforward. The values are sum-
marized in the ANOVA table as shown in Table 5.14. Because F0 = 55.64 >
F0.05,2,9 = 4.26, we conclude that the A/F ratio has a strong effect on the exhaust
valve temperature at the 5% significance level.
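The ANOVA of Example 5.8 can be reproduced from the group totals quoted there; the sketch below does so, taking SST = 8311 as given because the raw observations of Table 5.13 are not repeated here.

```python
# One-way ANOVA for Example 5.8 from the quoted totals.
from scipy.stats import f as f_dist

group_totals = [2852, 2995, 3099]    # temperature totals at the three A/F ratios
n = 4                                # replicates per level
N = n * len(group_totals)

SS_T = 8311.0                        # given in the example
SS_A = sum(t * t for t in group_totals) / n - sum(group_totals) ** 2 / N
SS_E = SS_T - SS_A

df_A, df_E = len(group_totals) - 1, N - len(group_totals)
F0 = (SS_A / df_A) / (SS_E / df_E)
print(round(SS_A, 1), round(SS_E, 1), round(F0, 1), round(f_dist.ppf(0.95, df_A, df_E), 2))
# -> 7689.5, 621.5, about 55.7 (55.64 with the example's rounding), and F_crit = 4.26
```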
ANOVA for Orthogonal Inner Arrays In the design of experiment, the purpose
of an outer array is to expose samples to noise factors. After the experimental data
are collected, the outer array has completed its role. The outer array usually is not
involved in subsequent ANOVA for design optimization unless we are interested
in understanding the effects of noise factors on the quality characteristic. The
optimal levels of design parameters are determined using analysis of variance for
the inner array.
A column of an inner array may be assigned to a factor, an interaction, or an
error (empty column). An I-level column in L_N(I^P) can be considered as an I-
level factor, each level having n = N/I replicates. Thus, (5.41) can be employed
to calculate the total corrected sum of squares of an inner array, and (5.42) applies
to a column of an inner array. Let T be the total of observations: namely,
$$T = \sum_{i=1}^{N} y_i,$$

where y_i represents η̂_i or ȳ_i as shown in, for example, Table 5.9. Then the total
corrected sum of squares can be written as

$$SS_T = \sum_{i=1}^{N} y_i^2 - \frac{T^2}{N}. \qquad (5.44)$$
Also, let T_j denote the total of observations taken at level j in a column. The
sum of squares of column i having I levels is

$$SS_i = \frac{I}{N}\sum_{j=0}^{I-1} T_j^2 - \frac{T^2}{N}. \qquad (5.45)$$

For a two-level column (I = 2), (5.45) reduces to

$$SS_i = \frac{(T_0 - T_1)^2}{N}. \qquad (5.46)$$
Now let's look at a simple array, L9(3^4). From (5.45), the sum of squares of
column 1 is

$$SS_1 = \frac{3}{9}\left[(y_1 + y_2 + y_3)^2 + (y_4 + y_5 + y_6)^2 + (y_7 + y_8 + y_9)^2\right] - \frac{1}{9}\left(\sum_{i=1}^{9} y_i\right)^2.$$
In an inner array, some columns may be empty and are treated as error
columns. The sum of squares for an error column is computed with (5.45). Then
170 RELIABILITY IMPROVEMENT THROUGH ROBUST DESIGN
the sums of squares for all error columns are added together. If an assigned col-
umn has a small sum of squares, it may be treated as an error column, and the
sum of squares should be pooled into the error term. The total corrected sum of
squares equals the total of the sums of squares for factor columns, interaction
columns, and error columns. Recall that the number of degrees of freedom is
I − 1 for an I-level column and is N − 1 for LN (IP ). The number of degrees of
freedom for error is the sum of the degrees of freedom for error columns. The
mean square and F statistic for a factor or interaction are computed from (5.39)
and (5.40), respectively. It is concluded that the factor or interaction is important
at the 100α% significance level if
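The column sums of squares and F ratios of an inner array follow directly from (5.44) to (5.46). The Python sketch below works through the calculation for a two-level inner array; the signal-to-noise values and the choice of error column are hypothetical.

```python
# Column sums of squares and F ratios for a two-level inner array.
from scipy.stats import f as f_dist

L8 = [  # inner array of Table 5.1, columns 1..7
    [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1], [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0], [1, 1, 0, 1, 0, 0, 1],
]
eta = [16.0, 14.0, 12.5, 13.0, 14.5, 15.0, 13.5, 14.8]   # hypothetical row S/N values
N, T = len(eta), sum(eta)

def ss_column(col):
    """Sum of squares of a two-level column, per (5.46)."""
    T0 = sum(e for row, e in zip(L8, eta) if row[col - 1] == 0)
    return (T0 - (T - T0)) ** 2 / N

ss = {c: ss_column(c) for c in range(1, 8)}
error_cols = [6]                          # suppose column 6 is an empty (error) column
ss_error = sum(ss[c] for c in error_cols)
df_error = len(error_cols)                # one degree of freedom per two-level column

for c in sorted(set(ss) - set(error_cols)):
    print(c, round(ss[c], 3), round(ss[c] / (ss_error / df_error), 2))
print("F critical at the 10% level:", round(f_dist.ppf(0.90, 1, df_error), 2))
```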
Example 5.9 Refer to Examples 5.4 and 5.5. The design of experiment for the
rear spade has four control factors and two interactions. L8(2^7) is used as an
inner array to accommodate the control factors. The outer array is filled with two
combinations of noise levels. Two test units of the same setting of control factors
were run at each of the two noise combinations. The fatigue life data (in 1000
cycles) are shown in Table 5.15.
Fatigue life is a larger-the-better characteristic. The signal-to-noise ratio for
each row of the inner array is computed using (5.35) and is shown in Table 5.15.
For example, the value of the ratio for the first row is
$$\hat{\eta}_1 = -10\log\!\left(\frac{1}{4}\sum_{i=1}^{4}\frac{1}{y_i^2}\right) = -10\log\!\left[\frac{1}{4}\left(\frac{1}{7.6^2} + \frac{1}{8.2^2} + \frac{1}{6.2^2} + \frac{1}{6.9^2}\right)\right] = 17.03.$$
[Table 5.15: Cross array and fatigue life data for the rear spade — inner array L8(2^7) with columns 1 to 7 assigned to D, B, D×B, C, D×C, error, and A; outer array noise conditions z11 and z12; the last column lists the row signal-to-noise ratios η̂.]

The total of the signal-to-noise ratios is

$$T = \sum_{i=1}^{8} \hat{\eta}_i = 17.03 + 14.55 + \cdots + 14.72 = 115.38.$$
The sum of squares for each column is calculated from (5.46). For example, the
sum of squares for column 2 (factor B) is
$$SS_2 = \frac{1}{8}(17.03 + 14.55 + 12.18 + 13.38 - 13.68 - 14.17 - 15.67 - 14.72)^2 = 0.15.$$
The sums of squares of the factors, interactions, and error are given in Table 5.16.
Because each column has two levels, the degrees of freedom for each factor,
interaction, and error is 1. Note that in the table the sum of squares for factor B
is pooled into error e to give the new error term (e). Thus, the new error term
has 2 degrees of freedom.
The critical value of the F statistic is F0.1,1,2 = 8.53 at the 10% significance
level. By comparing the critical value with the F0 values for the factors and inter-
actions in the ANOVA table, we conclude that A, D, and D × B have significant
effects, whereas B, C, and D × C are not statistically important.
For comparison, we use the graphical response method described in
Section 5.11.4 to generate the main effect plots and interaction plots, as shown in
Figure 5.24. Figure 5.24a indicates that the slopes for factors A and D are steep,
and thus these factors have strong effects, whereas factor B is clearly insignificant.
From the figure it is difficult to judge the importance of factor C because of the
marginal slope. This suggests that ANOVA should be performed. Figure 5.24b
shows that the interaction between factors B and D is very strong, although
factor B itself is not significant. As indicated in Figure 5.24c, there appears an
interaction between factors C and D because of the lack of parallelism of the
two line segments. The interaction, however, is considerably less severe than that
between B and D. ANOVA shows that this interaction is statistically insignificant,
but the value of F0 is close to the critical value.
Once the significant factors and interactions are identified, the optimal levels
should be selected. Because the interaction D × B has a strong effect, the levels
of B and D are determined by the interaction effect. From the interaction plot,
we choose B0 D0. Figure 5.24a, or the level totals T0 and T1 of factor A computed
in the ANOVA, suggests that A0 be selected. Because factor C is deemed insignificant, C0 is
chosen to maintain the current manufacturing process. In summary, the design
should use material type 1, forging thickness 7.5 mm, and bend radius 5 mm
with normal shot peening in manufacturing.
The value of the signal-to-noise ratio predicted at the optimal setting A0 B0 C0
D0 is obtained from (5.37) as
$$\hat{\eta} = \bar{T} + (\bar{A}_0 - \bar{T}) + (\bar{D}_0 - \bar{T}) + [(\overline{D_0 B_0} - \bar{T}) - (\bar{D}_0 - \bar{T}) - (\bar{B}_0 - \bar{T})] = \bar{A}_0 + \overline{D_0 B_0} - \bar{B}_0 = 15.06 + 15.49 - 14.28 = 16.27.$$
A confirmation test should be run to verify that the signal-to-noise ratio predicted
is achieved.
The estimated average fatigue life at the optimal setting A0 B0 C0 D0 is ŷ =
10^(η̂/20) × 1000 = 6509 cycles. This life estimate is the average of life data over
the noise levels and unit-to-unit variability.
In this section we describe the development of the robust reliability design method
for diagnostic systems whose functionality is different from that in common
hardware systems in that the signal and response of the systems are binary. In
particular, in this section we define and measure the reliability and robustness of
the systems. The noise effects are evaluated and the noise factors are prioritized.
[Figure 5.24: (a) Main effect plots of η̂ for factors A, B, C, and D (levels 0 and 1); (b) interaction plot for D × B, with lines D0 and D1 plotted against B0 and B1; (c) interaction plot for D × C, with lines D0 and D1 plotted against C0 and C1.]
The steps for robust reliability design are described in detail. An automotive
example is given to illustrate the method.
Robustness can be measured by α and β. These two types of errors are cor-
related negatively; that is, α increases as β decreases, and vice versa. Therefore,
it is frequently difficult to judge the performance of a diagnostic system using
α and β only. Reliability is a more reasonable and comprehensive metric to
measure performance.
G. Yang and Zaghati (2003) employ the total probability law and give the
reliability of a diagnostic system as

$$R(t) = [1 - M(t)](1 - \alpha) + M(t)(1 - \beta), \qquad (5.47)$$

where R(t) is the reliability of the diagnostic system and M(t) is the failure
probability of the prime system. Equation (5.47) indicates that:
• If the prime system is 100% reliable [i.e., M(t) = 0], the reliability of the
diagnostic system becomes 1 − α. This implies that the unreliability is due
to false detection only.
• If the prime system fails [i.e., M(t) = 1], the reliability of the diagnostic
system becomes 1 − β. This implies that the unreliability is due only to the
inability of the system to detect failures.
• If α = β, the reliability becomes 1 − α or 1 − β. This implies that M(t)
has no influence on the reliability.
• The interval of R(t) is 1 − β ≤ R(t) ≤ 1 − α if β > α (which holds in most
applications).
Case    α     β     M(t)    Effect on Reliability     Rank
  1     ×     ×      ×      −(1 + β − α)               1
  2     ×     ×             −1                         2
  3     ×            ×      −(1 + β − α − M)           3
  4           ×      ×      −(M + β − α)               5
  5     ×                   −(1 − M)                   4
  6           ×             −M                         6
  7                  ×      −(β − α)                   7
×, affected.
Estimates of reliability are used to compute the signal-to-noise ratio using (5.36).
Table 5.19 summarizes the estimates of reliability and signal-to-noise ratio.
Once the estimates of the signal-to-noise ratio are calculated, ANOVA or
graphical response analysis should be performed to identify the significant factors.
Optimal levels of these factors are chosen to maximize the signal-to-noise ratio.
Finally, the optimality of the setting of design parameters selected should be
verified through a confirmation test.
Test Method A sport utility vehicle installed with an on-board diagnostic mon-
itor with a current setting of design parameters was tested to evaluate the robust-
ness of the monitor. Load and engine speed [revolutions per minute (RPM)] are
the key noise factors disturbing the monitor. The combinations of load and RPM
are grouped into seven noise levels; at each level both the load and RPM vary
over an interval because of the difficulty in controlling the noise factors at fixed
levels. Table 5.20 shows the noise levels. The vehicle was driven at different
combinations of load and RPM. The prime system (component) being monitored
is expected to have 10% failure probability at the end of design life (τ = 10
years). Thus, failures at 10% probability were injected into the component under
monitor during the test trips. The test recorded the number of failures undetected
when failures were injected (M2 = 1), and the number of failures detected when
no failures were injected (M1 = 0).
Test Data At each noise level, the number of injected failures, the number
of injected failures undetected, the number of successful operations, and the
number of failures detected from the successful operations, denoted I1 , I2 , S1 ,
and S2 , respectively, are shown in Table 5.20. The test data are coded to protect
the proprietary information.
Data Analysis The estimates of α and β equal the values of S2 /S1 and I2 /I1 ,
respectively. The reliability of the monitor at 10 years at each noise level is
calculated from (5.47) with the α and β estimates and M(τ ) = 0.1. Then the
signal-to-noise ratio of the monitor is computed from (5.36). Table 5.21 summa-
rizes the estimates of α, β, reliability, and signal-to-noise ratio.
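The data-analysis step can be sketched as follows. The counts below are hypothetical (the actual data in Table 5.20 are coded), and the function name is ours; the signal-to-noise ratio would then be obtained from the estimated reliabilities using (5.36), which is not reproduced here.

```python
def estimate_monitor_reliability(I1, I2, S1, S2, M_tau=0.1):
    """Estimate alpha, beta, and monitor reliability at the design life.

    I1 : number of injected failures
    I2 : number of injected failures that went undetected
    S1 : number of successful (no-failure) operations
    S2 : number of failures reported during the successful operations
    M_tau : failure probability of the monitored component at the design life
    """
    alpha = S2 / S1                                       # false-detection probability
    beta = I2 / I1                                        # missed-detection probability
    R = (1 - alpha) * (1 - M_tau) + (1 - beta) * M_tau    # reliability from (5.47)
    return alpha, beta, R

alpha, beta, R = estimate_monitor_reliability(I1=40, I2=3, S1=200, S2=5)
print(alpha, beta, round(R, 4))   # 0.025 0.075 0.97
```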
wire bonds, it is desirable to make the wire bonds insensitive to thermal cycling.
Table 5.23 presents the parameters of thermal cycling used in the experiment as
a noise factor. In Table 5.23, Tmax and Tmin are the high and low temperatures of a thermal cycle, respectively, ΔT is the temperature difference between the high and low temperatures, and dT/dt is the temperature change rate.
[FIGURE 5.25 (a) μy varying with NTC for wire bonds tested at noise factor level 1; (b) σy varying with NTC for wire bonds tested at noise factor level 1; (c) μy varying with NTC for wire bonds tested at noise factor level 2; (d) σy varying with NTC for wire bonds tested at noise factor level 2]
Recent advancements in these topics are described briefly. The materials in this
section are helpful for performing a more efficient robust design.
As pointed out by Nair (1992), the signal-to-noise ratio has some drawbacks,
the major one being that maximizing it does not automatically minimize the quadratic quality loss (5.1) in situations where the variability
of a nominal-the-best characteristic is affected by all significant control factors.
The effect of this problem can, however, be mitigated by data transformation
through which the variability of the transformed data is made independent of
mean adjustment factors (Robinson et al., 2004). Box (1988) proposes the use of
lambda plots to identify the transformation that yields the independence.
An alternative to the signal-to-noise ratio is the location and dispersion model. For each run in Table 5.9, ȳi and ln(si²), representing the sample mean and log sample variance over the noise replicates, are used as the measures of location and dispersion. They are
ȳi = (1/ki) \sum_{j=1}^{l} \sum_{k=1}^{n_ij} y_ijk,    si² = [1/(ki − 1)] \sum_{j=1}^{l} \sum_{k=1}^{n_ij} (y_ijk − ȳi)²,    i = 1, 2, . . . , N,    (5.49)
where ki = \sum_{j=1}^{l} n_ij. The design optimization based on these two measures proceeds in two steps:
1. Select the levels of the mean adjustment factors to minimize (or maximize)
the location.
2. Choose the levels of the dispersion factors that are not mean adjustment
factors to minimize the dispersion.
C. F. Wu and Hamada (2000) and Nair et al. (2002) discuss in greater detail
use of the location and dispersion model to achieve robustness.
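For concreteness, a small sketch of the location and dispersion measures in (5.49) for a single inner-array run; the observations and helper name are made up.

```python
import math

def location_dispersion(y_run):
    """Location (sample mean) and dispersion (log sample variance) of (5.49)
    for one inner-array run.

    y_run : list of lists, one list of replicate observations per noise condition
    """
    obs = [y for cond in y_run for y in cond]          # pool all k_i observations
    k = len(obs)
    y_bar = sum(obs) / k
    s2 = sum((y - y_bar) ** 2 for y in obs) / (k - 1)
    return y_bar, math.log(s2)

# One run observed under two noise conditions with two replicates each
y_bar, log_s2 = location_dispersion([[48.2, 51.0], [45.5, 47.1]])
print(round(y_bar, 2), round(log_s2, 3))
```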
The desirability for each response depends on the type of response. For a
nominal-the-best response, the individual desirability is
di = ((yi − Li)/(mi − Li))^wL,    Li ≤ yi ≤ mi,
   = ((yi − Hi)/(mi − Hi))^wH,    mi < yi ≤ Hi,    (5.51)
   = 0,    yi < Li or yi > Hi,
where mi , Li , and Hi are the target and minimum and maximum allowable values
of y, respectively, and wL and wH are positive constants. These two constants
are equal if the value of a response smaller than the target is as undesirable as a
value greater than the target.
For a smaller-the-better response, the individual desirability is
di = 1,    yi ≤ Li,
   = ((yi − Hi)/(Li − Hi))^w,    Li < yi ≤ Hi,    (5.52)
   = 0,    yi > Hi,
where w is a positive constant and Li is a small enough number.
For a larger-the-better response, the individual desirability is
di = 0,    yi ≤ Li,
   = ((yi − Li)/(Hi − Li))^w,    Li < yi ≤ Hi,    (5.53)
   = 1,    yi > Hi,
where w is a positive constant and Hi is a large enough number.
The desirability of each response depends on the value of the exponent. The
choice of the value is arbitrary and thus subjective. In many situations it is difficult
to specify meaningful minimum and maximum allowable values for a smaller-
the-better or larger-the-better response. Nevertheless, the method has found many
applications in industry (see, e.g., Dabbas et al., 2003; Corzo and Gomez, 2004).
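The three desirability functions translate directly into code. The sketch below is a straightforward rendering of (5.51) to (5.53); the function names and example numbers are ours, and in practice the individual desirabilities are combined into an overall desirability (e.g., a geometric mean) before optimization.

```python
def desirability_nominal(y, L, m, H, wL=1.0, wH=1.0):
    """Individual desirability for a nominal-the-best response, per (5.51)."""
    if y < L or y > H:
        return 0.0
    if y <= m:
        return ((y - L) / (m - L)) ** wL
    return ((y - H) / (m - H)) ** wH

def desirability_smaller(y, L, H, w=1.0):
    """Individual desirability for a smaller-the-better response, per (5.52)."""
    if y <= L:
        return 1.0
    if y > H:
        return 0.0
    return ((y - H) / (L - H)) ** w

def desirability_larger(y, L, H, w=1.0):
    """Individual desirability for a larger-the-better response, per (5.53)."""
    if y <= L:
        return 0.0
    if y > H:
        return 1.0
    return ((y - L) / (H - L)) ** w

# A nominal-the-best response with target 10 and allowable range [8, 13]:
print(desirability_nominal(9.0, L=8, m=10, H=13))    # 0.5
print(desirability_smaller(0.4, L=0.1, H=1.0))       # about 0.67
print(desirability_larger(75, L=50, H=100))          # 0.5
```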
Loss Function Approach Loss function for multiple responses, described
by Pignatiello (1993), Ames et al. (1997), and Vining (1998), is a natural
extension of the quality loss function for a single response. The loss function
for a nominal-the-best response is
L = (Y − my)^T K(Y − my),    (5.54)
where Y = (y1, y2, . . . , ym) is the response vector, my = (my1, my2, . . . , mym) is the target vector, and K is an m × m matrix whose elements are constants.
The values of the constants are related to the repair and scrap cost of the product
and may be determined based on the functional requirements of the product. In
general, the diagonal elements of K measure the weights of the m responses, and
the off-diagonal elements represent the correlations between these responses.
Like the single-response case, the loss function (5.54) can be extended to mea-
sure the loss for a smaller-the-better or larger-the-better response (Tsui, 1999).
For a smaller-the-better response, we replace the fixed target myi with zero. For a
larger-the-better response, the reciprocal of the response is substituted into (5.54)
and treated as the smaller-the-better response.
If Y has a multivariate normal distribution with mean vector µ and variance–covariance matrix Σ, the expected loss can be written as
E(L) = (µ − my)^T K(µ − my) + trace(KΣ),    (5.55)
where µ and Σ are functions of the control factors and noise factors and can be estimated from experimental data by multivariate analysis methods. The methods are described in, for example, Johnson (1998).
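Evaluating (5.55) is straightforward once µ, Σ, K, and the target vector are in hand. A minimal sketch with made-up numbers; the function name is ours.

```python
import numpy as np

def expected_quadratic_loss(mu, Sigma, K, m_y):
    """Expected multiresponse quality loss per (5.55):
    E(L) = (mu - m_y)^T K (mu - m_y) + trace(K Sigma)."""
    d = np.asarray(mu, dtype=float) - np.asarray(m_y, dtype=float)
    K = np.asarray(K, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    return float(d @ K @ d + np.trace(K @ Sigma))

# Two correlated responses with targets (10, 5); K weights deviations and correlation.
mu = [10.4, 4.7]
m_y = [10.0, 5.0]
Sigma = [[0.25, 0.05],
         [0.05, 0.09]]
K = [[2.0, 0.3],
     [0.3, 1.0]]
print(expected_quadratic_loss(mu, Sigma, K, m_y))   # about 0.96
```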
The simplest approach to obtaining the optimal setting of control factors is to
directly minimize the expected loss (5.55). The direct optimization approach is
used by, for example, Romano et al. (2004). Because the approach may require
excessive time to find the optimal setting when the number of control factors is
large, some indirect but more efficient optimization procedures have been pro-
posed. The most common one is the two-step approach, which finds its root in
Taguchi’s two-step optimization approach for a single response. The approach
first minimizes an appropriate variability measure and then brings the mean
response on its target. Pignatiello (1993) and Tsui (1999) describe this two-step
approach in detail.
In the first-order experiment, the response is related to the experimental variables through a first-order model of the form
y = β0 + \sum_{i=1}^{n} βi xi + e,
where y is the response, xi the experimental variable, e the residual error, βi the coefficient representing the linear effect of xi to be estimated from experimental data, and n the number of experimental variables. Once the relationship is built, a search must be conducted over the experimental region to determine whether curvature of the response is present. If it is, a second-order experiment should be conducted to build and estimate the second-order relationship given by
y = β0 + \sum_{i=1}^{n} βi xi + \sum_{i<j} βij xi xj + \sum_{i=1}^{n} βii xi² + e,    (5.57)
where βi is the coefficient representing the linear effect of xi , βij the coefficient
representing the linear-by-linear interaction between xi and xj , and βii the coeffi-
cient representing the quadratic effect of xi. The optimum region of the experimental variables is found by differentiating y in (5.57) with respect to each xi and setting the derivatives to zero. C. F. Wu and Hamada (2000) and Myers and Montgomery (2002), for
example, describe in detail the design and analysis of RSM experiments.
The principle of RSM can be applied to improve the optimality of the design
setting obtained from ANOVA or graphical response analysis (K. Yang and Yang,
1998). In the context of robust design presented in this chapter, the response y
in (5.57) is the signal-to-noise ratio. If there exists an interaction or quadratic
effect, the relationship between signal-to-noise ratio and control factors may be
modeled by (5.57), where y is replaced with η. The model contains 1 + 2n + n(n − 1)/2 parameters. To estimate them, the experiment must contain at least 1 + 2n + n(n − 1)/2 runs, and each factor must have at least three levels. The use of some orthogonal arrays, such as L9(3^4) and L27(3^13),
satisfies the requirements; thus, it is possible to continue response surface analysis
for the experimental design. The optimal setting may be obtained through the use
of standard methods for response surface analysis as described in C. F. Wu and
Hamada (2000) and Myers and Montgomery (2002).
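A hedged sketch of the response surface step: the second-order model (5.57) is fitted to signal-to-noise ratios by least squares and the stationary point is located from the gradient equations. The design, responses, and helper code below are illustrative only (they are not the book's example).

```python
import numpy as np

# Illustrative SNR values (dB) at nine settings of two coded factors (levels -1, 0, 1)
x1 = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1], dtype=float)
x2 = np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1], dtype=float)
eta = np.array([12.1, 13.0, 12.6, 13.4, 14.5, 14.1, 13.0, 14.2, 13.6])

# Model (5.57): eta = b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
b, *_ = np.linalg.lstsq(X, eta, rcond=None)
b0, b1, b2, b12, b11, b22 = b

# Stationary point: set the partial derivatives of eta to zero and solve
A = np.array([[2 * b11, b12], [b12, 2 * b22]])
x_star = np.linalg.solve(A, -np.array([b1, b2]))
print("coefficients:", np.round(b, 3))
print("stationary point (coded units):", np.round(x_star, 3))
# If A is negative definite, the stationary point maximizes the fitted SNR.
```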
RSM assumes that all variables are continuous and derivatives exist. In prac-
tice, however, some design parameters may be discrete variables such as the
type of materials. In these situations, (5.57) is still valid. But it cannot be used to
determine the optima because the derivative with respect to a discrete variable is
not defined. To continue the response surface analysis, we suppose that there are
n1 discrete variables and n2 continuous variables, where n1 + n2 = n. Because
the optimal levels of the n1 discrete factors have been determined in previous
analysis by using the graphical response method or ANOVA, the response sur-
face analysis is performed for the n2 continuous variables. Since the levels of n1
variables have been selected, only the parameter settings that contain combina-
tions of the selected levels of the n1 variables can be used for response surface
analysis. In general, the number of such settings is
ws = N \prod_{i=1}^{n_1} (1/qi),    (5.58)
where N is the run size of an inner array and qi is the number of levels of discrete
variable xi. Refer to Table 5.3, for example. L9(3^4) is used to accommodate four
design parameters, one of which is assumed to be discrete and assigned to column
1. Suppose that ANOVA has concluded that level 1 is the optimal level for this
variable. Then ws = 9/3 = 3, because only the responses from runs 4, 5, and 6
can apply to our response surface analysis.
Excluding discrete variables from modeling, (5.57) includes only n2 con-
tinuous variables and has 1 + 2n2 + n2 (n2 − 1)/2 parameters to be estimated.
Therefore, we require that ws ≥ 1 + 2n2 + n2 (n2 − 1)/2. This requirement is fre-
quently unachievable when n1 ≥ 2. Thus, in most situations the response surface
analysis is applicable when only one discrete design parameter is involved.
For accelerated testing in robust design, the location parameter of the life distribution is modeled as a function of the control factor and the accelerating noise factor:
µ = β0 + β1 x + β2 z + β12 xz,    (5.59)
where x is the control factor, z the noise factor, and β0, β1, β2, and β12 are the coefficients to be estimated from experimental data.
The acceleration factor between the life at noise level z1 and that at noise
level z2 is
Af = exp(µ1)/exp(µ2) = exp(µ1 − µ2),    (5.60)
where Af is the acceleration factor and µi is the location parameter at noise
level i. Chapter 7 presents in detail definition, explanation, and computation of
the acceleration factor. For a given level of control factor, the acceleration factor
between noise levels z1 and z2 is obtained by substituting (5.59) into (5.60). Then
we have
Af = exp[(z1 − z2 )(β2 + β12 x)], (5.61)
which indicates that the acceleration factor is a function of the control factor
level. This is generally true when there are interactions between control factors
and accelerating noise factors.
Accelerated test data may lead to a falsely optimal setting of design parameters
if an acceleration factor depends on the level of control factor. This is illustrated
by the following arguments. For the sake of convenience, we still assume that
robust design involves one design parameter and one noise factor. The control
factor has two levels: x0 and x1 . The noise factor also has two levels: z1 and
z2 , where z1 is within the normal use range and z2 is an elevated level. Let yij
(i = 0, 1; j = 1, 2) denote the life at xi and zj . Also, let Af 0 and Af 1 denote
the acceleration factors at x0 and x1 , respectively. Then from (5.35), the signal-
to-noise ratio η̂0 at x0 is
η̂0 = −10 log[(1/2)(1/y01² + 1/y02²)] = 10 log(2) − 10 log[(1 + Af0²)/y01²].
Similarly, the signal-to-noise ratio η̂1 at x1 is
η̂1 = 10 log(2) − 10 log[(1 + Af1²)/y11²].
If we have
(1 + Af0²)/y01² < (1 + Af1²)/y11²,    (5.62)
then η̂0 > η̂1 . This indicates that x0 is the optimal level for design. Note that this
conclusion is drawn from the accelerated test data.
Now suppose that the experiment is conducted at noise factor levels z0 and z1 ,
where both z0 and z1 are within the normal use range. The noise level z1 remains
the same and z0 = 2z1 − z2 . Then, from (5.61), we know that Af 0 equals the
acceleration factor at x0 between the life at z0 and the life at z1 , and Af 1 equals
the acceleration factor at x1 between the life at z0 and the life at z1 . Let yij
(i = 0, 1; j = 0, 1) be the life at xi and zj . The signal-to-noise ratio η̂0 at x0 is
η̂0 = −10 log[(1/2)(1/y00² + 1/y01²)] = 10 log(2) − 10 log[(1 + Af0²)/(Af0² y01²)].
Similarly, the signal-to-noise ratio η̂1 at x1 is
η̂1 = 10 log(2) − 10 log[(1 + Af1²)/(Af1² y11²)].
If (5.62) holds and Af0 ≥ Af1, then
(1 + Af0²)/(Af0² y01²) < (1 + Af1²)/(Af1² y11²);    (5.63)
that is, η̂0 > η̂1 . Therefore, x0 is the optimal level. This agrees with the conclu-
sion drawn from the accelerated test data. However, if (5.62) holds and Af 0 <
Af1, (5.63) may not be valid. That is, x0 may not be the optimal level. In such a case, the conclusion derived from the accelerated test data is faulty.
Let’s consider a simple example that illustrates the accelerated test data leading
to an erroneous conclusion. Suppose that the life y is 50 at (x0 = 0, z1 = 1), 25
at (x0 = 0, z2 = 2), 60 at (x1 = 1, z1 = 1), and 15 at (x1 = 1, z2 = 2). The value
of the acceleration factor is 2 at x0 and 4 at x1 calculated from these life data.
The value of the signal-to-noise ratio is 30 at x0 and 26.3 at x1 . It would be
concluded that x0 is the optimal level based on the accelerated test data. Now
suppose that the experiment is conducted at the noise factor levels within the
normal use range: z0 and z1 . The life y is 100 at (x0 = 0, z0 = 0) and 240 at
(x1 = 1, z0 = 0); both are derived by applying the acceleration factors. The value
of the signal-to-noise ratio is 36.0 at x0 and 38.3 at x1 . Then we would conclude
that x1 is the optimal level, which contradicts the conclusion drawn from the
accelerated test data.
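The contradiction in this example is easy to reproduce with the larger-the-better signal-to-noise ratio of (5.35), written out explicitly in the sketch below; the helper name is ours.

```python
import math

def snr_larger_the_better(lives):
    """Larger-the-better SNR: -10 log10[(1/n) * sum(1/y^2)], as in (5.35)."""
    n = len(lives)
    return -10 * math.log10(sum(1.0 / y**2 for y in lives) / n)

# Accelerated experiment: lives at noise levels z1 (normal range) and z2 (elevated)
print(round(snr_larger_the_better([50, 25]), 1))    # x0: 30.0 -> appears better
print(round(snr_larger_the_better([60, 15]), 1))    # x1: 26.3

# Lives at z0 and z1, both within the normal use range, derived with the
# level-dependent acceleration factors (Af = 2 at x0, Af = 4 at x1)
print(round(snr_larger_the_better([100, 50]), 1))   # x0: 36.0
print(round(snr_larger_the_better([240, 60]), 1))   # x1: 38.3 -> x1 is actually better
```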
In robust design, accelerating noise factors should be those that are indepen-
dent of control factors, to avoid possible faulty conclusions. The high levels of
accelerating noise factors should be as high as possible to maximize the number
of failures or the amount of degradation, but they should not induce failure modes
that are different from those in the normal use range. The low levels of the accel-
erating noise factors should be as low as possible to maximize the range of noise
levels, but they should generate sufficient failures or degradation information.
Unfortunately, such independent noise factors, if any, may be difficult to identify
before experiments are conducted. In these situations, experience, engineering
judgment, or similar data may be used. A preliminary accelerated test may also
be performed to determine the independence.
In addition to noise factors, some control factors may also serve as accelerating
variables, as described in Chapter 7 and by Joseph and Wu (2004). Such control
factors should have large effects on failure or degradation and the effects are
known based on physical knowledge of the product. In traditional experiments,
these factors are of no direct interest to designers; they are not involved in the
design of experiment and their levels are kept at normal values in experiment. In
contrast, in accelerated robust testing, these factors are elevated to higher levels
to increase the number of failures or the amount of degradation. The analysis
of accelerated test data yields the optimal setting of design parameters. The
accelerating control factors are set at normal levels in actual product design. For
the conclusion to be valid at normal levels of the accelerating control factors,
these factors should not interact with other control factors. This requirement
restricts use of the method because of the difficulty in identifying independent
accelerating control factors.
In summary, accelerated testing is needed in many experiments. However,
accelerating variables should not interact with control factors. Otherwise, the
optimality of design setting may be faulty.
PROBLEMS
5.1 Develop a boundary diagram, P-diagram, and strategy of noise effect man-
agement for a product of your choice. Explain the cause-and-effect relations
between the noise factors and failure modes.
5.2 A robust design is to be conducted for improving the reliability and robust-
ness of an on–off solenoid installed in automobiles. The solenoid is tested to
failure, and life (in on–off cycles) is the experimental response. The design
parameters selected for experiment are as follows:
ž A: spring force (gf); A0 = 12; A1 = 18; A2 = 21
5.6 The fatigue life of an automotive timing belt and its variability were improved
through robust reliability design. The design parameters are belt width (factor
A), belt tension (B), belt coating (C), and tension damping (D). Each design
parameter has two levels. The interaction between B and D is of interest. The
noise factor is temperature, which was set at three levels: 60, 100, and 140°C.
The experimental response is cycles to failure measured by a life index,
which is a larger-the-better characteristic. The experiment was censored at
the number of cycles translated to 2350 of the life index. The design life
was equivalent to 1500 of the life index. The experimental layout and life
index are given in Table 5.27. Suppose that the life index is Weibull and
its relation with temperature can be modeled with the Arrhenius relationship
(Chapter 7).
TABLE 5.28 Experimental Layout and Response Data for the Oxygen Sensor

       Inner Array                     Outer Array
Run   A(1)  B(2)  C(3)  D(4)    Noise Level 1    Noise Level 2
 1     0     0     0     0         11   15          22   16
 2     0     1     1     1          8   12          16   14
 3     0     2     2     2         15   19          26   23
 4     1     0     1     2         17   23          28   35
 5     1     1     2     0          7   11          18   22
 6     1     2     0     1          8    5          15   18
 7     2     0     2     1         23   19          20   26
 8     2     1     0     2         16   20          24   26
 9     2     2     1     0          8   13          17   15
(a) Estimate the reliability at the design life for each combination of design
setting and temperature.
(b) Perform ANOVA to identify the significant factors. Determine optimal
levels of the design parameters.
(c) Predict the signal-to-noise ratio at the optimal setting. Calculate the aver-
age reliability over the three temperatures.
5.7 To improve the reliability of an oxygen sensor, four design parameters (A to
D) were chosen and assigned to L9(3^4) as shown in Table 5.28. The sensors
were tested at two levels of compounding noise (temperature and humidity)
and the response voltages at a given oxygen level were recorded. The sensors
are said to have failed if the response voltage drifts more than 30% from the
specified value. The drift percentages at termination of the test are given in
Table 5.28.
(a) Perform graphical response analysis to identify the significant factors.
(b) Carry out ANOVA to determine the significant factors. What are optimal
levels of the design parameters?
(c) Predict the signal-to-noise ratio at the optimal setting.
(d) The current design setting is A0 B0 C0 D0 . How much robustness improve-
ment will the optimal setting achieve?
(e) Calculate the average drift percentage over the noise levels.
(f) Are the interactions between the design parameters and the compounding
noise significant?
6  POTENTIAL FAILURE MODE AVOIDANCE
6.1 INTRODUCTION
techniques that help achieve this objective include failure mode and effects
analysis (FMEA) and fault tree analysis (FTA). Computer-aided design analysis
is another error detection technique, one that is now widely implemented thanks to advances in computer technology and software engineering.
This technique includes mechanical stress analysis, thermal analysis, vibration
analysis, and other methods. In this chapter we describe these three types of
techniques.
System FMEA System FMEA is sometimes called concept FMEA because the
analysis is carried out in the concept development stage. This is the highest-level
FMEA that can be performed and is used to analyze and prevent failures related
to technology and system configuration. This type of FMEA should be carried
out as soon as a system design (i.e., the first stage of robust design) is completed,
to validate that the system design minimizes the risk of functional failure dur-
ing operation. Performed properly, system FMEA is most efficient economically
because any changes in the concept design stage would incur considerably less
cost than in subsequent stages.
System FMEA has numerous benefits. It helps identify potential systemic fail-
ure modes caused by the deficiency of system configurations and interactions with
other systems or subsystems. This type of FMEA also aids in (1) examining sys-
tem specifications that may induce subsequent design deficiencies, (2) selecting
the optimal system design alternative, (3) determining if a hardware system
redundancy is required, and (4) specifying system-level test requirements. It acts
as a basis for the development of system-level diagnosis techniques and fault
management strategy. More important, system FMEA enables actions to ensure
customer satisfaction to be taken as early as in the concept design stage, and is
an important input to the design FMEA to follow.
Design FMEA Design FMEA is an analytical tool that is used to (1) identify
potential failure modes and mechanisms, (2) assess the risk of failures, and
(3) provide corrective actions before the design is released to production. To
achieve the greatest value, design FMEA should start before a failure mode is
unknowingly designed into a product. In this sense, FMEA serves as an error pre-
vention tool. In reality, however, design FMEA is frequently performed as soon
as the first version of the design is available, and remedial actions are developed
based on the analysis to eliminate or alleviate the failure modes identified. The
bottom line is that a design should avoid critical failure modes.
The major benefit of design FMEA is in reducing the risk of failure. This is
achieved by identifying and addressing the potential failure modes that may have
adverse effects on environment, safety, or compliance with government regula-
tions in the early design stage. Design FMEA also aids in objective evaluation
of design in terms of functional requirements, design alternatives, manufactura-
bility, serviceability, and environmental friendliness. It enables actions that ensure customer satisfaction to be initiated as early in the design stage as possible. In
addition, this type of FMEA helps the development of design verification plans,
production control plans, and field service strategy. The outputs from design
FMEA are the inputs to process FMEA.
[Figure: FMEA process flow, including steps such as define the system, select a component, calculate the RPN, and recommend actions]
substituted with a different one that reflects the distinctions of a product and the
needs of an organization.
understanding of the target and neighboring systems and their interactions. In this
step it is often required to perform an interaction analysis and create a boundary
diagram using the method presented in Chapter 5. Interaction analysis also pro-
vides assistance in subsequent FMEA steps in (1) understanding the effects of
failure modes on other systems and end customers, (2) evaluating the severity of
the effects, and (3) discovering the failure mechanisms that may have originated
from other systems.
Once defined, the system should be broken down into subsystems, modules,
or components, depending on the objective of the FMEA study. Analysis at a
high level (subsystem or module level) is usually intended to determine the area
of high priority for further study. Analysis at the component level is technically
more desirable and valuable, in that it usually leads to a determination of the
causes of failure. As such, FMEA is performed mostly at the component level
in practice.
potential effects are identified by asking the question: What will be the conse-
quences if this failure happens? The consequences are evaluated with respect
to the function of the item being analyzed. Because there exists a hierarchical
relationship between the component, module, subsystem, and system levels, the
item failure under consideration may affect the system adversely at several lev-
els. The lowest-level effects are the local effects, which are the consequences
of the failure on the local operation or function of the item being analyzed. The
second-level effects are the next-level effects, which are the impacts of the failure
on the next-higher-level operation or function. The failure effect at one level of
system hierarchy is the item failure mode of the next-higher level of which the
item is a component. The highest-level effects are the end effects, which are the
effects of the failure on the system functions and can be noticed by the end cus-
tomers. For instance, the end effects of a failure occurring to an automobile can
be noise, unpleasant odor, thermal event, erratic operation, intermittent operation,
inoperativeness, roughness, instability, leak, or others.
The end effect of a failure is assessed in terms of severity. In FMEA, severity
is a relative ranking within the scope of a particular study. The military industry
commonly employs a four-level classification system ranking from “catastrophic”
to “minor,” as shown in Table 6.1. In this ranking system, the lowest-ranking
index measures the highest severity, and vice versa. The automotive industry
generally uses a 10-level ranking system, as shown in Table 6.2; the effect of a
failure mode is described as the effect on customers or conformability to gov-
ernment regulations. A failure mode may have multiple effects, each of which
has its own severity rating. Only the most serious effect rating is entered into the
severity column of the FMEA worksheet for calculating RPN.
I Catastrophic: a failure that can cause death or system (e.g., aircraft, tank,
missile, ship) loss
II Critical: a failure that can cause severe injury, major property damage, or
minor system damage which will result in mission loss
III Marginal: a failure that may cause minor injury, minor property damage, or
minor system damage which will result in delay or loss of availability or
mission degradation
IV Minor: a failure not serious enough to cause injury, property damage, or system
damage, but which will result in unscheduled maintenance or repair
Source: U.S. DoD (1984).
TABLE 6.2 Severity Rankings (the number in parentheses is the ranking)
Hazardous without warning (10): Very high severity ranking when a potential failure mode affects safe vehicle operation and/or involves noncompliance with government regulation without warning.
Hazardous with warning (9): Very high severity ranking when a potential failure mode affects safe vehicle operation and/or involves noncompliance with government regulation with warning.
Very high (8): Vehicle/item inoperable (loss of primary function).
High (7): Vehicle/item operable but at a reduced level of performance; customer very dissatisfied.
Moderate (6): Vehicle/item operable but comfort/convenience item(s) inoperable; customer dissatisfied.
Low (5): Vehicle/item operable but comfort/convenience item(s) operable at a reduced level of performance; customer somewhat dissatisfied.
Very low (4): Fit and finish/squeak and rattle item does not conform; defect noticed by most customers (greater than 75%).
Minor (3): Fit and finish/squeak and rattle item does not conform; defect noticed by 50% of customers.
Very minor (2): Fit and finish/squeak and rattle item does not conform; defect noticed by discriminating customers (less than 25%).
None (1): No discernible effect.
Source: SAE (2002).
hunting design mistakes. It is this trait that enables FMEA to be a technique for
detecting and correcting design errors.
Failure mechanisms are discovered by asking and addressing a number of
“what” and “why” questions such as: What could cause the item to fail in
this manner? Why could the item lose its function under the operating condi-
tions? Techniques such as fault tree analysis and cause-and-effect diagrams are
instrumental in determining the failure mechanisms of a particular failure mode.
Examples of failure mechanisms include incorrect choice of components, misuse
of materials, improper installation, overstressing, fatigue, corrosion, and others.
Identified failure mechanisms are assigned with respective occurrence values.
Here, the occurrence is defined as the likelihood that a specific failure mechanism
will occur during the design life. Occurrence is not the value of probability in
the absolute sense; rather, it is a relative ranking within the scope of the FMEA.
The automotive industry uses a 10-level ranking system, as shown in Table 6.3.
The highest-ranking number indicates the most probable occurrence, whereas the
lowest one is for the least probable occurrence. The probable failure rate of a
failure mechanism can be estimated with the assistance of historical data, such
as the previous accelerated test data or warranty data. When historical data are
used, the impacts of changes in design and operating condition on the failure rate
during the design life should be taken into account.
The severity, occurrence, and detection rankings are multiplied to give the risk priority number:
RPN = S × O × D.    (6.1)
Example 6.1 To reduce the amount of nitrogen oxide emission and increase
fuel efficiency, automobiles include an exhaust gas recirculation (EGR) system
that recycles a fraction of exhaust gases to the inlet manifold, where the exhaust
gas is mixed with fresh air. A typical EGR system consists of several subsys-
tems, including the EGR valve, delta pressure feedback EGR (DPFE) sensor,
EGR vacuum regulator (EVR), powertrain control module (PCM), and tubes, as
shown in Figure 6.3. The exhaust gas is directed from the exhaust pipeline to the
EGR valve via the EGR tube. The EGR valve regulates EGR flow to the intake
manifold. The desired amount of EGR is determined by the EGR control strategy
and calculated by the PCM. The PCM sends a control signal to the EVR, which
regulates the vacuum directed to the EGR valve. Energized by the vacuum, the
EGR valve allows the right amount of EGR to flow into the intake manifold. The
EGR system is a closed-loop system in which a DPFE sensor is used to measure
the EGR flow and provide a feedback to the PCM. Then the PCM redoes the cal-
culation and adjusts the vacuum until the desired level of EGR is achieved. The
EGR control process indicates that the subsystems are connected in logic series,
and failure of any subsystem causes the entire system to fail. The design FMEA
is performed at subsystem level by following SAE J1739. Figure 6.4 shows a
part of the FMEA as an illustration; it is not intended to exhaust all elements of
the analysis.
[Figure 6.3 Schematic of the EGR system, showing the PCM, EVR, EGR valve, DPFE sensor, EGR tube, intake manifold, fresh air, and exhaust gas]
[Figure 6.4 Design FMEA of the EGR system (partial); legible entries include "EVR fails to deliver vacuum" with RPN 135 and recommended actions such as design review, CAE circuit analysis, and accelerated testing of the EVR assembly]
[Figure: histogram of the RPN values (count versus RPN)]
RPN Range      Count    Percentage (%)
[1, 200]         67         55.8
(200, 400]       26         21.7
(400, 600]       17         14.2
(600, 800]        7          5.8
(800, 1000]       3          2.5
S    O    D
4    7    9
4    9    7
6    6    7
6    7    6
7    4    9
7    6    6
7    9    4
9    4    7
9    7    4
modes. From the practical point of view, however, the failure mode resulting in
the combination (S = 9, O = 7, D = 4) may be more serious than that yielding
(S = 6, O = 6, D = 7) even though both form the same RPN value of 252.
Therefore, it is suggested that the product of S and O be used to further assess
the risk when RPN values tie or are close.
Misspecifying the value of one of the three factors S, O, D has a large effect
on the RPN value, especially when the other two factors have high ratings. Let
S0 , O0 , and D0 be the true values of S, O, and D, respectively. The corresponding
RPN value is RPN0 . Without loss of generality, assume that D is overestimated
by 1 and that S and O take the true values. Then the RPN value is
RPN = S0 × O0 × (D0 + 1).
The increase in RPN value due to the overestimation is
RPN − RPN0 = S0 × O0 .
It is seen that the increase (decrease) in RPN value due to overestimating
(underestimating) the rating of one factor by 1 equals the product of the ratings
of the other two factors. Such change is especially significant when the other two
factors have large ratings. An example follows. If S0 = 9, O0 = 9, and D0 = 5,
RPN0 = 405. The D is overestimated by 1, and the RPN value increases to 486.
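The sensitivity just described is easy to see numerically; a two-line sketch:

```python
def rpn(S, O, D):
    """Risk priority number per (6.1)."""
    return S * O * D

S0, O0, D0 = 9, 9, 5
print(rpn(S0, O0, D0))        # 405
print(rpn(S0, O0, D0 + 1))    # 486: overestimating D by 1 adds S0 * O0 = 81
```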
In summary, the RPN technique for prioritizing failure modes has some defi-
ciencies. Various attempts have been made to modify the prioritization technique
by using fuzzy theory (see, e.g., Bowles and Pelaez, 1995; Franceschini and
Galetto, 2001; Pillay and Wang, 2003). The methods proposed result in more
objective and robust prioritization. But the complicated mathematical operations
required by these methods restrict their application in industry.
needed system protection. Analysis at this level requires the existence of software
design and an expression of that design, at least in pseudocode. To perform the
analysis, we determine the failure modes for each variable and algorithm imple-
mented in each software element. The possible types of failure modes depend
on the type of variables. Table 6.7 lists typical failure modes of software vari-
ables (Goddard, 1993). The effects of a failure are traced through the code to the
system outcomes. If a failure produces a system hazard, the system architecture,
algorithms, and codes should be reviewed to assess if safety requirements have
been fully implemented. If missing requirements are discovered, design changes
must be recommended.
Unlike hardware FMEA, research on and application of software FMEA are
very limited. Although software FMEA is recommended for evaluating critical
systems in some standards, such as IEC 61508 (IEC, 1998, 2000) and SAE
ARP 5580 (SAE, 2000), there are no industrial standards or generally accepted
processes for performing software FMEA. The current status remains at the
homemade stage; the processes, techniques, and formats of software FMEA vary
from user to user.
1. Define the system, assumptions, and failure criteria. The interactions between the system and neighbors, including the human interface, should be
fully analyzed to take account of all potential failure causes in the FTA.
For this purpose, the boundary diagram described in Chapter 5 is helpful.
2. Understand the hierarchical structure of the system and functional relation-
ships between subsystems and components. A block diagram representing
the system function may be instrumental for this purpose.
3. Identify and prioritize the top-level fault events of the system. When FTA
is performed in conjunction with FMEA, the top events should include
failure modes that have high severity values. A separate fault tree is needed
for a selected top event.
4. Construct a fault tree for the selected top event using the symbols and
logic described in Section 6.4.2. Identify all possible causes leading to
the occurrence of the top event. These causes can be considered as the
intermediate effects.
5. List all possible causes that can result in the intermediate effects and
expand the fault tree accordingly. Continue the identification of all possi-
ble causes at a lower level until all possible root causes are determined.
6. Once the fault tree is completed, analyze it to understand the cause-and-
effect logic and interrelationships among the fault paths.
7. Identify all single failures and prioritize cut sets (Section 6.4.3) by the
likelihood of occurrence.
8. If quantitative information is needed, calculate the probability of the top
event to occur.
9. Determine whether corrective actions are required. If necessary, develop
measures to eradicate fault paths or to minimize the probability of fault
occurrence.
10. Document the analysis and then follow up to ensure that the corrective
actions proposed have been implemented. Update the analysis whenever
a design change takes place.
ž AND gate. An output event is produced if all the input events occur simul-
taneously.
ž OR gate. An output event is produced if any one or more of the input events
occurs.
ž INHIBIT gate. Input produces output only when a certain condition is sat-
isfied. It is used in a pair with the conditional event symbol. An INHIBIT
gate is a special type of AND gate.
ž EXCLUSIVE OR gate. Input events cause an output event if only one of
the input events occurs. The output event will not occur if more than one
input event occurs. This gate can be replaced with the combination of AND
gates and OR gates.
ž VOTING gate. Input events produce an output event if at least k of n input
events occur.
Example 6.2 Refer to Example 6.1. Develop a fault tree for the top event that
no EGR flows into the intake manifold.
SOLUTION The fault tree for the top event is shown in Figure 6.6, where the
gates are alphabetized and the basic events are numbered for the convenience
[FIGURE 6.6 Fault tree for the top event of no EGR flow to the intake manifold; the tree contains gates A to D and the events contamination buildup (X1), EVR fails to deliver vacuum, EGR valve diaphragm breaks (X2), EGR tube icing, water vaporization (X3), and low temperature (X4)]
of future reference. As shown in Figure 6.6, the intermediate event “EGR valve
stuck closed” is caused either by the “EGR valve diaphragm breaks” or by the
“EVR fails to deliver vacuum.” It is worth noting that these two causes have been
identified in the FMEA of the EGR system, as given in Figure 6.4. In general,
FMEA results can facilitate the development of fault trees.
SOLUTION The false MIL is the top event. The fault tree for this top event is
shown in Figure 6.7. An INHIBIT gate is used here to describe the logic that a
false MIL occurs only on vehicles without an EGR component failure, where the
logic agrees with the definition of type I error. In the fault tree, only the event
“MIL criteria falsely satisfied” is fully expanded to the lowest level, because the
causes of this event are of most interest. The fault tree indicates that the false
MIL is caused by software algorithm problems coupled with sensor error.
[Figure 6.7 Fault tree for the false-MIL top event; the events shown include "MIL on" and "no relevant failure occurs"]
The minimal cut sets of a fault tree can provide insightful information about
the potential weak points of a complex system, even when it is not possible
to compute the probability of either the cut sets or the top event. The failure
probabilities of different basic components in the same system are usually in
the same order of magnitude. Thus, the failure probability of a minimal cut set
decreases in the order of magnitude as the size of the minimal cut set increases.
With this observation, we can analyze the importance of the minimal cut sets by
prioritizing them according to their size. Loosely, the smaller the size, the more
important the minimal cut set. A single-event minimal cut set always has the
highest importance because the single-point failure will result in the occurrence
of the top event. The importance is followed by that of double-event cut sets, then
triple-event cut sets, and so on. The prioritization of minimal cut sets directs the
area of design improvement and provides clues to developing corrective actions.
Another application of minimal cut sets is common cause analysis. A common
cause is a condition or event that causes multiple basic events to occur. For
example, fire is a common cause of equipment failures in a plant. In a qualitative
analysis, all potential common causes are listed, and the susceptibility of each
basic event is assessed to each common cause. The number of vulnerable basic
events in a minimal cut set determines the relative importance of the cut set.
If a minimal cut set contains two or more basic events that are susceptible to
the same common cause failure, these basic events are treated as one event,
and the size of the minimal cut set should be reduced accordingly. Then the
importance of minimal cut sets should be reevaluated according to the reduced
size of the cut sets. Furthermore, the analysis should result in recommended
actions that minimize the occurrence of common causes and protect basic events
from common cause failures.
[FIGURE 6.8 (a) AND gate fault tree with two basic events; (b) parallel reliability block diagram equivalent to part (a)]
[FIGURE 6.9 (a) OR gate fault tree with two basic events; (b) series reliability block diagram equivalent to part (a)]
An OR gate, in contrast, converts to a series reliability block diagram, because the top event occurs when one of the basic events occurs. For example, Figure 6.9a shows an
OR gate fault tree containing two basic events, and Figure 6.9b illustrates the
corresponding series reliability block diagram. Suppose that the failure probabil-
ities of basic events 1 and 2 are p1 = 0.05 and p2 = 0.1, respectively. Then the
reliability of the top event is
R = (1 − p1 )(1 − p2 ) = (1 − 0.05)(1 − 0.1) = 0.855.
The conversion of a fault tree to a reliability block diagram usually starts
from the bottom of the tree. The basic events under the same gate at the lowest
level of the tree form a block depending on the type of the gate. The block is
treated as a single event under the next high-level gate. The block, along with
other basic events, generates an expanded block. This expanded block is again
considered as a single event, and conversion continues until an intermediate event
under a gate is seen. Then the intermediate event is converted to a block by the
same process. The block and existing blocks, as well as the basic events, are put
together according to the type of the gate. The process is repeated until the top
gate is converted.
Example 6.4 Convert the fault tree in Figure 6.6 to a reliability block diagram.
Suppose that the EVR never fails to deliver vacuum and that the probabilities
of basic events X1 to X4 are p1 = 0.02, p2 = 0.05, p3 = 0.01, and p4 = 0.1,
respectively. Calculate the probability that no EGR flows into the intake manifold.
SOLUTION Conversion of the fault tree starts from gate D, which is an AND
gate. Basic events X3 and X4 form a parallel block. This block, thought of as
a single component, connects with basic event X1 in series because the next
high-level gate (B) is an OR gate. When the conversion moves up to gate A, an
intermediate event “EGR valve stuck closed” is encountered. The intermediate
event requires a separate conversion. In this particular case, the EVR is assumed
to be 100% reliable and basic event X2 is fully responsible for the intermediate
event. Because gate A is an OR gate, X2 connects in series with the block
converted from gate B. The complete block diagram is shown in Figure 6.10.
The probability that no EGR flows into the intake manifold is
Pr(T) = 1 − (1 − p1)(1 − p2)(1 − p3 p4) = 1 − (1 − 0.02)(1 − 0.05)(1 − 0.01 × 0.1) = 0.070.
[FIGURE 6.10 Reliability block diagram equivalent to the fault tree in Figure 6.6]
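The block-diagram calculation can be sketched as follows, under the assumptions stated in the solution (the EVR perfectly reliable, independent basic events); this is our sketch, not the book's code.

```python
# X3 and X4 form a parallel block (both must occur for EGR tube icing),
# which is in series with X1 (contamination buildup) and X2 (diaphragm breaks).
p1, p2, p3, p4 = 0.02, 0.05, 0.01, 0.1

p_icing = p3 * p4                                 # parallel block fails
R_system = (1 - p1) * (1 - p2) * (1 - p_icing)    # series of the three blocks
print(round(1 - R_system, 4))                     # about 0.070: no EGR flow
```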
ž Law of absorption: X · (X + Y ) = X; X + X · Y = X.
ž Distributive law: X · (Y + Z) = X · Y + X · Z; X + Y · Z = (X + Y) · (X + Z).
Given a fault tree, the corresponding Boolean expression can be worked out
through a top-down or bottom-up process. In the top-down process, we begin at
the top event and work our way down through the levels of the tree, translating
the gates into Boolean equations. The bottom-up process starts at the bottom
level and proceeds upward to the top event, replacing the gates with Boolean
expressions. By either process, the equations for all gates are combined and
reduced to a single equation. The equation is further simplified using Boolean
algebra rules and thus is written as a union of minimal cut sets. Here we illustrate
the bottom-up process with the following example.
Example 6.5 Determine the minimal cut sets of the fault tree in Figure 6.11
using Boolean algebra.
[Figure 6.11 Example fault tree with top event T, gates E1 to E5, and basic events X1 to X5 (X1 and X3 repeated)]
SOLUTION Starting at the bottom of the tree, the AND gate E5 is translated into
E5 = X4 · X5.    (6.2)
The gates at the next level up give
E3 = X3 + E5,    (6.3)
E4 = X2 · X3.    (6.4)
Substituting (6.2) into (6.3) yields
E3 = X3 + X4 · X5.    (6.5)
The gates feeding the top event are
E1 = X1 + E4,    (6.6)
E2 = X1 + E3.    (6.7)
Substituting (6.4) into (6.6) and (6.5) into (6.7) gives
E1 = X1 + X2 · X3,    (6.8)
E2 = X1 + X3 + X4 · X5.    (6.9)
We proceed to the top event of the tree and translate the OR gate into
T = E1 + E2.    (6.10)
Substituting (6.8) and (6.9) into (6.10) gives
T = X1 + X2 · X3 + X1 + X3 + X4 · X5.    (6.11)
Equation (6.11) is logically equivalent to the fault tree in Figure 6.11. Each
term in (6.11) is a cut set that will lead to occurrence of the top event. However,
the cut sets are not minimal. Reducing the expression by the rules of Boolean
algebra results in
T = X1 + X3 + X4 · X5 . (6.12)
Now X1, X3, and X4 · X5 are the minimal cut sets of the fault tree. Equation (6.12) indicates that the top event can be expressed as the union of three minimal cut sets.
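The Boolean reduction can also be checked mechanically. A small sketch using SymPy's logic module (an outside tool, not part of the book's procedure):

```python
from sympy import symbols
from sympy.logic.boolalg import And, Or, simplify_logic

X1, X2, X3, X4, X5 = symbols("X1 X2 X3 X4 X5")

# Cut sets read off the tree, as in (6.11): X1 + X2*X3 + X1 + X3 + X4*X5
T = Or(X1, And(X2, X3), X1, X3, And(X4, X5))

# Reduction by the Boolean algebra rules yields the minimal cut sets of (6.12)
print(simplify_logic(T, form="dnf"))   # X1 | X3 | (X4 & X5)
```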
In general, a top event can be written as the union of a finite number of
minimal cut sets. Mathematically, we have
T = C1 + C2 + · · · + Cn , (6.13)
where Ci is the ith minimal cut set and n is the number of minimal cut sets. By the inclusion–exclusion rule, the probability of the top event is
Pr(T) = \sum_{i=1}^{n} Pr(Ci) − \sum_{i<j} Pr(Ci · Cj) + \sum_{i<j<k} Pr(Ci · Cj · Ck) − · · · + (−1)^{n−1} Pr(C1 · C2 · · · Cn).    (6.15)
1. Determine the probabilities of basic events. This step usually requires the
use of different data sources, such as accelerated test data, field and war-
ranty data, historical data, and benchmarking analysis.
2. Compute the probabilities of all minimal cut sets contributing to the top
event. Essentially, this step is to compute the probability of the intersection
of basic events.
3. Calculate the probability of the top event by evaluating (6.15).
The first step is discussed in great detail in other chapters of the book. Now
we focus on the second and third steps.
If a minimal cut set, say C, consists of an intersection of m basic events, say
X1 , X2 , . . . , Xm , the probability of the minimal cut set is
Pr(C) = Pr(X1 · X2 · · · Xm ). (6.16)
If the m basic events are independent, (6.16) simplifies to
Pr(C) = Pr(X1 ) · Pr(X2 ) · · · Pr(Xm ), (6.17)
where Pr(Xi ) is the probability of basic event Xi . In many situations, the assump-
tion of independence is valid because the failure of one component usually does
not depend on the failure of other components in the system unless the compo-
nents are subject to a common failure. If they are dependent, other methods, such
as the Markov model, may be used (see, e.g., Henley and Kumamoto, 1992).
Before the probability of the top event can be calculated, the probabilities of
the intersections of minimal cut sets in (6.15) should be determined. As a special
case, if the n minimal cut sets are mutually exclusive, (6.15) reduces to
Pr(T) = \sum_{i=1}^{n} Pr(Ci).    (6.18)
If the minimal cut sets are not mutually exclusive but independent, the prob-
ability of the intersection of the minimal cut sets can be written as the product
of the probabilities of individual minimal cut sets. For example, the probability
of the intersection of two minimal cut sets C1 and C2 is
Pr(C1 · C2) = Pr(C1) Pr(C2).    (6.19)
In many situations, the minimal cut sets of a system are dependent because the
sets may contain one or more common basic events. Nevertheless, the probability
of the intersection of minimal cut sets may still be expressed in a simplified
form if the basic events are independent. For example, if X1 , X2 , . . . , Xk are the
independent basic events that appear in minimal cut sets C1 , C2 , or both, the
probability of the intersection of C1 and C2 can be written as
Pr(C1 · C2) = Pr(X1) Pr(X2) · · · Pr(Xk).    (6.20)
The validity of (6.20) may be illustrated with the following example. Suppose
that C1 = X1 · X2 · · · Xi and C2 = Xi · Xi+1 · · · Xk . C1 and C2 are dependent
because both contain a common basic event Xi . The intersection of C1 and C2
can be written as
C1 · C2 = X1 · X2 · · · Xi · Xi · Xi+1 · · · Xk.
Because Xi · Xi = Xi, this reduces to X1 · X2 · · · Xk, whose probability under independence is the product given in (6.20).
Once the probabilities of the intersections of minimal cut sets are calcu-
lated, (6.15) is ready for evaluating the probability of the top event. The evalu-
ation process appears simple under the assumption of independent basic events,
but it is tedious when a top event consists of a large number of minimal cut sets.
Because a high-order intersection usually has a low probability, the third and
higher terms in (6.15) may be omitted in practice.
SOLUTION The top event of the fault tree has been formulated in (6.12) as
the union of the minimal cut sets. Then the probability of the top event is
Pr(T) = Pr(X1 + X3 + X4 · X5)
      = Pr(X1) + Pr(X3) + Pr(X4 · X5) − Pr(X1 · X3) − Pr(X1 · X4 · X5) − Pr(X3 · X4 · X5) + Pr(X1 · X3 · X4 · X5)
      = p1 + p3 + p4 p5 − p1 p3 − p1 p4 p5 − p3 p4 p5 + p1 p3 p4 p5
      = 0.01 + 0.005 + 0.003 × 0.008 − 0.01 × 0.005 − 0.01 × 0.003 × 0.008 − 0.005 × 0.003 × 0.008 + 0.01 × 0.005 × 0.003 × 0.008 = 0.015.
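The same inclusion–exclusion calculation can be automated for any list of minimal cut sets, assuming independent basic events; the helper below is ours.

```python
from itertools import combinations

# Minimal cut sets from (6.12) and the basic-event probabilities of this example
cut_sets = [frozenset(["X1"]), frozenset(["X3"]), frozenset(["X4", "X5"])]
p = {"X1": 0.01, "X3": 0.005, "X4": 0.003, "X5": 0.008}

def prob_of_top_event(cut_sets, p):
    """Inclusion-exclusion over minimal cut sets, per (6.15); the probability of an
    intersection is the product over the distinct basic events it contains."""
    total = 0.0
    for r in range(1, len(cut_sets) + 1):
        for combo in combinations(cut_sets, r):
            events = frozenset().union(*combo)
            term = 1.0
            for e in events:
                term *= p[e]
            total += (-1) ** (r + 1) * term
    return total

print(round(prob_of_top_event(cut_sets, p), 4))   # about 0.015
```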
diagram (BDD) has been developed and used (see, e.g., Rauzy, 1993; Bouissou, 1996; Sinnamon and Andrews, 1996, 1997a,b; and Dugan, 2003). The BDD
method does not require minimal cut sets for quantitative analysis and can be
more efficient and accurate in probability computation.
A BDD is a directed acyclic graph representing a Boolean function. All paths
through a BDD terminate in one of two states: a 1 state or a 0 state, with 1
representing system failure (occurrence of the top event) and 0 corresponding to
system success (nonoccurrence of the top event). All paths terminating in a 1 state
form a cut set of the fault tree. A BDD consists of a root vertex, internal vertices
and terminal vertices, which are connected by branches. Sometimes branches are
called edges. Terminal vertices end with the value 0 or 1, while internal vertices
represent the corresponding basic events. The root vertex, the top internal vertex,
always has two branches. Branches (edges) are assigned a value 0 or 1, where
0 corresponds to the basic event nonoccurrence and 1 indicates occurrence of a
basic event. All left-hand branches leaving each vertex are assigned the value 1
and called 1 branches; all right-hand branches are given the value 0 and called
0 branches. Figure 6.12 shows an example BDD in which Xi is the basic event.
The cut sets can be found from a BDD. First we select a terminal 1 vertex and
proceed upward through the internal vertices to the root vertex. All alternative
paths that start from the same terminal 1 vertex and lead to the root vertex
should be identified. A cut set is formed by the 1 branches of each path. The
process is repeated for other terminal 1 vertices, and the corresponding cut sets
are determined in the same way. In the example shown in Figure 6.12, starting
from terminal 1 vertex of X4 produces two cut sets: X4 · X3 and X4 · X3 · X1 .
Originating from terminal 1 vertex of X3 yields only one cut set: X3 · X2 · X1 .
Thus, the BDD of Figure 6.12 has three cut sets.
A BDD is converted from a fault tree through the use of an if–then–else
structure. The structure is denoted ite(X, f1, f2), which means: if X fails, consider f1; otherwise, consider f2.
[Figure 6.12 Example BDD with root vertex X1, internal vertices X2, X3, and X4, branches labeled 1 and 0, and terminal 1 and 0 vertices]
Example 6.7 Figure 6.13 shows a fault tree. Construct a BDD for this fault tree.
SOLUTION First we give the basic events an arbitrary ordering: X1 < X2 <
X3 . The intermediate events E1 and E2 are written in terms of ite structure as
E1 = X1 + X3 = ite(X1 , 1, 0) + ite(X3 , 1, 0) = ite(X1 , 1, ite(X3 , 1, 0)),
E2 = X3 + X2 = ite(X3 , 1, 0) + ite(X2 , 1, 0) = ite(X3 , 1, ite(X2 , 1, 0)).
The top event can be expressed as
T = E1 · E2 = ite(X1 , 1, ite(X3 , 1, 0)) · ite(X3 , 1, ite(X2 , 1, 0))
= ite(X1 , ite(X3 , 1, ite(X2 , 1, 0)), ite(X3 , 1, 0) · ite(X3 , 1, ite(X2 , 1, 0)))
= ite(X1 , ite(X3 , 1, ite(X2 , 1, 0)), ite(X3 , 1, 0)). (6.24)
Based on (6.24), a BDD can be constructed as shown in Figure 6.14. The cut
sets of the BDD are X3 · X1 , X2 · X1 , and X3 .
[FIGURE 6.13 Example fault tree with repeated event X3. (From Sinnamon and Andrews, 1997b.)]
[Figure 6.14 BDD constructed from (6.24), with vertices X1, X3, X3, and X2]
The probability of a top event can be evaluated from the corresponding BDD.
First we find the disjoint paths through the BDD. This is done using the method
for determining cut sets and including in a path the basic events that lie on a
0 branch. Such basic events are denoted by X̄i, meaning that Xi does not occur. The probability of the top event is equal to the probability of the sum of the disjoint paths through the BDD. In Example 6.7, the disjoint paths are X3 · X1, X2 · X̄3 · X1, and X3 · X̄1. The probability of the top event can be written as
Pr(T) = Pr(X3 · X1 + X2 · X̄3 · X1 + X3 · X̄1)
      = Pr(X3 · X1) + Pr(X2 · X̄3 · X1) + Pr(X3 · X̄1)
      = p1 p3 + p1 p2 (1 − p3) + p3 (1 − p1) = p1 p2 (1 − p3) + p3,    (6.25)
where pi is the probability of basic event Xi.
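Equation (6.25) can be checked by brute-force enumeration over the independent basic events. The probabilities below are illustrative; the code is our sketch.

```python
from itertools import product

p = {"X1": 0.01, "X2": 0.02, "X3": 0.03}   # illustrative basic-event probabilities

def top_event(x1, x2, x3):
    """Top event of the fault tree in Figure 6.13: T = (X1 + X3)(X3 + X2)."""
    return (x1 or x3) and (x3 or x2)

# Exact probability by enumerating all joint states of the basic events
exact = 0.0
for x1, x2, x3 in product([0, 1], repeat=3):
    pr = 1.0
    for name, x in zip(("X1", "X2", "X3"), (x1, x2, x3)):
        pr *= p[name] if x else 1 - p[name]
    if top_event(x1, x2, x3):
        exact += pr

# Closed form from the disjoint BDD paths, equation (6.25)
p1, p2, p3 = p["X1"], p["X2"], p["X3"]
closed = p1 * p2 * (1 - p3) + p3
print(round(exact, 6), round(closed, 6))   # the two values agree
```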
ž COLD SPARE (CSP) gate. An output event occurs when the primary com-
ponent and all cold spare units have failed, where the primary component is
the one that is initially powered on, and the cold spare units are those used
as replacements for the primary component. Cold spare units may have zero
failure rates before being switched into active operation.
ž WARM SPARE (WSP) gate. An output event occurs when the primary com-
ponent and all warm spare units have failed. Warm spare units may have
reduced failure rates before being switched into active operation.
ž HOT SPARE (HSP) gate. An output event occurs when the primary compo-
nent and all hot spare units have failed. Hot spare units may have the same
failure rates before and after being switched into active operation.
ž FUNCTIONAL DEPENDENCE (FDEP) gate. The gate has a single trigger
event (either a basic event or the output of another gate of the tree) and one
or more dependent basic events. The dependent basic events are forced to
occur when the trigger event occurs. The output reflects the status of the
trigger event.
ž SEQUENCE ENFORCING (SENF) gate. The input events are forced to
occur in the left-to-right order in which they appear under the gate.
ž PRIORITY AND (PAND) gate. An output event occurs if all input events
occur in order. It is logically equivalent to an AND gate, with the added
condition that the events must take place in a specific order.
A fault tree may be comprised of both static and dynamic subtrees. Static
subtrees are solved using the methods described earlier in this chapter, while
dynamic parts are usually converted to Markov models and worked out by using
Markov methods (Manian et al., 1999).
SOLUTION Modeling the system failure needs dynamic gates, which in par-
ticular describe the cold standby spare and functional dependency of the pumps
on the valves and filters. The fault tree is shown in Figure 6.15.
[Figure 6.15 Dynamic fault tree for the system; FDEP gates model the pump 1 and pump 2 dependencies, and the events include pump 1 stream fails, pump 2 stream fails, valve 1 fails, valve 2 fails, filter 1 fails, and filter 2 fails]
integral part of the design process, design analysis should be performed to deter-
mine stress distribution and reveal overstressing conditions that will result in
premature failures or other concerns.
As mentioned above, mechanical stress analysis requires use of the finite ele-
ment method. In FEA modeling, the boundary conditions, forces/loads, material
properties, and perhaps manufacturing process are integrated into the finite ele-
ment model for the calculation of stress and deformation. Detailed description of
theory and application of the mechanical stress analysis using FEA can be found
in, for example, Adams and Askenazi (1998). The analysis is usually performed
using a commercial software package such as Hypermesh, Ansys, or Cosmos.
As a result of the analysis, potential unacceptable stress conditions and deforma-
tions can be discovered. Then design changes should be recommended to address
these concerns. For example, an engine component was analyzed using FEA to
identify excessive deformation and overstressing problems. The FEA model and
stress distribution for the component are shown in Figure 6.16. The analysis indi-
cates an overstress area on the top of the component. This finding resulted in
corrective actions that were taken before the design was prototyped.
FIGURE 6.16 FEA model and stress distribution of the engine component, showing the overstress area on top of the component
As in mechanical stress analysis, thermal analysis software can generate FEA models automatically based
on the dimensions, geometry, and layers of the board. Using the models, the
software then calculates the temperature distribution with the defined boundary
conditions, electrical loads, board and component thermal properties, packaging
method, and the ambient temperature. The resulting temperature distribution indi-
cates hot regions. The components within these hot regions should be checked
for functionality and reliability. If the temperature of a hot region raises concerns,
design changes are required, including, for example, the use of a heat sink, repositioning
of components, and modification of the circuitry. Sergent and Krum (1998)
describe in detail the thermal analysis of electronic assemblies, including PCB.
Let’s look at an example. Right after the schematic design and PCB lay-
out of an automobile body control board were completed, a thermal analysis
was performed to examine the design for potential deficiencies. The temperature
distribution on the top layer of the board is shown in Figure 6.17, where the rect-
angles and ovals on the board represent electronic components. The hot region
on the board was found to coincide with the area where two resistors were popu-
lated. Even though the temperature of the hot region generated no major concerns
on the current-carrying capability of copper trace and solder joint integrity, the
high case temperatures of these two resistors would reduce long-term reliability
in the field. Therefore, design changes were enforced to lower the temperature.
FIGURE 6.17 Temperature distribution on the top layer of the body control board, showing the hot region where the two resistors are located
Failure modes caused by vibration may include cracking, fatigue, loose connections, and others. See Chapter 7 for
more discussion of vibration. It is vital to determine product behavior in the
presence of vibration. This task can be accomplished by performing a vibration
analysis.
Vibration analysis is commonly based on FEA and employs a commercial
software package such as MATLAB, MathCAD, or CARMA. The analysis cal-
culates natural frequencies and displacements with inputs of boundary conditions
and vibration environment. The method of mounting defines the boundary condi-
tions, and the type of vibration (sinusoidal or random vibration) and its severity
specify the vibration environment.
Once the natural frequencies and displacements are computed, further analy-
ses can be performed. For example, maximum displacement should be checked
against minimum clearance to prevent any potential mechanical interference. The
first natural frequency is often used to calculate the stress caused by the vibra-
tion and to determine the fatigue life. A low natural frequency indicates high
stress and displacement. If problems are detected, corrective actions should be
taken to increase the natural frequency. Such measures may include use of a rib,
modification of the mounting method, and others.
Vibration analysis is frequently performed on PCB design to uncover potential
problems such as PCB and solder joint cracking and low fatigue life. For example,
for the automobile body control board discussed in Section 6.6.2, Figure 6.18
shows the FEA-based vibration analysis results, including the first three natu-
ral frequencies and the first fundamental mode shape of vibration. The board
was simply restrained at four edges and subjected to random vibration. Further
calculations on bending stress and fatigue life indicated no concerns due to the
vibration condition specified. Detailed vibration analysis for electronic equipment
is described in, for example, Steinberg (2000).
FIGURE 6.18 Vibration analysis results for an automobile body control PCB
PROBLEMS
6.1 Explain the processes by which design FMEA and FTA detect design mis-
takes. Can human errors be discovered through the use of FMEA or FTA?
6.2 Explain the correlations and differences between the following terms used
in design FMEA and FTA:
(a) Failure mode and top event.
(b) Failure mechanism/cause and basic event.
(c) Failure effect and intermediate event.
(d) Occurrence and failure probability.
6.3 Perform a design FMEA on a product of your choice using the worksheet
of Figure 6.2 and answer the following questions:
(a) What are the top three concerns by RPN?
(b) What are the top three concerns by S × O? Is the result the same as that
by RPN? Is S × O a more meaningful index than RPN in your case?
(c) Construct a fault tree for the failure mode with the highest severity. Does
the fault tree provide more insights about how the failure mode occurs?
6.4 What are the impacts of the following actions on severity, occurrence, and
detection rankings?
(a) Add a new test method.
(b) Implement a design control before prototypes are built.
(c) Take a failure prevention measure in design.
6.5 Describe the purposes of qualitative and quantitative analyses in FTA. What
are the roles of minimal cut sets?
6.6 Construct a fault tree for the circuit shown in Figure 6.19, where the top
event is “blackout.” Convert the fault tree to a reliability block diagram.
6.7 Figure 6.20 depicts a simplified four-cylinder automobile engine system. The
throttle controls the amount of air flowing to the intake manifold. While the
engine is at idle, the throttle is closed and a small amount of air is bypassed
through the inlet air control solenoid to the manifold to prevent engine
FIGURE 6.19 Two-bulb lighting circuit (power supply, switch, fuse, wire, bulb 1, and bulb 2)
FIGURE 6.20 Simplified automotive engine system (pedal, throttle, inlet air control solenoid, intake manifold, injectors, and spark)
stalling. Fuel is mixed with air, injected into each cylinder, and ignited by
spark. Suppose that the failure probabilities of the throttle, solenoid, sparks,
and fuel injectors are 0.001, 0.003, 0.01, and 0.008, respectively. For the top
event “Engine stalls while vehicle is at idle,” complete the following tasks:
(a) Construct a fault tree for the top event.
(b) Determine the minimal cut sets.
(c) Evaluate the probability of the top event.
(d) Convert the fault tree to a reliability block diagram and calculate the top
event probability.
6.8 Refer to Problem 6.7. If the top event is “Engine stalls while vehicle is
moving,” complete the following tasks:
(a) Construct a fault tree for the top event.
(b) Determine the minimal cut sets.
(c) Evaluate the probability of the top event.
(d) Convert the fault tree to a BDD and evaluate the top event probability.
7
ACCELERATED LIFE TESTS
7.1 INTRODUCTION
critical failure modes observed in the field. In short, ALTs are essential to all
effective reliability programs, owing to their irreplaceable role in improving
and estimating reliability. The author has been present at five consecutive Annual
Reliability and Maintainability Symposia and noticed that the sessions on ALT
topics were far more heavily attended than any other concurrent sessions.
An ALT can be (1) qualitative or (2) quantitative, depending on the purpose of
the test. A qualitative test is usually designed and conducted to generate failures
as quickly as possible in the design and development phase. Subsequent failure
analyses and corrective actions lead to the improvement of reliability. This type
of test, also known as highly accelerated life testing (HALT), is discussed in
Section 7.9. Other sections of the chapter are dedicated to quantitative tests,
which are aimed at estimating product life distribution: in particular, percentiles
and the probability of failure (i.e., the population fraction failing).
In the design and development phase, ALTs may be performed for one or more of the following purposes:
1. Compare and assess the reliability of materials and components. Such tests
take place in the early stage of the design and development phase to select the
appropriate vendors of the materials and components.
2. Determine optimal design alternatives. Design engineers often develop
multiple design alternatives at a low level of the product hierarchical struc-
ture. Prototypes at this level are functionally operable and inexpensive. ALTs
are conducted to evaluate the reliability performance of each design alternative,
on which the selection of the best candidate may be based. ALTs performed in
robust reliability design have such purpose.
3. Confirm the effectiveness of a design change. In designing a new product,
design changes are nearly inevitable during the design and development phase.
Even for a carryover design, some fixes are often necessary. The changes must
be verified as early as possible. ALTs are needed for this purpose.
4. Evaluate the relationship between reliability and stress. Sometimes ALTs
are performed to assess the sensitivity of reliability to certain stresses. The result-
ing information is used to improve the robustness of the design and/or to specify
the limit of use condition.
5. Discover potential failure modes. A test serving this purpose is important
for a new product. Critical failure modes, which can cause severe effects such as
safety hazards, must be eradicated or mitigated in the design and development
phase.
Production at full capacity may begin after the design passes PV testing. ALTs
may be required in the production phase for the following purposes:
1. Identify the special causes for a statistically significant process shift. Statis-
tical process control tools detect such a shift and trigger a series of investigations,
which can include ALTs to find causes of a change in failure mode or life distri-
bution due to process variation.
2. Duplicate the critical failure modes observed in the field for determination
of the failure mechanisms.
3. Acceptance sampling. ALTs may be performed to decide if a particular lot
should be stopped from shipping to customers.
FIGURE 7.1 Classification of acceleration methods: overstressing (constant, step, progressive, cyclic, or random stress), increasing the usage rate (increasing speed, reducing off time), tightening the failure threshold, and changing the level of a control factor
FIGURE 7.2 Life distributions at two levels (high and low) of constant stress
FIGURE 7.3 Two step-stress loading patterns
the other group survives longer. This test method is most common in practice
because of the simplicity of stress application and data analysis.
In step-stress testing, units are subjected to a stress level held constant for a
specified period of time, at the end of which, if some units survive, the stress
level is increased and held constant for another specified period. This process
is continued until a predetermined number of units fail or until a predetermined
test time is reached. When a test uses only two steps, the test is called a simple
step-stress test. Figure 7.3 shows two- and multiple-step loading patterns. A step-
stress test yields failures in a shorter time than does a constant-stress test. Thus,
it is an effective test method for discovering failure modes of highly reliable
products. However, models for the effect of step stressing are not well developed
and may result in inaccurate conclusions. Nelson (1990, 2004) and Pham (2003)
describe the test method, data analysis, and examples.
In progressive stress testing, the stress level is increased constantly (usually,
linearly) until a predetermined number of test units fail or until a predetermined
test time is reached. The stress loading method is shown in Figure 7.4. The
slopes of the straight lines are the rates at which the stress levels are increased
and represent the severity of the stress. The higher the rate, the shorter the times
to failure. Like step-stress testing, this method is effective in yielding failures
but imposes difficulties in modeling the data. Nelson (1990, 2004) presents the
test method, data analysis, and examples.
In cyclic stress loading, the stress level is changed according to a fixed cyclic
pattern. Common examples of such stress are thermal cycling and sinusoidal
vibration. In contrast to the fixed amplitude of a cyclic stress, the level of a
random stress varies randomly over time, as illustrated in Figure 7.5.
FIGURE 7.4 Progressive stress loading patterns
FIGURE 7.5 Cyclic and random stress loading
Increasing the Usage Rate Usage is the amount of use of a product. It may be
expressed in miles, cycles, revolutions, pages, or other measures. Usage rate is
the frequency of a product being operated, and may be measured by hertz (Hz),
cycles per hour, revolutions per minute, miles per month, pages per minute,
or others. Many commercial products are operated intermittently. In contrast,
test units are run continuously or more frequently, to reduce the test time. For
example, most automobiles are operated less than two hours a day and may
accumulate 100 miles. In a proving ground, test vehicles may be driven eight
or more hours a day and accumulate 500 or more miles. On the other hand,
some products run at a low speed in normal use. Such products include bearings,
motors, relays, switches, and many others. In testing they are operated at a higher
speed to shorten the test time. For the two types of products, the life is usually
measured by the usage to failure, such as cycles to failure and miles to failure.
Special care should be exercised when applying the acceleration method. It
is usually assumed that usage to failure at a higher usage rate is equal to that
at the usual usage rate. The assumption does not hold in situations where an
increased usage rate results in an additional environmental, mechanical, electrical,
or chemical stress: for example, when raising the operational speed generates
a higher temperature, or reducing the off time decreases the time for heat to
dissipate. Then the equal-usage assumption may be invalid unless such effects are
eliminated by using a compensation measure such as a cooling fan. In many tests
the use of compensation is impractical. Then we must take into account the effect
on life of increased usage rate. Considering the fact that usage to failure may be
shorter or longer at a higher usage rate, G. Yang (2005) proposes an acceleration
model to quantify usage rate effects (also discussed in Section 7.4.5).
Changing the Level of a Control Factor A control factor is a design parameter
whose level can be specified by designers. We have seen in Chapter 5 that the
level of a control factor can affect the life of a product. Therefore, we can inten-
tionally choose the level of one or more control factors to shorten the life of test
units. This acceleration method requires the known effects on life of the control
factors. The known relationship between life and the level of control factors may
be developed in robust reliability design (Chapter 5). Change of dimension is a
common application of this test method. For example, a smaller-diameter shaft
is tested to determine the fatigue life of a larger shaft because the former yields
a shorter life. Large capacitors are subjected to electrical voltage stress to esti-
mate the life of small capacitors with the same design, on the assumption that
the large ones will fail sooner because of a larger dielectric area. Nelson (1990,
2004) describes a size-effect model relating the failure rate of one size to that
of another size. Bai and Yun (1996) generalize that model. A change of geom-
etry in favor of failure is another use of the acceleration method. For example,
reducing the fillet radius of a mechanical component increases the stress concen-
tration and thus shortens life. In practice, other design parameters may be used as
accelerating variables. The control factors must not, however, interact with other
accelerating stresses. Otherwise, the test results may be invalid, as described in
Chapter 5.
Tightening the Failure Threshold For some products, failure is said to have
occurred when one of its performance characteristics crosses a specified threshold.
Clearly, the life of the products is determined by the threshold. The tighter the
threshold, the shorter the life, and vice versa. Thus, we can accelerate the life by
specifying a tighter threshold. For example, a light-emitting diode at a normal
threshold of 30% degradation in luminous flux may survive 5000 hours. If the
threshold is reduced to 20%, the life may be shortened to 3000 hours. This
acceleration method requires a model that relates life to threshold and is discussed
in Chapter 8.
Acceleration Factor An important concept in ALTs is the acceleration factor,
defined as the ratio of a life percentile at stress level S to that at stress level S'.
Mathematically,

Af = tp / t'p,   (7.1)

where Af is the acceleration factor, p the specified probability of failure (i.e.,
the population fraction failing), and tp (t'p) the 100pth percentile of the life
distribution at S (S'). Often, p is chosen to be 0.5 or 0.632.
Af = exp(µ − µ'),   (7.2)
along the direction of electron movement. On the other hand, when metal atoms
are activated by the momentum exchange, they are subjected to an applied elec-
trical field opposite to the electron movement, and move against that movement.
The two movements are accelerated by high temperature and interact to determine
the direction of net mass transfer. As a result of the mass transfer, vacancies and
interstitials are created on the metal. Vacancies develop voids and microcracks,
which may cause, for example, an increased contact resistance or open circuit.
Interstitials are the exotic mass on the surface of the metal and may result in a
short circuit. In addition to temperature and electrical current density, the sus-
ceptibility to electromigration also depends on the material. Silver is the metal
most subject to this failure.
3. Creep. Creep is a gradual plastic deformation of a component exposed to
high temperature and mechanical stress, resulting in elongation of the component.
Before a component fractures, the creep process typically consists of three stages,
as shown in Figure 7.6. Initially, the transient creep occurs in the first stage,
where the creep rate (the slope of the strain–time curve) is high. Then the rate
decreases and remains approximately constant over a long period of time called
the steady-state stage (i.e., the second stage). As time proceeds, creep develops
to the third stage, where the creep rate increases rapidly and the strain becomes
so large that fracture occurs. In practice, many products fail far before creep
progresses to the third stage, due to the loss of elastic strength. For example, the
contact reeds of an electromagnetic relay are subjected to a cyclic load and high
temperature when in operation. Creep occurs to the reeds and results in stress
relaxation or loss of elastic strength, which, in turn, reduces the contact force,
increases the contact resistance, and causes failure.
4. Interdiffusion. When two different bulk materials are in intimate contact at
a surface, molecules or atoms of one material can migrate into the other, and vice
versa. Like electromigration, interdiffusion is a mass transport process which is
sensitive to temperature. When a high temperature is applied, the molecules and
atoms are thermally activated and their motion speeds up, increasing the interdif-
fusion rate. If the diffusion rates for both materials are not equal, interdiffusion
can generate voids in one of the materials and cause the product’s electrical,
chemical, and mechanical performance to deteriorate. Interdiffusion can be the
cause of various observable failure modes, such as increased electrical resistance
and fracture of material.
FIGURE 7.6 Successive stages of a creep process (strain versus time, showing stages 1, 2, and 3)
FIGURE 7.7 Thermal cycling profile (temperature versus time, showing Tmax, Tmin, the dwell times tmax and tmin, the ramp rate dT/dt, and one complete cycle)
7.3.3 Humidity
There are two types of humidity measures in use: absolute humidity and relative
humidity. Absolute humidity is the amount of water contained in a unit volume
of moist air. In scientific and engineering applications, we generally employ
relative humidity, defined as the ratio (in percent) of the amount of atmospheric
moisture present relative to the amount that would be present if the air were
saturated. Since the latter amount depends on temperature, relative humidity
is a function of both moisture content and temperature. For a given moisture
content, relative humidity increases as the temperature falls, until the dew point
is reached, below which moisture condenses onto surfaces.
Important failure modes due to moisture include short circuit and corrosion.
Corrosion is the gradual destruction of a metal or alloy caused by chemical attack
or electrochemical reaction. The primary corrosion in a humid environment is an
electrochemical process in which oxidation and reduction reactions occur simul-
taneously. When metal atoms are exposed to a damp environment, they can
yield electrons and thus become positively charged ions, provided that an elec-
trochemical cell is complete. The electrons are then consumed in the reduction
process. The reaction processes may occur locally to form pits or microcracks,
which provide sites for fatigue initiation and develop further to fatigue failure.
Corrosion occurring extensively on the surface of a component causes electri-
cal performance and mechanical strength to deteriorate. The corrosion process
is accelerated with high temperatures. This is the reason that humidity stress is
frequently used concurrently with high temperature. For example, 85/85, which
means 85°C and 85% relative humidity, is a recommended test condition in
various engineering standards.
In addition to corrosion, short circuiting is sometimes a concern for electronic
products working in a humid environment. Moisture condenses onto surfaces
when the temperature is below the dew point. Liquid water that is deposited on
a circuit may cause catastrophic failures, such as a short circuit. To minimize the
detrimental effects of humidity, most electronic products are hermetically sealed.
7.3.4 Voltage
Voltage is the difference in electrical potential between two points. When voltage
is applied between any two points, it is resisted by the dielectric strength of the
material in between. When Ohm’s law applies, the current through the material
is directly proportional to the voltage. Thus, if the material is insulation, the
current, which is negligible and sometimes called leakage current, increases with
applied voltage. If the voltage is elevated to a certain level, the insulation breaks
down and the current jumps. The failure usually occurs at weak spots or flaws
in the material, where the dielectric strength is relatively low. In general, the
higher the voltage, the shorter the insulation life. Considering this effect, voltage
is often employed as an accelerating variable for testing insulators and electronic
components such as capacitors.
For conductors and electronic components, high voltage means high current;
thus, failure modes caused by high current (which are discussed next) may be induced as well.
FIGURE 7.8 Sinusoidal vibration (displacement in mm versus time in seconds)
Two types of vibration are common: sinusoidal and random vibrations. Sinu-
soidal vibration takes place at a predominant frequency, and displacement at a
future time is predictable. This vibration is measured by the frequency (Hz) and
displacement (mm), velocity (mm/s), or acceleration (mm/s² or g). Figure 7.8
shows an example of vibration at a frequency of 0.406 Hz, where the y-axis is
the displacement. In reality, this type of vibration is usually caused by the cyclic
operation of a product. For example, automotive engine firing is a source of
sinusoidal vibration, which disturbs components under the hood. Most products
are more likely to experience the second type of vibration, random vibration.
In contrast to a sinusoidal vibration, a random vibration occurs in a wide range
of frequencies, and instantaneous displacement at a future time is unpredictable.
Figure 7.9 shows a 5-second random vibration where the y-axis is acceleration.
Because of the random nature, the vibration is described by the power spectral
density (PSD), expressed in g²/Hz. Since the PSD is a function of frequency, a
random vibration profile should specify the PSD at various values of frequency.
Figure 7.10 shows an example of such a profile, which is the vibration test con-
dition for an automobile component. Steinberg (2000), for example, discusses
mechanical vibration in detail.
In some circumstances, vibration is generated purposefully to fulfill certain
functions. For example, ultrasonic vibration welds two parts in the wire bonding
process. In most situations, vibration results in undesirable effects such as fatigue,
FIGURE 7.9 Random vibration (acceleration in g versus time in seconds)
FIGURE 7.10 Random vibration test profile (PSD in g²/Hz versus frequency in Hz)
wear, and loosening of connections. Due to the change in acceleration over time,
vibration generates a cyclic load. As discussed earlier, cyclic stressing initiates
and develops microcracks and eventually causes a fatigue failure. Vibration also
induces mechanical wear, which is the attrition of materials from the surfaces
between two mating components in relative movement. Mechanical wear can be
adhesive, abrasive, fretting, or a combination. The wear mechanisms of each type
are described in books on tribology and wear. Interested readers may consult,
for example, Stachowiak and Batchelor (2000) and Bhushan (2002). Excessive
wear, in turn, causes different apparent failure modes, including acoustic noise,
worse vibration, local overheating, leaking, and loss of machinery precision. A
loosening connection is another failure mode that is frequently observed in a
vibration environment. This failure mode can result in various effects, such as
leaking, deterioration of connection strength, and intermittent electrical contact.
The activation energy is relatively small, indicating that the setting of the design
parameters is probably not optimal.
where the notation is the same as in (7.5). Compared with the Arrhenius relationship,
the Eyring model has an additional 1/T term. Hence, it may be more
suitable when temperature has a stronger effect on the reaction rate. Despite
this advantage, it has few applications in the literature.
The acceleration factor between temperatures T and T' for the Eyring relationship is

Af = (T'/T) exp[(Ea/k)(1/T − 1/T')],   (7.9)

which indicates that the acceleration factor for the Eyring relationship is T'/T
times the acceleration factor for the Arrhenius relationship.
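A short numerical sketch comparing the two acceleration factors (the activation energy and the temperatures are assumed values):

import math

k = 8.617e-5                       # Boltzmann constant, eV/K
Ea = 0.7                           # assumed activation energy, eV
T_use, T_test = 323.15, 398.15     # assumed use and test temperatures, K

af_arrhenius = math.exp(Ea / k * (1 / T_use - 1 / T_test))
af_eyring = (T_test / T_use) * af_arrhenius    # (7.9): T'/T times the Arrhenius factor
print(af_arrhenius, af_eyring)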
where ΔT is the temperature range Tmax − Tmin, and A and B are constants characteristic
of the material properties and product design. B is usually positive. In some
applications, A may be a function of cycling variables such as the frequency and
maximum temperature, in which case the Norris–Landzberg relationship discussed
next is more appropriate. We will see later that (7.10) is a special form
of the inverse power relationship.
For the sake of data analysis, we transform (7.10) into a linear function. Taking
the natural logarithm of (7.10) gives ln(L) = ln(A) − B ln(ΔT).
The Coffin–Manson relationship is also written in a form that relates the number (N)
of cycles to failure to the strain (S). Recent applications of the model include Naderman and Rongen
(1999), Cory (2000), Sumikawa et al. (2001), Basaran et al. (2004), R. Li (2004),
and many others.
Example 7.2 Shohji et al. (2004) evaluate the reliability of chip-scale package
(CSP) solder joints by subjecting them to thermal cycling, where the solder joints
are the alloy Sn–37Pb. In the experiment, 12 thermal cycling profiles were used
as shown in Table 7.2. Under each test condition, five CSPs were tested, each
with multiple solder joints. A CSP is said to have failed when one of its solder
joints disconnects. The tests were run until all units failed. (Note that running all
units to failure is generally a poor practice when we are interested in estimating
the lower tail of the life distribution.) The mean life for a test profile is the
average of the numbers of cycles to failure of the five units that underwent
the same condition. The mean life data are also shown in Table 7.2. By using
the Norris–Landzberg relationship, estimate the activation energy and the mean
life under the use profile, where we assume that Tmin = −10°C, Tmax = 25°C,
and f = 1 cycle per hour. Also calculate the acceleration factor between the
use profile and the accelerating profile, where T'min = −30°C, T'max = 105°C, and
f' = 2 cycles per hour.
SOLUTION Equation (7.14) is fitted to the data. The multiple linear regression
analysis was performed with Minitab. The analysis results are summarized in
Table 7.3.
The large F value in the analysis of variance summarized in Table 7.3 indi-
cates that there exists a transformed linear relationship between the mean life and
at least some of the cycling variables. In general, (7.14) needs to be checked for
lack of fit. Doing so usually requires repeated observations at the same test condi-
tions (Hines et al., 2002). Such observations were not given in this paper (Shohji
et al., 2004). The analysis in Table 7.3 also shows that the cycling frequency is
not statistically significant due to its small T value, and may be excluded from
the model. In this example, we keep this term and have
ln(L̂) = 9.517 − 2.064 ln(ΔT) + 0.345 ln(f) + 2006.4/Tmax.   (7.16)
Analysis of Variance
Source           DF        SS        MS       F       P
Regression        3   16.1083    5.3694   88.81   0.000
Residual Error    8    0.4837    0.0605
Total            11   16.5920
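A fit of this form can be reproduced with ordinary least squares. A minimal sketch, using hypothetical cycling data in place of the Table 7.2 values:

import numpy as np

dT   = np.array([70., 100., 100., 135., 165., 165.])         # temperature ranges, K (hypothetical)
f    = np.array([1., 1., 2., 2., 1., 2.])                    # cycling frequencies, cycles/h (hypothetical)
Tmax = np.array([358., 373., 373., 398., 423., 423.])        # maximum temperatures, K (hypothetical)
L    = np.array([9000., 4300., 4800., 2300., 1500., 1600.])  # mean cycles to failure (hypothetical)

# ln(L) = a + b ln(dT) + c ln(f) + d / Tmax, estimated by least squares
X = np.column_stack([np.ones_like(dT), np.log(dT), np.log(f), 1.0 / Tmax])
coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
print(coef)      # estimates of a, b, c, d; compare with the form of (7.16)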
where L is the nominal life, V the stress, and A and B are constants depen-
dent on material properties, product design, failure criteria, and other factors.
It is often used for the life of dielectrics subjected to voltage V . It is worth
noting that the inverse power relationship may apply to a stress other than volt-
age, including mechanical load, pressure, electrical current, and some others.
For example, Harris (2001) applies the relationship to the life of a bearing as
a function of mechanical load, and Black (1969) expresses the median life to
electromigration failure of microcircuit conductors as an inverse power function
of the current density at a given temperature. In addition, the Coffin–Manson
relationship (7.10) and the usage rate model (discussed later) are special cases
of the inverse power relationship.
For the convenience of data analysis, we transform (7.17) into a linear rela-
tionship as
ln(L) = a + b ln(V ), (7.18)
where a = ln(A) and b = −B. Both a and b are estimated from test data.
The acceleration factor between two stress levels is
Af = (V'/V)^B,   (7.19)
FIGURE 7.11 Scatter plot and regression line fitted to the mean life of the capacitors (ln L versus ln V)
SOLUTION The mean life at an elevated voltage is the average of the lifetimes
at that voltage. The resulting mean lives are shown in Table 7.4. Then (7.18) is
used to fit the mean life data at each voltage level. Simple linear regression anal-
ysis gives â = 20.407 and b̂ = −2.738. The regression line and raw life data are
plotted in Figure 7.11. The estimates of A and B are Â = exp(20.407) = 7.289 × 10^8
and B̂ = 2.738. The mean life at 50 V is L̂50 = 7.289 × 10^8/50^2.738 = 16,251
hours.
The acceleration factor between 50 and 120 V is Âf = (120/50)^2.738 = 10.99.
Then 1500 hours at 120 V is equivalent to 1500 × 10.99 = 16,485 hours at 50 V.
That is, if a capacitor ran 1500 hours at 120 V without failure, the capacitor would
have survived 16,485 hours at 50 V.
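The point estimates above are easily reproduced from the fitted coefficients; a minimal sketch:

import math

a_hat, b_hat = 20.407, -2.738
A_hat, B_hat = math.exp(a_hat), -b_hat

L50 = A_hat / 50 ** B_hat          # estimated mean life at 50 V, hours (~16,251)
Af = (120 / 50) ** B_hat           # acceleration factor between 50 and 120 V (~10.99)
print(round(L50), round(Af, 2), round(1500 * Af))   # ~16,485 hours at 50 V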
An increased usage rate may change the usage to failure, where the usage is in cycles, revolutions, miles, or other
measures. In other words, the usage to failure at different usage rates may not be
the same. Some experimental results and theoretical explanations are shown in,
for example, Popinceanu et al. (1977), Tamai et al. (1997), Harris (2001), and
Tanner et al. (2002). G. Yang (2005) models nominal life as a power function of
the usage rate. The model is written as
L = Af^B,   (7.21)
where L is the nominal usage to failure, f is the usage rate, and A and B are
constants dependent on material properties, product design, failure criteria, and
other factors. A may be a function of other stresses if applied simultaneously.
For example, if test units are also subjected to a temperature stress, A may be a
function of temperature, say, the Arrhenius relationship. Then (7.21) is extended
to a combination model containing both the usage rate and temperature.
Increase in usage rate may prolong, shorten, or not change the usage to failure.
Equation (7.21) is flexible in accommodating these different effects. In particular,
the usage to failure decreases as the usage rate increases when B < 0, increases
with usage rate when B > 0, and is not affected by usage rate when B = 0.
In testing a group of units, the usage rate is usually held constant over time.
Then the nominal clock time τ to failure is given by
τ = Af^B/f = Af^(B−1),   (7.22)
which indicates that (1) increasing usage rate in a test shortens the clock lifetime
and test length when B < 1, (2) does not affect the clock lifetime and test length
when B = 1, and (3) prolongs the clock lifetime and test length when B > 1.
Clearly, the effectiveness of the acceleration method depends on the value of B.
Acceleration is achieved only when B < 1. In reality, the value of B is unknown
before testing. It can be preestimated using historical data, preliminary tests,
engineering experience, or reliability handbooks such as MIL-HDBK-217F (U.S.
DoD, 1995).
Note that (7.21) is a variant of the inverse power relationship. The linear
transformation and the acceleration factor for the usage rate model are similar to
those for the inverse power relationship. The linearized relationship is
ln(L) = a + b ln(f ), (7.23)
where a = ln(A) and b = B. When B < 1, the acceleration factor between two
usage rates is

Af = L/L' = (f/f')^B,   (7.24)
where a prime denotes the increased usage rate. It is worth noting that Af < 1
when 0 < B < 1. This also indicates that usage to failure increases with the usage
rate. Nevertheless, the clock time to failure is accelerated, and the test time is
shortened.
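A short numerical sketch of (7.22) and (7.24), with assumed values of A, B, and the two usage rates:

A, B = 1.0e4, 0.12            # assumed constants in L = A * f**B
f_use, f_test = 0.5, 5.0      # assumed usage rates, cycles per hour

L_use, L_test = A * f_use ** B, A * f_test ** B        # usage to failure
tau_use, tau_test = L_use / f_use, L_test / f_test     # clock time to failure, (7.22)

print(L_use / L_test)        # Af from (7.24): less than 1 here because 0 < B < 1
print(tau_use / tau_test)    # clock-time ratio: greater than 1, so the test is still shorter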
FIGURE 7.12 Regression line fitted to the mean life data of the micro relays (ln L versus ln f)
Example 7.4 Tamai et al. (1997) study the effects of switching rate on the
contact resistance and life of micro relays. They report that the number of cycles
to failure increases with the switching rate before a monolayer is formed. A
sample of the relays was exposed to an environment containing silicon vapor
at the concentration of 1300 ppm and loaded with 10 V and 0.5 A dc. The
mean cycles to failure of the relays at switching rates of 0.05, 0.3, 0.5, 1, 5,
10, and 20 Hz are approximately 600, 740, 780, 860, 1100, 1080, and 1250
cycles, respectively, which were read from the charts in the paper. Estimate both
the mean cycles to failure at a switching rate of 0.01 Hz, and the usage rate
acceleration factor between 0.01 and 5 Hz for the given environment.
SOLUTION Equation (7.23) is fitted to the mean cycles to failure and the
switching rate. Simple linear regression analysis gives ln(L̂) = 6.756 + 0.121
ln(f). Hence, L̂ = 859.45 f^0.121. This regression line is shown in Figure 7.12,
which suggests that (7.21) models the relationship adequately. The estimate of the
mean cycles to failure at 0.01 Hz is L̂ = 859.45 × 0.01^0.121 = 491 cycles. The
usage rate acceleration factor between 0.01 and 5 Hz is Âf = (0.01/5)^0.121 =
0.47. Note that the acceleration factor is less than 1. This indicates that the number
of cycles to failure at 5 Hz is larger than the number at 0.01 Hz. However, the
use of 5 Hz reduces the test clock time because B̂ = 0.121 < 1.
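The fit can be reproduced from the mean-life data quoted above; a minimal sketch:

import numpy as np

f = np.array([0.05, 0.3, 0.5, 1., 5., 10., 20.])             # switching rates, Hz
L = np.array([600., 740., 780., 860., 1100., 1080., 1250.])  # mean cycles to failure

b, a = np.polyfit(np.log(f), np.log(L), 1)    # ln(L) = a + b ln(f)
print(a, b)                                   # roughly 6.756 and 0.121
print(np.exp(a) * 0.01 ** b)                  # mean cycles to failure at 0.01 Hz, ~491
print((0.01 / 5) ** b)                        # usage rate acceleration factor, ~0.47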
Motivated by Nelson (1990, 2004), Bai and Yun (1996) propose a relationship
between failure rate and product size as
λ'(t) = (s'/s)^B λ(t),   (7.25)

where λ(t) and λ'(t) are the product failure rates at sizes s and s', respectively,
and B is the size effect, a constant dependent on material properties, failure
criteria, product design, and other factors. When B = 1, the model reduces to
the one in Nelson (1990, 2004).
The size effect relationship is a special form of the proportional hazards model,
which is due to Cox (1972) and discussed in, for example, Meeker and Escobar
(1998) and Blischke and Murthy (2000). Since the model describes the effect of
test condition (or size in this context) on failure rate rather than on lifetime, the
life at the use condition is not simply the one at the test condition multiplied
by an acceleration factor. Due to the complexity, the application of (7.25) to
accelerated life tests is currently limited to a few situations. When the life of a
product is modeled with the Weibull distribution, (7.25) can be written as
α/α' = (s'/s)^(B/β),   (7.26)

where α and α' are the characteristic lives of the Weibull distribution at sizes s
and s', respectively, and β is the common shape parameter. From (7.3), we see
that (7.26) is the acceleration factor between the two sizes.
where RH is the relative humidity, A and B are constants, and other notation is
that of the Arrhenius relationship. By analyzing the published data, Peck (1986)
found values of B between −2.5 and −3.0, and values of Ea between 0.77 and
0.81 eV. Then Hallberg and Peck (1991) updated the values to B = −3.0 and
Ea = 0.9 eV. Although the relationship was regressed from a limited number of
products, it may be applicable to others, albeit with different parameter
values. For example, the model fits test data adequately on gallium arsenide
pseudomorphic high-electron-mobility transistor (GaAs pHEMT) switches and
yields an estimate of B = −10.7 (Ersland et al., 2004).
Note that (7.28) can be derived from the generalized Eyring model by omitting
the first and last terms and setting S = ln(RH ). The logarithm of (7.28) gives
ln(L) = a + b ln(RH) + c/T,   (7.29)
where a = ln(A), b = −B, and c = Ea/k. The acceleration factor between the
life at T and RH and the life at T' and (RH)' is

Af = [(RH)'/RH]^B exp[(Ea/k)(1/T − 1/T')].   (7.30)
where V is the voltage, A, B, and C are constants, and other notation is that
of the Arrhenius relationship. In practice, the last term is often assumed nonex-
istent if the interaction between temperature and voltage is not strongly evident
(see, e.g., Mogilevsky and Shirn, 1988; Al-Shareef and Dimos, 1996). Then the
resulting simplified relationship accounts for only the main effects of the stresses,
which are described individually by the Arrhenius relationship and the inverse
power relationship.
where I is the electrical current in amperes, A and B are constants, and other
notation is that of the Arrhenius relationship. In the context of electromigration,
I represents the current density in amperes per square unit length. Then (7.32)
is called Black’s (1969) equation, and it has been used extensively.
Censoring Mechanisms Often, tests must be terminated before all units fail.
Such situations cause censoring. Censoring results in fewer observed failures and
increases statistical error. Nevertheless, when test resources such as time, equipment
capacity, and personnel are limited, we must use censoring, albeit reluctantly, to
shorten the test time. In practice, there are three types of censoring: type I, type II,
and random.
In type I censoring, also known as time censoring, a test is suspended when
a predetermined time is reached on all unfailed units. That time is called the
censoring time. In situations where product life is characterized by both time
and usage, the censoring mechanism specifies the censoring time and usage. The
test is terminated at the prespecified time or usage, whichever comes first. For
example, automobiles tested in a proving ground are subject to time and mileage
censoring, and a vehicle is removed from a test as soon as its accumulated time
or mileage reaches the predetermined value. Type I censoring yields a random
number of failures, which sometimes may be zero. It is important to ensure that
the censoring time is long enough to fail some units; otherwise, data analysis
is difficult or impossible. This type of censoring is common in practice, due to
convenient time management.
Type II censoring, also called failure censoring, results when a test is
terminated once a prespecified number of failures is reached. This censoring
method yields a fixed number of failures, which is appealing for statistical
data analysis. On the other hand, the censoring time is a random variable, which
makes scheduling the test within time constraints difficult. Because of this
disadvantage, type II censoring is less common in practice.
Random censoring is the termination of a test at random. This type of censoring
is often the result of an accident occurring during testing. For example, the failure
of test equipment or damage to the sample causes suspension of a test. Random
censoring also occurs when a unit fails from a mode that is not of interest. This
type of censoring results in both random test time and a random number of
failures.
Types of Data ALTs may yield various types of data, depending on data collec-
tion methods and censoring methods. If the test units are monitored continuously
during testing, the test yields the exact life when a unit fails. In contrast, test
units are often inspected periodically during testing, and failures are not detected
until inspection. Then the failure times are known to be between the times of the
last and current inspections, and they are interval life data. As a special case, if
a unit has failed before the first inspection time, the life of the unit is said to
be left censored. In contrast, if a unit survives the censoring time, the life of the
unit is right censored. If all surviving units have a common running time at test
termination, their data are called singly right censored. For this type of data to
occur, one needs to plan and conduct a test carefully. In practice, the censored
units often have different running times. The data of such units are said to be
multiply right censored. This situation arises when some units have to be removed
from the test earlier or when the units are started on the test at different times
and censored at the same time. If the censoring time is long enough to allow all units
to fail, the resulting failure times are complete life data. But this is usually poor
practice for life tests, especially when only the lower tail of the life distribution
is of interest.
For visualization, the data are plotted on probability paper, which has special scales that
linearize a cdf. If a life data set plots close to a straight line on Weibull probability
paper, the Weibull distribution describes the population adequately. In general, a
linearized cdf can be written as
y = a + bx, (7.33)
where x and y are the transformed time and cdf, and a and b are related to the
distribution parameters. Now let’s work out the specific forms of a, b, x, and y
for the commonly used distributions.
F (t) = 1 − exp(−λt),
where λ is the failure rate. This cdf is linearized and takes the form of (7.33),
where y = ln[1/(1 − F )], x = t, a = 0, and b = λ. Exponential probability paper
can be constructed with the transformed scale ln[1/(1 − F )] on the vertical axis
and the linear scale t on the horizontal axis. Any exponential cdf is a straight line
on such paper. If a data set plots near a straight line on this paper, the exponential
distribution is a reasonable model.
The value of λ is the slope of the cdf line. Since λt = 1 when 1 − F = e^−1,
or F = 0.632, the estimate of λ is equal to the reciprocal of the time at which
F = 0.632.
The Weibull cdf is F(t) = 1 − exp[−(t/α)^β], where α and β are the characteristic life and the shape parameter, respectively.
This cdf is linearized and takes the form of (7.33), where y = ln ln[1/(1 − F)],
x = ln(t), a = −β ln(α), and b = β. A Weibull probability paper has the transformed
scale ln ln[1/(1 − F)] on the vertical axis and ln(t) (a log scale) on the
horizontal axis. The Weibull distribution adequately models a data set if the data
points are near a straight line on the paper.
The parameters α and β can be estimated directly from the plot. Note that
when 1 − F = e^−1, or F = 0.632, −β ln(α) + β ln(t) = 0, or α = t0.632. Namely,
the value of the characteristic life is the time at which F = 0.632. The shape
parameter is the slope of the straight line on the transformed scales. Some Weibull
papers have a special scale for estimating β.
The normal cdf is F(t) = Φ[(t − µ)/σ], where µ and σ are, respectively, the location and scale parameters or the mean and
standard deviation, and Φ(·) is the cdf of the standard normal distribution. This
cdf is linearized and takes the form of (7.33), where y = Φ^−1(F) and Φ^−1(·) is
the inverse of Φ(·), x = t, a = −µ/σ, and b = 1/σ. Normal probability paper has
a Φ^−1(F) scale on the vertical axis and the linear data scale t on the horizontal
axis. On such paper any normal cdf is a straight line. If data plotted on such
paper are near a straight line, the normal distribution is a plausible model.
The parameters µ and σ can be estimated from the plot. When F = 0.5,
t0.5 = µ. Thus, the value of the mean is the time at which F = 0.5. Similarly,
when F = 0.841, t0.841 = µ + σ . Then σ = t0.841 − µ. Alternatively, σ can be
estimated by the reciprocal of the slope of the straight line.
The lognormal cdf is F(t) = Φ{[ln(t) − µ]/σ}, where µ and σ are the scale and shape parameters, respectively. This cdf is
linearized and takes the form of (7.33), where y = Φ^−1(F), x = ln(t), a = −µ/σ,
and b = 1/σ. A lognormal probability plot has a Φ^−1(F) scale on the vertical
axis and an ln(t) scale on the horizontal axis. The plot is similar to the plot for
the normal distribution except that the horizontal axis here is the log scale. If the
life data are lognormally distributed, the plot exhibits a straight line.
The median t0.5 can be read from the time scale at the point where F =
0.5. Then the scale parameter is µ = ln(t0.5 ). Similar to the normal distribution,
when F = 0.841, t0.841 = exp(µ + σ ). Thus, σ = ln(t0.841 ) − µ, where t0.841 is
read from the time scale at the point where F = 0.841. Alternatively, σ can be
estimated by the reciprocal of the slope of the straight line. Here base e (natural)
logarithms are used; base 10 logarithms are used in some applications.
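A minimal sketch of least-squares fitting on the transformed Weibull scales; the data are hypothetical, and Benard's median-rank approximation is used as the plotting position (one common choice; the book's (7.34) may differ):

import numpy as np

t = np.sort(np.array([152., 285., 390., 510., 640., 795., 990., 1250.]))  # complete sample
n = len(t)
i = np.arange(1, n + 1)
F = (i - 0.3) / (n + 0.4)              # median-rank approximation for Fi

x = np.log(t)                          # x = ln(t)
y = np.log(np.log(1.0 / (1.0 - F)))    # y = ln ln[1/(1 - F)]
b, a = np.polyfit(x, y, 1)             # y = a + b x

beta_hat = b                           # slope estimates the shape parameter
alpha_hat = np.exp(-a / b)             # from a = -beta ln(alpha)
print(beta_hat, alpha_hat)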
SOLUTION The data are singly censored on the right at 100 and 120°C, and
complete at 150°C. We first analyze the data at 100°C. The failure times are
ordered from smallest to largest, as shown in Table 7.5. The Fi for each t(i) is calculated from (7.34).
Multiply Right-Censored Exact Data For multiply censored data, the plotting
position calculation is more complicated than that for complete or singly censored
data. Kaplan and Meier (1958) suggest a product-limit estimate given by
Fi = 1 − ∏_{j=1}^{i} [(n − j)/(n − j + 1)]^δj,   (7.35)
where n is the number of observations, i the rank of the ith ordered observa-
tion, and δj the indicator. If observation j is censored, δj = 0; if observation
j is uncensored, δj = 1. Other plotting positions may be used; some software
packages (e.g., Minitab) provide multiple options, including this Kaplan–Meier
position. The plotting procedures are the same as those for complete or singly
censored data, and are illustrated in Example 7.6.
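A minimal sketch of (7.35) for a hypothetical multiply right-censored sample:

import numpy as np

time  = np.array([120., 190., 250., 310., 400., 480., 560., 700.])  # ordered observations
delta = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # 1 = failure, 0 = censored

n = len(time)
j = np.arange(1, n + 1)
factors = ((n - j) / (n - j + 1.0)) ** delta
F = 1.0 - np.cumprod(factors)          # Kaplan-Meier plotting position Fi
for ti, di, Fi in zip(time, delta, F):
    print(ti, "failure" if di else "censored", round(Fi, 3))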
Interval Data Often, test units are not monitored continuously during testing,
due to technological or economic limitations; rather, they are inspected periodi-
cally. Then failures are not detected until inspection. Thus, the failure times do
not have exact values; they are interval data. Let ti (i = 1, 2, . . . , m) denote the
ith inspection time, where m is the total number of inspections. Then the inter-
vals preceding the m inspections are (t0 , t1 ], (t1 , t2 ], . . . , (tm−1 , tm ]. Suppose that
inspection at ti yields ri failures, where 0 ≤ ri ≤ n and n is the sample size. The
exact failure times are unknown; we spread them uniformly over the interval.
Thus, the failure times in (ti−1 , ti ] are approximated by
tij = ti−1 + j(ti − ti−1)/(ri + 1),   i = 1, 2, . . . , m;  j = 1, 2, . . . , ri,   (7.36)
where tij is the failure time of unit j failed in interval i. Intuitively, when only
one failure occurs in an interval, the failure time is estimated by the midpoint
of the interval. After each failure is assigned a failure time, we perform the
probability plotting by using the approximate exact life data, where the plotting
position is determined by (7.34) or (7.35), depending on the type of censoring.
We illustrate the plotting procedures in the following example.
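A minimal sketch of (7.36), with hypothetical inspection times and failure counts:

insp_times = [0.0, 100.0, 200.0, 300.0]   # t0, t1, t2, t3
failures   = [2, 1, 3]                    # r1, r2, r3 detected at t1, t2, t3

approx_lives = []
for i in range(1, len(insp_times)):
    t_lo, t_hi, r = insp_times[i - 1], insp_times[i], failures[i - 1]
    for j in range(1, r + 1):
        approx_lives.append(t_lo + j * (t_hi - t_lo) / (r + 1))
print(approx_lives)    # a single failure in an interval is placed at its midpoint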
SOLUTION As the data indicate, test units 4 and 7 of the “before” group have
a failure mode different from the critical one of concern. This is so for units 1 and
2 of the “after” group. They are considered as censored units in the subsequent
data analysis, because the critical modes observed would have occurred later. In
addition, the data of both groups are censored on the right.
To estimate the two life distributions, we first approximate the life of each
failed unit by using (7.36), as shown in Table 7.6. The approximate lives are
treated as if they were exact. The corresponding plotting positions are calculated
using (7.35) and presented in Table 7.6. Since the Weibull distribution is sug-
gested by historical data, the data are plotted on Weibull probability paper. The
plot in Figure 7.14 was produced with Minitab. Figure 7.14 suggests that the dis-
tribution is adequate for the test data. The least squares estimates of the Weibull
FIGURE 7.14 Weibull probability plots of the before and after groups
parameters before the design change are α̂B = 3.29 × 10^5 cycles and β̂B = 1.30. The
design engineers were interested in the B10 life, which is estimated from (2.25) as
B̂10,B = 3.29 × 10^5[− ln(1 − 0.10)]^(1/1.30) = 5.83 × 10^4 cycles.
For the after group, α̂A = 7.33 × 10^5 cycles and β̂A = 3.08. The B10 estimate is
B̂10,A = 7.33 × 10^5[− ln(1 − 0.10)]^(1/3.08) = 3.53 × 10^5 cycles.
The design change greatly increased the B10 life. Figure 7.14 shows further
that the life of the after group is considerably longer than that of the before
group, especially at the lower tail. Therefore, it can be concluded that the fix is
effective in delaying the occurrence of the critical failure mode.
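The B10 estimates follow directly from the fitted Weibull parameters; a minimal sketch:

import math

def b10(alpha, beta):
    # B10 life: the time by which 10% of the population fails
    return alpha * (-math.log(1 - 0.10)) ** (1 / beta)

print(b10(3.29e5, 1.30))   # before the design change, cycles
print(b10(7.33e5, 3.08))   # after the design change, about 3.53e5 cycles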
where l(θ) is called the likelihood function. Since Δt does not depend on θ,
this term can be omitted in subsequent estimation of the model parameter(s).
Then (7.37) simplifies to
l(θ) = ∏_{i=1}^{n} f(ti; θ).   (7.38)
For the sake of numerical calculation, the log likelihood is used in applications.
Then (7.38) is rewritten as
L(θ) = Σ_{i=1}^{n} ln[f(ti; θ)],   (7.39)
where L(θ ) = ln[l(θ )] is the log likelihood and depends on the model parame-
ter(s) θ . The ML estimate of θ is the value of θ that maximizes L(θ ). Sometimes,
the estimate of θ is obtained by solving
∂L(θ)/∂θ = 0.   (7.40)
Other times, it is found by iteratively finding the value of θ that maximizes
L(θ ). The resulting estimate θ̂ , which is a function of t1 , t2 , . . . , tn , is called
the maximum likelihood estimator (MLE). If θ is a vector of k parameters, their
estimators are determined by solving k equations each like (7.40) or by iteratively
maximizing L(θ ) directly. In most situations, the calculation requires numerical
iteration and is done using commercial software. It is easily seen that the form of
the log likelihood function varies with the assumed life distribution. Furthermore,
it also depends on the type of data, because the censoring mechanism and data
collection method (continuous or periodical inspection) affect the joint probability
shown in (7.37). The log likelihood functions for various types of data are given
below.
Complete Exact Data As discussed earlier, such data occur when all units are
run to failure and subjected to continuous inspection. The log likelihood function
for such data is given by (7.39). Complete exact data yield the most accurate
estimates.
Right-Censored Exact Data When test units are time censored on the right
and inspected continuously during testing, the observations are right-censored
exact failure times. Suppose that a sample of size n yields r failures and n − r
censoring times. Let t1 , t2 , . . . , tr denote the r failure times, and tr+1 , tr+2 , . . . , tn
denote the n − r censoring times. The probability that censored unit i survives
beyond its censoring time ti is 1 − F(ti; θ), where F(t; θ) is the cdf corresponding to f(t; θ).
Then the sample log likelihood function is
L(θ) = Σ_{i=1}^{r} ln[f(ti; θ)] + Σ_{i=r+1}^{n} ln[1 − F(ti; θ)].   (7.41)
When the censoring times tr+1, tr+2, . . . , tn are all equal, the data are singly
censored data. If at least two of them are unequal, the data are multiply censored data.
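A minimal sketch of maximizing (7.41) numerically, here for a Weibull distribution with hypothetical right-censored exact data and scipy's general-purpose optimizer:

import numpy as np
from scipy.optimize import minimize

time = np.array([150., 230., 310., 420., 500., 500., 500., 500.])   # hours
fail = np.array([1, 1, 1, 1, 0, 0, 0, 0])     # 1 = failure, 0 = right censored

def neg_log_lik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf
    z = (time / alpha) ** beta
    log_f = np.log(beta / alpha) + (beta - 1) * np.log(time / alpha) - z
    log_S = -z                                 # ln[1 - F(t)] for the Weibull
    return -(fail * log_f + (1 - fail) * log_S).sum()

res = minimize(neg_log_lik, x0=[400.0, 1.5], method="Nelder-Mead")
print(res.x)     # MLEs of alpha and beta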
Complete Interval Data Sometimes all test units are run to failure and inspected
periodically during testing, usually all with the same inspection schedule. The
situation results in complete interval data. Let ti (i = 1, 2, . . . , m) be the ith
inspection time, where m is the total number of inspections. Then the m inspection
intervals are (t0, t1], (t1, t2], . . . , (tm−1, tm]. Suppose that inspection at ti detects
ri failures, where 0 ≤ ri ≤ n, n = Σ_{i=1}^{m} ri, and n is the sample size. Since a
failure is known only to fall within its inspection interval, the sample log likelihood function is
L(θ) = Σ_{i=1}^{m} ri ln[F(ti; θ) − F(ti−1; θ)].   (7.42)
If, in addition, di surviving units are removed (censored) at inspection time ti, the sample log likelihood function becomes

L(θ) = Σ_{i=1}^{m} ri ln[F(ti; θ) − F(ti−1; θ)] + Σ_{i=1}^{m} di ln[1 − F(ti; θ)].   (7.43)
1. Calculate all second partial derivatives of the sample log likelihood function
with respect to the model parameters.
where θi,L and θi,U are the lower and upper bounds, and z1−α/2 is the
100(1 − α/2)th standard normal percentile. Note that (7.46) assumes that
θ̂i has a normal distribution with mean θi and standard deviation √V̂ar(θ̂i).
The normality may be adequate when the number of failures is moderate
to large (say, 15 or more). The one-sided 100(1 − α)% confidence bound
is easily obtained by replacing z1−α/2 with z1−α and using the appropri-
ate sign in (7.46). When θi is a positive parameter, ln(θ̂i ) may be better
approximated using the normal distribution. The resulting positive confidence
interval is
[θi,L, θi,U] = θ̂i exp[± z1−α/2 √V̂ar(θ̂i) / θ̂i].   (7.47)
where the ∂g/∂θi are evaluated at θ̂1 , θ̂2 , . . . , θ̂k . If the correlation between the
parameter estimates is weak, the second term in (7.48) may be omitted.
The two-sided approximate 100(1 − α)% confidence interval for g is

[gL, gU] = ĝ ± z1−α/2 √V̂ar(ĝ).   (7.49)
Complete Exact Data The sample log likelihood for such data is obtained by
substituting (7.50) into (7.39). Then we have
L(θ) = −n ln(θ) − (1/θ) Σ_{i=1}^{n} ti.   (7.51)
The MLE of θ is
θ̂ = (1/n) Σ_{i=1}^{n} ti.   (7.52)
The MLE of the failure rate is λ̂ = 1/θ̂. The estimates of the 100pth percentile,
reliability, probability of failure (population fraction failing), and other quantities
are obtained by substituting θ̂ or λ̂ into the appropriate formula in Chapter 2.
The estimate of the variance of θ̂ is
V̂ar(θ̂) = θ̂²/n.   (7.53)
The two-sided 100(1 − α)% confidence interval for θ is [θL, θU] = [2nθ̂/χ²_{1−α/2;2n}, 2nθ̂/χ²_{α/2;2n}],
where χ²_{p;2n} is the 100pth percentile of the χ² (chi-square) distribution with 2n
degrees of freedom.
The confidence interval for the failure rate λ is [λL, λU] = [1/θU, 1/θL].
The lower 90% confidence bound at 40°C is θL = 22.7 × 786.6 = 17,856 hours.
Since θL = 17,856 > 15,000, we conclude that the design surpasses the MTTF
requirement at the 90% confidence level.
Right-Censored Exact Data Suppose that r out of n units fail in a test and
the remainder are censored on the right (type II censoring). The failure times are
t1, t2, . . . , tr, and the censoring times are tr+1, tr+2, . . . , tn. Then the sum Σ_{i=1}^{n} ti is
the total test time. Formulas (7.52) to (7.55) can be used for the right-censored
exact data by replacing the sample size n with the number of failures r. The
resulting formulas may apply to type I censoring, but the confidence interval
derived from (7.55) is no longer exact.
Example 7.8 Refer to Example 7.7. Suppose that the design verification test
has to be censored at 1100 hours. Determine if the design meets the MTTF
requirement at the 90% confidence level by using the censored data.
where the prime denotes an accelerating condition. Since the number of failures
is small, the normal-approximation confidence interval may not be accurate. We
calculate the confidence interval from the chi-square distribution. The one-sided
lower 90% confidence bound is
\theta_L = \frac{2r\hat{\theta}}{\chi^2_{(1-\alpha);2r}} = 2 \times 8 \times \frac{1414.8}{23.54} = 961.6 \text{ hours.}

The lower 90% confidence bound at 40◦C is θL = 22.7 × 961.6 = 21,828 hours.
The lower 90% confidence bound is greater than 15,000 hours. So we conclude
that the design surpasses the MTTF requirement at the 90% confidence level.
But note that the early censoring yields an optimistic estimate of the mean life
as well as a lower confidence bound.
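The chi-square bound used in this example can be reproduced as sketched below, using the values r = 8, θ̂ = 1414.8 hours, and the acceleration factor 22.7 quoted above; the scipy call is simply one way to evaluate the chi-square percentile.

```python
# Reproducing the one-sided lower 90% bound of Example 7.8.
from scipy.stats import chi2

r, theta_hat, alpha = 8, 1414.8, 0.10
theta_L_test = 2 * r * theta_hat / chi2.ppf(1 - alpha, 2 * r)   # ~961.6 hours at test condition
theta_L_use = 22.7 * theta_L_test                               # ~21,828 hours at 40 C
print(theta_L_test, theta_L_use, theta_L_use > 15000)
```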
Interval Data Following the notation in (7.42), the sample log likelihood func-
tion for the complete interval data is
L(\theta) = \sum_{i=1}^{m} r_i \ln\!\left[\exp\!\left(-\frac{t_{i-1}}{\theta}\right) - \exp\!\left(-\frac{t_i}{\theta}\right)\right].   (7.56)
Equating to zero the derivative of (7.56) with respect to θ does not yield a closed-
form expression for θ . The estimate of θ is obtained by maximizing L(θ ) through
a numerical algorithm: for example, the Newton–Raphson method. The Solver
of Microsoft Excel provides a convenient means for solving a small optimization
problem like this. Most statistical and reliability software packages calculate this
estimate. Confidence intervals for the mean life and other quantities may be
computed as described earlier.
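A minimal sketch of this numerical maximization is shown below, assuming hypothetical inspection times and failure counts; scipy is used in place of the Excel Solver or a hand-coded Newton–Raphson routine.

```python
# Maximizing the complete interval-data log likelihood (7.56) numerically.
# Inspection times and failure counts are hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar

t = np.array([0.0, 100.0, 200.0, 300.0, 400.0])   # t_0, t_1, ..., t_m
r = np.array([2, 5, 4, 3])                        # r_i failures in (t_{i-1}, t_i]

def neg_loglik(theta):
    p = np.exp(-t[:-1] / theta) - np.exp(-t[1:] / theta)   # interval probabilities
    return -np.sum(r * np.log(p))

res = minimize_scalar(neg_loglik, bounds=(1.0, 1e4), method="bounded")
print(res.x)   # MLE of the mean life theta
```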
Using the notation in (7.43), we obtain the sample log likelihood function for
right-censored interval data as
L(\theta) = \sum_{i=1}^{m} r_i \ln\!\left[\exp\!\left(-\frac{t_{i-1}}{\theta}\right) - \exp\!\left(-\frac{t_i}{\theta}\right)\right] - \frac{1}{\theta}\sum_{i=1}^{m} d_i t_i.   (7.57)
where β is the shape parameter and α is the scale parameter or characteristic life.
Complete Exact Data When the data are complete and exact, the sample log
likelihood function is obtained by substituting (7.58) into (7.39). Then we have
L(\alpha,\beta) = \sum_{i=1}^{n}\left[\ln(\beta) - \beta\ln(\alpha) + (\beta-1)\ln(t_i) - \left(\frac{t_i}{\alpha}\right)^{\beta}\right].   (7.59)
The estimates α̂ and β̂ may be obtained by maximizing (7.59); the numerical
calculation frequently uses the Newton–Raphson method, whose efficiency
and convergence depend on the initial values. Qiao and Tsokos (1994) propose a
more efficient numerical algorithm for solving the optimization problem. Alter-
natively, the estimators can be obtained by solving the likelihood equations. To
do this, we take the derivative of (7.59) with respect to α and β, respectively.
Equating the derivatives to zero and further simplification yield
\frac{\sum_{i=1}^{n} t_i^{\beta}\ln(t_i)}{\sum_{i=1}^{n} t_i^{\beta}} - \frac{1}{\beta} - \frac{1}{n}\sum_{i=1}^{n}\ln(t_i) = 0,   (7.60)

\alpha = \left(\frac{1}{n}\sum_{i=1}^{n} t_i^{\beta}\right)^{1/\beta}.   (7.61)
Equation (7.60) contains only one unknown parameter β and can be solved
iteratively to get β̂ with a numerical algorithm. Farnum and Booth (1997) provide
a good starting β value for the iteration. Once β̂ is obtained, it is substituted
into (7.61) to calculate α̂. The estimates may be heavily biased when the number
of failures is small; in such cases, bias-correction methods provide better estimates. Thoman
et al. (1969) tabulate bias correction coefficients for various values of the sample
size and shape parameter. R. Ross (1994) formulates the correction factor for the
estimate of the shape parameter as a function of the sample size. Hirose (1999)
also provides a simple formula for unbiased estimates of the shape and scale
parameters as well as the percentiles.
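As a sketch of the iterative solution, (7.60) can be solved for β with a standard root finder and the result substituted into (7.61); the failure times below are hypothetical, and no bias correction is applied.

```python
# Solving the Weibull likelihood equation (7.60) for beta, then (7.61) for alpha.
# The complete-exact failure times below are hypothetical.
import numpy as np
from scipy.optimize import brentq

t = np.array([105.0, 180.0, 240.0, 330.0, 455.0, 610.0])

def eq_760(beta):
    tb = t**beta
    return (tb * np.log(t)).sum() / tb.sum() - 1.0 / beta - np.log(t).mean()

beta_hat = brentq(eq_760, 0.1, 20.0)                  # iterative solution of (7.60)
alpha_hat = ((t**beta_hat).mean())**(1.0 / beta_hat)  # substitution into (7.61)
print(beta_hat, alpha_hat)
```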
The estimate of the 100pth percentile, reliability, failure probability (popula-
tion fraction failing), or other quantities can be obtained by substituting α̂ and β̂
into the corresponding formula in Chapter 2.
The two-sided 100(1 − γ )% confidence intervals for α and β are
[\alpha_L, \alpha_U] = \hat{\alpha} \pm z_{1-\gamma/2}\sqrt{\hat{\mathrm{Var}}(\hat{\alpha})},   (7.62)

[\beta_L, \beta_U] = \hat{\beta} \pm z_{1-\gamma/2}\sqrt{\hat{\mathrm{Var}}(\hat{\beta})}.   (7.63)
The estimates of these variances are computed from the inverse local Fisher
information matrix as described in Section 7.6.2. The log transformation of α̂
and β̂ may result in a better normal approximation. From (7.47), the approximate
confidence intervals are
[\alpha_L, \alpha_U] = \hat{\alpha}\exp\!\left[\pm\frac{z_{1-\gamma/2}\sqrt{\hat{\mathrm{Var}}(\hat{\alpha})}}{\hat{\alpha}}\right],   (7.64)

[\beta_L, \beta_U] = \hat{\beta}\exp\!\left[\pm\frac{z_{1-\gamma/2}\sqrt{\hat{\mathrm{Var}}(\hat{\beta})}}{\hat{\beta}}\right].   (7.65)
Here G(w) is the cdf of the standard smallest extreme value distribution.
where

\hat{\mathrm{Var}}(\hat{t}_p) = \exp\!\left(\frac{2u_p}{\hat{\beta}}\right)\hat{\mathrm{Var}}(\hat{\alpha}) + \left(\frac{\hat{\alpha}u_p}{\hat{\beta}^2}\right)^{\!2}\exp\!\left(\frac{2u_p}{\hat{\beta}}\right)\hat{\mathrm{Var}}(\hat{\beta}) - \frac{2\hat{\alpha}u_p}{\hat{\beta}^2}\exp\!\left(\frac{2u_p}{\hat{\beta}}\right)\hat{\mathrm{Cov}}(\hat{\alpha},\hat{\beta}),

u_p = \ln[-\ln(1-p)].
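A short sketch of the percentile estimate and its variance follows; the α̂ and β̂ values are those of the before group in Example 7.9, while the variance and covariance figures are hypothetical placeholders.

```python
# Weibull percentile t_p = alpha*exp(u_p/beta) and its delta-method variance (above).
import math

alpha_hat, beta_hat = 3.61e5, 1.66            # estimates from Example 7.9 (before group)
var_a, var_b, cov_ab = 7.0e9, 0.25, 1.5e4     # hypothetical Var/Cov of the MLEs
p = 0.10

u = math.log(-math.log(1.0 - p))              # u_p
tp = alpha_hat * math.exp(u / beta_hat)       # 100p-th percentile estimate
e = math.exp(2.0 * u / beta_hat)
var_tp = (e * var_a
          + (alpha_hat * u / beta_hat**2)**2 * e * var_b
          - 2.0 * alpha_hat * u / beta_hat**2 * e * cov_ab)
print(tp, var_tp)
```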
Right-Censored Exact Data Suppose that r out of n test units fail and the
remainder are censored on the right (type I censoring). The failure times are
t1 , t2 , . . . , tr , and the censoring times are tr+1 , tr+2 , . . . , tn . The sample log like-
lihood is
L(\alpha,\beta) = \sum_{i=1}^{r}\left[\ln(\beta) - \beta\ln(\alpha) + (\beta-1)\ln(t_i) - \left(\frac{t_i}{\alpha}\right)^{\beta}\right] - \sum_{i=r+1}^{n}\left(\frac{t_i}{\alpha}\right)^{\beta}.   (7.68)
Like (7.59) for the complete exact data, (7.68) does not yield closed-form solu-
tions for α̂ and β̂. The estimates may be obtained by directly maximizing L(α, β),
or by solving the likelihood equations:
\frac{\sum_{i=1}^{n} t_i^{\beta}\ln(t_i)}{\sum_{i=1}^{n} t_i^{\beta}} - \frac{1}{\beta} - \frac{1}{r}\sum_{i=1}^{r}\ln(t_i) = 0,   (7.69)

\alpha = \left(\frac{1}{r}\sum_{i=1}^{n} t_i^{\beta}\right)^{1/\beta}.   (7.70)
When r = n or the test is uncensored, (7.69) and (7.70) are equivalent to (7.60)
and (7.61), respectively. Like the complete data, the censored data yield biased
estimates, especially when the test is heavily censored (the number of failures is
small). Bain and Engelhardt (1991), and R. Ross (1996), for example, present a
bias correction.
The confidence intervals (7.64) to (7.66) for the complete exact data are equ-
ally applicable to the censored data here. In practice, the calculation is done
with commercial software (see the example below). Bain and Engelhardt (1991)
provide approximations to the variances and covariance of the estimates, which
are useful when hand computation is necessary.
Example 7.9 Refer to Example 7.6. Use the ML method to reanalyze the app-
roximate lifetimes. Like the graphical analysis in that example, treat the lifetimes
as right-censored exact data here.
SOLUTION The plots in Example 7.6 show that the Weibull distribution is ade-
quate for the data sets. Now we use the ML method to estimate the model param-
eters and calculate the confidence intervals. The estimates may be computed by
solving (7.69) and (7.70) on an Excel spreadsheet or a small computer program.
Then follow the procedures in Section 7.6.2 to calculate the confidence intervals.
Here the computation is performed with Minitab. For the “before” group, the ML
parameter estimates are α̂B = 3.61 × 105 cycles and β̂B = 1.66. The approxi-
mate two-sided 90% confidence intervals are [αB,L , αB,U ] = [2.47 × 105 , 5.25 ×
105 ], and [βB,L , βB,U ] = [0.98, 2.80], which can be derived from (7.64) and
(7.65), respectively. The corresponding B10 life is B̂10,B = 0.93 × 105 cycles.
Similarly, for the “after” group, Minitab gives α̂A = 7.78 × 105 cycles, β̂A =
3.50, [αA,L , αA,U ] = [6.58 × 105 , 9.18 × 105 ], [βA,L , βA,U ] = [2.17, 5.63], and
B̂10,A = 4.08 × 105 cycles. Note that the ML estimates are moderately differ-
ent from the graphical estimates. In general, the ML method provides better
estimates. Despite the difference, the two estimation methods yield the same
conclusion; that is, the design change is effective because of the great improve-
ment in the lower tail performance. Figure 7.15 shows the two probability plots,
each with the ML fit and the two-sided 90% confidence interval curves for per-
centiles. It is seen that the lower bound of the confidence interval for the after-fix
group in lower tail is greater than the upper bound for the before-fix group. This
confirms the effectiveness of the fix.
Interval Data When all units are on the same inspection schedule t1 , t2 , . . . , tm ,
the sample log likelihood for complete interval data is
L(\alpha,\beta) = \sum_{i=1}^{m} r_i \ln\!\left[\exp\!\left\{-\left(\frac{t_{i-1}}{\alpha}\right)^{\beta}\right\} - \exp\!\left\{-\left(\frac{t_i}{\alpha}\right)^{\beta}\right\}\right],   (7.71)
where ri is the number of failures in the ith inspection interval (ti−1 , ti ], and
m is the number of inspections. The likelihood function is more complicated
than that for the exact data; the corresponding likelihood equations do not yield
closed-form estimates for the model parameters. So the estimates are obtained
[Figure 7.15 (Weibull probability plots of the before and after groups, each with the ML fit and two-sided 90% confidence interval curves for percentiles) appears here.]
with numerical methods. This is also the case for the right-censored interval data,
whose sample log likelihood function is
L(\alpha,\beta) = \sum_{i=1}^{m} r_i \ln\!\left[\exp\!\left\{-\left(\frac{t_{i-1}}{\alpha}\right)^{\beta}\right\} - \exp\!\left\{-\left(\frac{t_i}{\alpha}\right)^{\beta}\right\}\right] - \sum_{i=1}^{m} d_i\left(\frac{t_i}{\alpha}\right)^{\beta},   (7.72)
where m, ri , and ti follow the notation in (7.71), and di is the number of units
censored at inspection time ti .
Confidence intervals for the interval data may be calculated from the formulas
for exact data.
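A minimal sketch of maximizing (7.72) numerically is given below; the inspection schedule, failure counts, and censoring counts are hypothetical, and a general-purpose optimizer stands in for the commercial software mentioned above.

```python
# Weibull right-censored interval-data likelihood (7.72), maximized numerically.
# Inspection times, failure counts r_i, and censoring counts d_i are hypothetical.
import numpy as np
from scipy.optimize import minimize

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # inspection times (e.g., 10^5 cycles)
r = np.array([1, 3, 4, 2])                   # failures found at each inspection
d = np.array([0, 0, 0, 5])                   # units censored at each inspection

def neg_loglik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf
    s = np.exp(-(t / alpha) ** beta)         # survival at inspection times
    p = s[:-1] - s[1:]                       # interval failure probabilities
    return -(np.sum(r * np.log(p)) - np.sum(d * (t[1:] / alpha) ** beta))

res = minimize(neg_loglik, x0=[3.0, 1.5], method="Nelder-Mead")
print(res.x)   # [alpha_hat, beta_hat]
```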
Example 7.10 In Examples 7.6 and 7.9, the lifetimes of the transmission part
were approximated as the exact data. For the purpose of comparison, now we
analyze the lifetimes as the interval data they really are. The data are rearranged
according to the notation in (7.72) and are shown in Table 7.7, where only inspec-
tion intervals that result in failure, censoring, or both are listed. The inspection
times are in 105 cycles.
For each group, the sample log likelihood function is obtained by substitut-
ing the data into (7.72). Then α and β can be estimated by maximizing L(α, β)
through a numerical algorithm. Here Minitab performed the computation and
gave the results in Table 7.8. The Weibull plots, ML fits, and percentile con-
fidence intervals are depicted in Figure 7.16. There the plotted points are the
upper endpoints of the inspection intervals. For comparison, Table 7.8 includes
the graphical estimates from Example 7.6 and those from the ML analysis of the
approximate exact data in Example 7.9. Comparison indicates that:
[Figure 7.16 (Weibull probability plots of the interval data for the before and after groups, with ML fits and percentile confidence intervals) appears here.]
Equating to zero the derivatives of (7.74) with respect to µ and σ gives the
likelihood equations. Solving these equations yields the MLEs
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} t_i,   (7.75)
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(t_i - \hat{\mu})^2.   (7.76)
Note that (7.75) is the usual unbiased estimate of µ, whereas (7.76) is not the
unbiased estimate of σ 2 . The unbiased estimate is
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(t_i - \hat{\mu})^2.   (7.77)
where V̂ar(µ̂) ≈ σ̂ 2 /n and V̂ar(σ̂ ) ≈ σ̂ 2 /2n. The one-sided 100(1 − α)% confi-
dence bound is obtained by replacing z1−α/2 with z1−α and using the appropri-
ate sign.
An approximate confidence interval for the probability of failure F at a particu-
lar time t can be developed using (7.48) and (7.49), where g(µ, σ ) = F (t; µ, σ ).
A more accurate one is
where
[w_L, w_U] = \hat{w} \pm z_{1-\alpha/2}\sqrt{\hat{\mathrm{Var}}(\hat{w})}, \qquad \hat{w} = \frac{t - \hat{\mu}}{\hat{\sigma}},

\hat{\mathrm{Var}}(\hat{w}) = \frac{1}{\hat{\sigma}^2}\left[\hat{\mathrm{Var}}(\hat{\mu}) + \hat{w}^2\,\hat{\mathrm{Var}}(\hat{\sigma}) + 2\hat{w}\,\hat{\mathrm{Cov}}(\hat{\mu},\hat{\sigma})\right],
tˆp = µ̂ + zp σ̂ . (7.84)
where V̂ar(tˆp ) = V̂ar(µ̂) + zp2 V̂ar(σ̂ ) + 2zp Ĉov(µ̂, σ̂ ). Note that (7.85) reduces
to (7.81) when p = 0.5.
The lognormal 100pth percentile and confidence bounds are calculated with
the antilog transformation of (7.84) and (7.85), respectively.
SOLUTION The lognormal distribution adequately fits the life data of supplier
1, as indicated by a lognormal probability plot (not shown here). The next step
is to calculate the ML estimates and confidence intervals. This can be done with
Minitab. Here we do manual computation for illustration purposes. First the log
lifetimes are calculated. Then from (7.75), the estimate of µ is
\hat{\mu}_1 = \frac{1}{15}[\ln(170) + \ln(205) + \cdots + \ln(701)] = 5.806.
\hat{\sigma}_1^2 = \frac{1}{15}\{[\ln(170) - 5.806]^2 + [\ln(205) - 5.806]^2 + \cdots + [\ln(701) - 5.806]^2\} = 0.155,

\hat{\sigma}_1 = 0.394.
Similarly,
[\sigma_{1,L}, \sigma_{1,U}] = 0.394\exp\!\left[\pm\frac{1.6449\sqrt{0.0052}}{0.394}\right] = [0.292, 0.532].
For complete data, Cov(µ̂1 , σ̂1 ) = 0. Then the estimate of the variance of
ŵ = (t − µ̂1 )/σ̂1 is
\hat{\mathrm{Var}}(\hat{w}) = \frac{1}{0.155}\left[0.0103 + (-1.2987)^2 \times 0.0052\right] = 0.123.
For t = ln(200) = 5.298, we have
[w_L, w_U] = -1.2987 \pm 1.6449\sqrt{0.123} = [-1.8756, -0.7218].
Then from (7.83), the confidence interval for the population fraction failing by
200 hours is [F1,L , F1,U ] = [0.030, 0.235] or [3.0%, 23.5%].
The lognormal confidence interval for the median life is
The results above will be compared with those for supplier 2 in Example 7.12.
Example 7.12 Refer to Example 7.11. The life data for supplier 2 are right-
censored exact data. The sample size n = 15, the number of failures r = 10, and
the censoring time is 701 hours. Estimate the life distribution and the median life
for supplier 2. Compare the results with those for supplier 1 in Example 7.11.
Then make a recommendation as to which supplier to choose.
where ri is the number of failures in the ith inspection interval (ti−1 , ti ] and
m is the number of inspection times. Unlike the likelihood function for exact
data, (7.87) does not yield closed-form solutions for the parameter estimates.
They must be found using a numerical method. In practice, commercial reliability
software is preferred for this. If such software is not available, we may create
an Excel spreadsheet and use its Solver feature to do the optimization. Excel is
especially convenient for solving this problem because of its embedded standard
normal distribution. Software or an Excel spreadsheet may be used to deal with
the right-censored interval data, whose sample log likelihood is
L(\mu,\sigma) = \sum_{i=1}^{m} r_i \ln\!\left[\Phi\!\left(\frac{t_i-\mu}{\sigma}\right) - \Phi\!\left(\frac{t_{i-1}-\mu}{\sigma}\right)\right] + \sum_{i=1}^{m} d_i \ln\!\left[1 - \Phi\!\left(\frac{t_i-\mu}{\sigma}\right)\right],   (7.88)
where m, ri , and ti follow the notation in (7.87) and di is the number of units
censored at inspection time ti .
The calculation of confidence intervals for the interval data applies formulas
for the exact data given earlier in this subsection.
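The same maximization can be sketched outside Excel as follows, with hypothetical inspection times, failure counts, and censoring counts; scipy supplies the standard normal cdf and the optimizer.

```python
# Maximizing the right-censored interval-data log likelihood (7.88) numerically.
# Inspection times, failure counts, and censoring counts are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

t = np.array([200.0, 400.0, 600.0, 800.0])   # inspection times t_1..t_m
r = np.array([2, 4, 5, 1])                   # failures detected at each inspection
d = np.array([0, 0, 0, 3])                   # units censored at each inspection
t_prev = np.concatenate(([0.0], t[:-1]))     # t_0, ..., t_{m-1}

def neg_loglik(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    p = norm.cdf((t - mu) / sigma) - norm.cdf((t_prev - mu) / sigma)
    s = 1.0 - norm.cdf((t - mu) / sigma)
    return -(np.sum(r * np.log(p)) + np.sum(d * np.log(s)))

res = minimize(neg_loglik, x0=[500.0, 200.0], method="Nelder-Mead")
print(res.x)   # [mu_hat, sigma_hat]
```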
y = γ0 + γ1 x1 + · · · + γk xk , (7.89)
ln(α) = γ0 + γ1 x, (7.90)
where x = 1/T . If we use the inverse power relationship (7.18) and the lognor-
mal distribution, (7.89) becomes
µ = γ0 + γ1 x, (7.91)
ln(θ ) = γ0 + γ1 x1 + γ2 x2 + γ3 x3 , (7.92)
1. Plot the life data from each test condition on appropriate probability paper,
and estimate the location and scale parameters, which are denoted ŷi and
σ̂i (i = 1, 2, . . . , m), where m is the number of stress levels. This step has
been described in detail in Section 7.5.
2. Substitute ŷi and the value of xi into the linearized relationship (7.89) and
solve the equations for the coefficients using the linear regression method.
Then calculate the estimate of y at the use stress level, say ŷ0 . Alternatively,
ŷ0 may be obtained by plotting ŷi versus the (linearly transformed) stress
level and projecting the straight line to the use level.
3. Calculate the common scale parameter estimate σ̂0 from
\hat{\sigma}_0 = \frac{1}{r}\sum_{i=1}^{m} r_i\hat{\sigma}_i,   (7.93)
where ri is the number of failures at stress level i and r = \sum_{i=1}^{m} r_i. Equa-
tion (7.93) assumes a constant scale parameter, and is an approximate
estimate of the common scale parameter. More accurate, yet complicated
estimates are given in, for example, Nelson (1982).
4. Estimate the quantities of interest at the use stress level using the life
distribution with location parameter ŷ0 and scale parameter σ̂0 .
It should be pointed out that the graphical method above yields approxi-
mate life estimates at the use condition. Whenever possible, the ML method
(Section 7.7.3) should be used to obtain better estimates.
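A compact sketch of steps 2 and 3 is given below with hypothetical per-level estimates and failure counts; a least squares fit of (7.89) gives ŷ0, and (7.93) gives the pooled scale parameter.

```python
# Graphical-method steps 2 and 3: regression of location estimates on the
# transformed stress, projection to the use level, and the pooled scale (7.93).
# All numerical inputs below are hypothetical.
import numpy as np

x = np.array([1/373.15, 1/393.15, 1/423.15])      # transformed stress levels (1/T)
y_hat = np.array([8.59, 8.10, 7.19])              # per-level location estimates, e.g. ln(alpha_i)
sigma_hat = np.array([0.50, 0.41, 0.42])          # per-level scale estimates
r = np.array([8, 7, 10])                          # failures at each level

gamma1, gamma0 = np.polyfit(x, y_hat, 1)          # least squares fit of (7.89)
x_use = 1 / 308.15                                # use stress level (35 C)
y0_hat = gamma0 + gamma1 * x_use                  # location estimate at the use stress
sigma0_hat = np.sum(r * sigma_hat) / r.sum()      # pooled scale estimate, (7.93)
print(y0_hat, sigma0_hat)
```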
Example 7.13 Refer to Example 7.5. Using the Arrhenius relationship, estimate
the B10 life and the reliability at 10,000 hours at a use temperature of 35◦ C.
SOLUTION In Example 7.5, Weibull plots of the life data at each temperature
yielded the estimates α̂1 = 5394 and β̂1 = 2.02 for the 100◦ C group, α̂2 = 3285
and β̂2 = 2.43 for the 120◦ C group, and α̂3 = 1330 and β̂3 = 2.41 for the 150◦ C
group. As shown in (7.90), ln(α) is a linear function of 1/T . Thus, we plot
ln(α̂i ) versus 1/Ti using an Excel spreadsheet and fit a regression line to the
data points. Figure 7.17 shows the plot and the regression equation. The high R²
value indicates that the Arrhenius relationship fits adequately. The estimate of
the characteristic life at 35◦C is obtained by substituting 1/T = 1/308.15 into the
fitted relationship, and pooling the shape parameter estimates as in step 3 gives a
common value of 2.29.
[Figure 7.17 (plot of ln(α̂), i.e., log life, versus 1/T, showing the estimated α values, failure and censoring times, and the fitted line ln(α) = 4452.6(1/T) − 3.299 with R² = 0.9925) appears here.]
FIGURE 7.17 Plot of the fitted Arrhenius relationship for electronic modules
Thus, the Weibull fit at the use temperature of 35◦C has shape parameter 2.29 and
scale parameter 69,542 hours. The B10 life is B̂10 = 69,542 × [−ln(1 − 0.1)]^{1/2.29} =
26,030 hours. The reliability at 10,000 hours is R̂(10,000) = exp[−(10,000/69,542)^{2.29}] =
0.9883; that is, an estimated 1.2% of the population would fail by 10,000 hours.
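The calculation above can be reproduced approximately as follows, using the per-temperature Weibull estimates from Example 7.5 and the common shape parameter 2.29; small differences from the text arise only from rounding.

```python
# Example 7.13 sketch: fit ln(alpha) vs 1/T, project to 35 C, then compute
# the B10 life and the 10,000-hour reliability with the common shape 2.29.
import numpy as np

T = np.array([100.0, 120.0, 150.0]) + 273.15      # test temperatures (K)
alpha = np.array([5394.0, 3285.0, 1330.0])        # Weibull scale estimates

slope, intercept = np.polyfit(1 / T, np.log(alpha), 1)   # ~4452.6 and ~-3.299
alpha_use = np.exp(intercept + slope / 308.15)            # ~69,500 h at 35 C
beta0 = 2.29                                              # common shape parameter

b10 = alpha_use * (-np.log(1 - 0.1)) ** (1 / beta0)       # ~26,000 h
rel = np.exp(-(10000.0 / alpha_use) ** beta0)             # ~0.988
print(slope, intercept, alpha_use, b10, rel)
```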
7.7.3 ML Estimation
In Section 7.6 we described the ML method for estimating the life distributions at
individual test stress levels. The life data obtained at different stress levels were
analyzed separately; each distribution is the best fit to the particular data set.
The inferences from such analyses apply to these stress levels. As we know, the
primary purpose of an ALT is to estimate the life distribution at a use condition.
To accomplish this, in Section 7.7.1 we assumed a (transformed) location-scale
distribution with a common scale parameter value and an acceleration relationship
between the location parameter and the stress level. This model is fitted to the life
data at all stress levels simultaneously, and estimates at the use condition are
obtained as follows:
Weibull distribution, the sample log likelihood functions for the low and
high stress levels are given by (7.68) and (7.59), respectively. Then the
total sample log likelihood function is
L(\alpha_1,\alpha_2,\beta) = \sum_{i=1}^{r}\left[\ln(\beta) - \beta\ln(\alpha_1) + (\beta-1)\ln(t_{1i}) - \left(\frac{t_{1i}}{\alpha_1}\right)^{\beta}\right] - \sum_{i=r+1}^{n_1}\left(\frac{t_{1i}}{\alpha_1}\right)^{\beta} + \sum_{i=1}^{n_2}\left[\ln(\beta) - \beta\ln(\alpha_2) + (\beta-1)\ln(t_{2i}) - \left(\frac{t_{2i}}{\alpha_2}\right)^{\beta}\right],   (7.94)
where the subscripts 1 and 2 denote the low and high stress levels, respec-
tively.
4. Substitute an appropriate acceleration relationship into the total log like-
lihood function. In the example in step 3, if the Arrhenius relationship is
used, substituting (7.90) into (7.94) gives
L(\gamma_0,\gamma_1,\beta) = \sum_{i=1}^{r}\left[\ln(\beta) - \beta(\gamma_0+\gamma_1 x_1) + (\beta-1)\ln(t_{1i}) - \left(\frac{t_{1i}}{e^{\gamma_0+\gamma_1 x_1}}\right)^{\beta}\right] - \sum_{i=r+1}^{n_1}\left(\frac{t_{1i}}{e^{\gamma_0+\gamma_1 x_1}}\right)^{\beta} + \sum_{i=1}^{n_2}\left[\ln(\beta) - \beta(\gamma_0+\gamma_1 x_2) + (\beta-1)\ln(t_{2i}) - \left(\frac{t_{2i}}{e^{\gamma_0+\gamma_1 x_2}}\right)^{\beta}\right],   (7.95)
where x1 and x2 denote the transformed low and high temperatures, respec-
tively.
5. Estimate the model parameters [e.g., γ0 , γ1 , and β in (7.95)] by maximiz-
ing the total log likelihood function directly through a numerical method.
Also, the estimates may be obtained by iteratively solving the likelihood
equations; however, this approach is usually more difficult. In the example,
this step yields the estimates γ̂0, γ̂1, and β̂ (a numerical sketch of this step is given after the list).
6. Calculate the variance–covariance matrix for the model parameters using
the total log likelihood function and the local estimate of Fisher information
matrix described in Section 7.6.2. In the example, this step gives
\hat{\Sigma} = \begin{bmatrix} \hat{\mathrm{Var}}(\hat{\gamma}_0) & \hat{\mathrm{Cov}}(\hat{\gamma}_0,\hat{\gamma}_1) & \hat{\mathrm{Cov}}(\hat{\gamma}_0,\hat{\beta}) \\ & \hat{\mathrm{Var}}(\hat{\gamma}_1) & \hat{\mathrm{Cov}}(\hat{\gamma}_1,\hat{\beta}) \\ \text{symmetric} & & \hat{\mathrm{Var}}(\hat{\beta}) \end{bmatrix}.
7. Calculate the life distribution estimate at the use stress level. The location
parameter estimate of the distribution is calculated from the acceleration
relationship. In the example, the Weibull characteristic life at the use
condition is α̂0 = exp(γ̂0 + γ̂1 x0), where x0 = 1/T0 and T0 is the use tem-
perature. The Weibull shape parameter estimate is β̂. Having α̂0 and β̂, we
can estimate the quantities of interest, such as reliability and percentiles.
8. Estimate the variance for the location parameter estimate at the use condi-
tion and the covariance for the location and scale parameter estimates. The
variance is obtained from (7.48) and the acceleration relationship. For the
lognormal distribution, the covariance at a given stress level is
\hat{\mathrm{Cov}}(\hat{\mu},\hat{\sigma}) = \hat{\mathrm{Cov}}(\hat{\gamma}_0,\hat{\sigma}) + \sum_{i=1}^{k} x_i\,\hat{\mathrm{Cov}}(\hat{\gamma}_i,\hat{\sigma}),   (7.96)
where k and xi are the same as those in (7.89). Substituting the use stress
levels x10 , x20 , . . . , xk0 into (7.96) results in the covariance of the estimate
of the scale parameter and that of the location parameter at the use stress
levels. Similarly, for the Weibull distribution,
\hat{\mathrm{Cov}}(\hat{\alpha},\hat{\beta}) = \hat{\alpha}\left[\hat{\mathrm{Cov}}(\hat{\gamma}_0,\hat{\beta}) + \sum_{i=1}^{k} x_i\,\hat{\mathrm{Cov}}(\hat{\gamma}_i,\hat{\beta})\right].   (7.97)
In the example above with a single stress, the covariance of α̂0 and β̂
from (7.97) is
9. Calculate the confidence intervals for the quantities estimated earlier. This
is done by substituting the variance and covariance estimates of the model
parameters at the use condition into the confidence intervals for an individ-
ual test condition presented in Section 7.6. In the example, the confidence
interval for the probability of failure at a given time and the use condition
is obtained by substituting V̂ar(α̂0 ), V̂ar(β̂), and Ĉov(α̂0 , β̂) into (7.66).
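As an illustration of step 5, the sketch below maximizes a total log likelihood of the form (7.95) for one accelerating variable; the failure and censoring data at the two temperatures are hypothetical, and the optimizer is one possible numerical method.

```python
# Joint ML estimation for an Arrhenius-Weibull model, maximizing a total log
# likelihood of the form (7.95). The data at the two temperatures are hypothetical.
import numpy as np
from scipy.optimize import minimize

T1, T2 = 100.0 + 273.15, 150.0 + 273.15
x1, x2 = 1 / T1, 1 / T2                                   # transformed stress levels
fail1 = np.array([2100.0, 3300.0, 4100.0, 5000.0])        # failures at the low level
cens1 = np.array([5500.0, 5500.0])                        # right-censored at the low level
fail2 = np.array([600.0, 850.0, 1000.0, 1150.0, 1300.0])  # complete data at the high level

def neg_loglik(params):
    g0, g1, beta = params
    if beta <= 0:
        return np.inf
    a1, a2 = np.exp(g0 + g1 * x1), np.exp(g0 + g1 * x2)   # characteristic lives
    ll = np.sum(np.log(beta) - beta * np.log(a1) + (beta - 1) * np.log(fail1)
                - (fail1 / a1) ** beta)
    ll -= np.sum((cens1 / a1) ** beta)                    # censored contribution
    ll += np.sum(np.log(beta) - beta * np.log(a2) + (beta - 1) * np.log(fail2)
                 - (fail2 / a2) ** beta)
    return -ll

res = minimize(neg_loglik, x0=[-3.0, 4500.0, 2.0], method="Nelder-Mead")
print(res.x)   # [gamma0_hat, gamma1_hat, beta_hat]
```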
SOLUTION The graphical analysis in Example 7.5 shows that the Weibull
distribution fits the life data adequately at each temperature. The Weibull fits
to the three data sets were plotted in Figure 7.13, which suggests that a constant
shape parameter is reasonable although the line for 100◦ C is not quite parallel to
other two. Alternatively, in Figure 7.18, we plot the Weibull fits to the three data
sets with a common shape parameter, which was calculated in Example 7.13 as
β̂0 = 2.29. Figure 7.18 shows that a common shape parameter is reasonable.
The groups at 100 and 120◦ C have right-censored exact data, whereas that at
150◦ C is complete. If the life–temperature relationship is modeled with (7.90),
the total sample log likelihood function is
L(\gamma_0,\gamma_1,\beta) = \sum_{i=1}^{8}\left[\ln(\beta) - \beta(\gamma_0+\gamma_1 x_1) + (\beta-1)\ln(t_{1i}) - \left(\frac{t_{1i}}{e^{\gamma_0+\gamma_1 x_1}}\right)^{\beta}\right] - 4\left(\frac{5500}{e^{\gamma_0+\gamma_1 x_1}}\right)^{\beta}
+ \sum_{i=1}^{7}\left[\ln(\beta) - \beta(\gamma_0+\gamma_1 x_2) + (\beta-1)\ln(t_{2i}) - \left(\frac{t_{2i}}{e^{\gamma_0+\gamma_1 x_2}}\right)^{\beta}\right] - \left(\frac{4500}{e^{\gamma_0+\gamma_1 x_2}}\right)^{\beta}
+ \sum_{i=1}^{10}\left[\ln(\beta) - \beta(\gamma_0+\gamma_1 x_3) + (\beta-1)\ln(t_{3i}) - \left(\frac{t_{3i}}{e^{\gamma_0+\gamma_1 x_3}}\right)^{\beta}\right],
Example 7.15 In this example we analyze an ALT with two accelerating stresses
using the ML method. G. Yang and Zaghati (2006) present a case on reliabil-
ity demonstration of a type of 18-V compact electromagnetic relays through
ALT. The relays would be installed in a system and operate at 5 cycles per
minute and 30◦ C. The system design specifications required the relays to have
a lower 90% confidence bound for reliability above 99% at 200,000 cycles.
A sample of 120 units was divided into four groups, each tested at a higher-
than-use temperature and switching rate. In testing, the normal closed and open
contacts of the relays were both loaded with 2 A of resistive load. The max-
imum allowable temperature and switching rate of the relays are 125◦ C and
30 cycles per minute, respectively. The increase in switching rate reduces the
cycles to failure for this type of relay, due to the shorter time for heat dissipa-
tion and more arcing. Its effect can be described by the life–usage model (7.21).
The effect of temperature on cycles to failure is modeled with the Arrhenius
relationship. This ALT involved two accelerating variables. Table 7.9 shows the
Yang compromise test plan, which is developed in the next section. The test
plan specifies the censoring times, while the censoring cycles are the censor-
ing times multiplied by the switching rates. The numbers of cycles to failure
are summarized in Table 7.10. Estimate the life distribution at the use tempera-
ture and switching rate, and verify that the component meets the system design
specification.
TABLE 7.9 Compromise Test Plan for the Relays
Group   Temperature (◦C)   Switching Rate (cycles/min)   Number of Test Units   Censoring Time (h)   Censoring Cycles (10³)
1   64   10   73   480   288
2   64   30   12   480   864
3   125   10   12   96   57.6
4   125   30   23   96   172.8
SOLUTION We first graphically analyze the life data of the four groups. Groups
1 and 3 are right censored, and 2 and 4 are complete. Probability plots for
individual groups indicate that the Weibull distribution is adequate for all groups,
and a constant shape parameter is reasonable, as shown in Figure 7.19.
The graphical analysis should be followed by the maximum likelihood method.
The total sample log likelihood function is not given here but can be worked
out by summing those for the individual groups. The acceleration relationship
combines the Arrhenius relationship and the life–usage model: namely,
\alpha(f, T) = A f^{B}\exp\!\left(\frac{E_a}{kT}\right),   (7.98)
where α(f, T ) is the Weibull characteristic life and the other notation is the same
as in (7.5) and (7.21). Linearizing (7.98) gives
ln[α(x1 , x2 )] = γ0 + γ1 x1 + γ2 x2 , (7.99)
[Figure 7.19 (Weibull probability plots of the four relay test groups, indicating an approximately constant shape parameter) appears here.]
Instead, we use Minitab to do ML estimation and get γ̂0 = 0.671, γ̂1 = 4640.1,
γ̂2 = −0.445, and β̂ = 1.805.
The Weibull fits to each group with the common β̂ = 1.805 are plotted in
Figure 7.20. The Weibull characteristic life at use temperature (30◦ C) and the
usual switching rate (5 cycles per minute) is estimated by
\hat{\alpha}_0 = \exp\!\left[0.671 + \frac{4640.1}{303.15} - 0.445\ln(5)\right] = 4.244 \times 10^6 \text{ cycles}.
The reliability estimate at 200,000 cycles is
\hat{R}(200{,}000) = \exp\!\left[-\left(\frac{200{,}000}{4.244 \times 10^6}\right)^{1.805}\right] = 0.996.
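These point estimates can be reproduced as sketched below from the ML estimates quoted above; the activation-energy conversion assumes the parameterization of (7.99) with x1 = 1/T, so that Ea = γ1 k with k = 8.617 × 10⁻⁵ eV/K.

```python
# Reproducing the relay estimates at the use condition (30 C, 5 cycles/min).
import numpy as np

g0, g1, g2, beta = 0.671, 4640.1, -0.445, 1.805   # Minitab ML estimates quoted above
T_use = 30.0 + 273.15                             # use temperature (K)
f_use = 5.0                                       # use switching rate (cycles/min)

alpha0 = np.exp(g0 + g1 / T_use + g2 * np.log(f_use))   # ~4.24e6 cycles
rel = np.exp(-(200000.0 / alpha0) ** beta)               # ~0.996
Ea = g1 * 8.617e-5                                       # ~0.40 eV, assuming x1 = 1/T in (7.99)
print(alpha0, rel, Ea)
```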
The two-sided 90% confidence interval for Ea is [0.347, 0.452]. The estimate
of the switching rate effect parameter is B̂ = γ̂2 = −0.445. The two-sided 90%
[Figure 7.20 (Weibull fits to the four relay test groups with the common shape parameter estimate β̂ = 1.805) appears here.]
In this section we focus on compromise test plans which optimize the val-
ues of test variables to minimize the asymptotic variance of the estimate of
a life percentile at the use stress level. Other test plans may be found in, for
example, Nelson (1990, 2004) and Meeker and Escobar (1998). Nelson (2005)
provides a nearly complete bibliography of the accelerated life (and degradation)
test plans, which contains 159 publications.
It should be as high as possible to yield more failures and reduce the vari-
ance of the estimate. However, the high stress level should not induce failure
modes different from those at the use stress. Because only two stress levels are
used, the test plans are sensitive to the misspecification of life distribution and
preestimates of model parameters, which are required for calculating the plans.
In other words, incorrect choice of the life distribution and preestimates may
greatly compromise the optimality of the plans and result in a poor estimate.
The use of only two stress levels does not allow checking the linearity of the
assumed relationship and does not yield estimates of the relationship parame-
ters when there are no failures at the low stress level. Hence, these plans are
often not practically useful. However, they have a minimum variance, which is
a benchmark for other test plans. For example, the compromise test plans below
are often compared with the statistically optimum plans to evaluate the loss of
accuracy for the robustness gained. The theory for optimum plans is described
in Nelson and Kielpinski (1976) and Nelson and Meeker (1978). Kielpinski and
Nelson (1975) and Meeker and Nelson (1975) provide the charts necessary for
calculating particular plans. Nelson (1990, 2004) summarizes the theory and the
charts. Meeker and Escobar (1995, 1998) describe optimum test plans with two
or more accelerating variables.
5. Compromise plans. When a single accelerating variable is involved, such
plans use three or more stress levels. The high level must be specified. The
middle stress levels are often equally spaced between the low and high levels,
but unequal spacing may be used. The low stress level and its number of units
are optimized. The number allocated to a middle level may be specified as a fixed
percentage of the total sample size or of the number of units at a low or high
stress level. In the latter situation, the number is a variable. There are various
optimization criteria, which include minimization of the asymptotic variance of
the estimate of a life percentile at a use condition (the most commonly used
criterion), the total test time, and others (Nelson, 2005). Important compromise
plans with one accelerating variable are given in Meeker (1984), Meeker and
Hahn (1985), G. Yang (1994), G. Yang and Jin (1994), Tang and Yang (2002),
and Tang and Xu (2005). When two accelerating variables are involved, G. Yang
and Yang (2002), and G. Yang (2005) give factorial compromise plans, which
use four test conditions. Meeker and Escobar (1995, 1998) describe the 20%
compromise plans, which employ five test conditions. In this section we focus on
Yang’s practical compromise test plans for Weibull and lognormal distributions
with one or two accelerating variables.
TABLE 7.11 Compromise Test Plans for a Weibull Distribution with One
Accelerating Variable
No. a1 a2 a3 b π1 ξ1 V No. a1 a2 a3 b π1 ξ1 V
1 0 0 0 4 0.564 0.560 85.3 46 4 4 2 5 0.909 0.888 3.2
2 1 0 0 4 0.720 0.565 29.5 47 3 3 3 5 0.788 0.717 7.3
3 2 0 0 4 0.795 0.718 10.0 48 4 3 3 5 0.909 0.887 3.2
4 3 0 0 4 0.924 0.915 3.6 49 4 4 3 5 0.909 0.888 3.2
5 1 1 0 4 0.665 0.597 27.9 50 4 4 4 5 0.909 0.888 3.2
6 2 1 0 4 0.788 0.721 10.0 51 0 0 0 6 0.527 0.391 205.6
7 3 1 0 4 0.923 0.915 3.6 52 1 0 0 6 0.652 0.386 78.8
8 2 2 0 4 0.767 0.739 9.7 53 2 0 0 6 0.683 0.483 30.6
9 3 2 0 4 0.921 0.916 3.6 54 3 0 0 6 0.725 0.609 13.3
10 3 3 0 4 0.918 0.919 3.6 55 1 1 0 6 0.587 0.417 73.8
11 1 1 1 4 0.661 0.608 27.8 56 2 1 0 6 0.672 0.487 30.4
12 2 1 1 4 0.795 0.704 9.9 57 3 1 0 6 0.724 0.609 13.3
13 3 1 1 4 0.923 0.901 3.6 58 2 2 0 6 0.639 0.511 29.3
14 2 2 1 4 0.773 0.720 9.6 59 3 2 0 6 0.716 0.614 13.2
15 3 2 1 4 0.920 0.902 3.6 60 3 3 0 6 0.707 0.626 13.1
16 3 3 1 4 0.917 0.904 3.6 61 1 1 1 6 0.586 0.422 73.5
17 2 2 2 4 0.774 0.720 9.6 62 2 1 1 6 0.683 0.469 29.8
18 3 2 2 4 0.920 0.900 3.6 63 3 1 1 6 0.735 0.586 12.7
19 3 3 2 4 0.917 0.902 3.5 64 4 1 1 6 0.805 0.729 6.0
20 3 3 3 4 0.917 0.902 3.5 65 2 2 1 6 0.648 0.490 28.7
21 0 0 0 5 0.542 0.461 138.8 66 3 2 1 6 0.727 0.590 12.7
22 1 0 0 5 0.678 0.459 51.2 67 4 2 1 6 0.803 0.729 6.0
23 2 0 0 5 0.724 0.577 18.9 68 3 3 1 6 0.717 0.600 12.5
24 3 0 0 5 0.793 0.731 7.7 69 4 3 1 6 0.800 0.732 5.9
25 1 1 0 5 0.617 0.491 48.1 70 4 4 1 6 0.798 0.734 5.9
26 2 1 0 5 0.714 0.581 18.8 71 2 2 2 6 0.649 0.490 28.7
27 3 1 0 5 0.792 0.732 7.7 72 3 2 2 6 0.728 0.586 12.6
28 2 2 0 5 0.686 0.605 18.2 73 4 2 2 6 0.804 0.725 5.9
29 3 2 0 5 0.786 0.735 7.7 74 5 2 2 6 0.907 0.885 3.0
30 3 3 0 5 0.780 0.744 7.6 75 3 3 2 6 0.718 0.596 12.5
31 1 1 1 5 0.614 0.499 48.0 76 4 3 2 6 0.800 0.728 5.9
32 2 1 1 5 0.724 0.563 18.5 77 5 3 2 6 0.906 0.885 3.0
33 3 1 1 5 0.800 0.709 7.4 78 4 4 2 6 0.799 0.730 5.9
34 4 1 1 5 0.912 0.889 3.3 79 5 4 2 6 0.906 0.885 3.0
35 2 2 1 5 0.694 0.583 17.9 80 5 5 2 6 0.906 0.885 3.0
36 3 2 1 5 0.794 0.713 7.4 81 3 3 3 6 0.718 0.596 12.5
37 4 2 1 5 0.911 0.889 3.3 82 4 3 3 6 0.800 0.728 5.9
38 3 3 1 5 0.787 0.721 7.3 83 5 3 3 6 0.906 0.885 3.0
39 4 3 1 5 0.909 0.890 3.2 84 4 4 3 6 0.799 0.730 5.9
40 4 4 1 5 0.909 0.891 3.2 85 5 4 3 6 0.906 0.885 3.0
41 2 2 2 5 0.695 0.583 17.9 86 5 5 3 6 0.906 0.885 3.0
42 3 2 2 5 0.795 0.709 7.4 87 4 4 4 6 0.799 0.730 5.9
43 4 2 2 5 0.911 0.886 3.2 88 5 4 4 6 0.906 0.885 3.0
44 3 3 2 5 0.788 0.717 7.3 89 5 5 4 6 0.906 0.885 3.0
45 4 3 2 5 0.909 0.887 3.2 90 5 5 5 6 0.906 0.885 3.0
Then we have
a_1 = \frac{\ln(1080) - 5.65}{0.67} = 1.99, \qquad a_2 = 1.11, \qquad a_3 = 0.43, \qquad b = \frac{9.69 - 5.65}{0.67} = 6.03.
TABLE 7.12 Actual Compromise Test Plan for the Electronic Module
Group   Temperature (◦C)   Number of Test Units   Censoring Time (h)
1   74   34   1080
2   89   5   600
3   105   11   380
TABLE 7.13 Compromise Test Plans for a Lognormal Distribution with One
Accelerating Variable
No. a1 a2 a3 b π1 ξ1 V No. a1 a2 a3 b π1 ξ1 V
1 0 0 0 4 0.476 0.418 88.9 46 4 4 2 5 0.819 0.835 3.1
2 1 0 0 4 0.627 0.479 27.8 47 3 3 3 5 0.700 0.694 6.7
3 2 0 0 4 0.716 0.635 9.6 48 4 3 3 5 0.822 0.833 3.2
4 3 0 0 4 0.827 0.815 3.7 49 4 4 3 5 0.819 0.836 3.1
5 1 1 0 4 0.561 0.515 25.7 50 4 4 4 5 0.819 0.836 3.1
6 2 1 0 4 0.700 0.640 9.4 51 0 0 0 6 0.470 0.287 202.8
7 3 1 0 4 0.827 0.815 3.7 52 1 0 0 6 0.582 0.335 67.4
8 2 2 0 4 0.667 0.662 9.0 53 2 0 0 6 0.631 0.444 25.6
9 3 2 0 4 0.814 0.819 3.7 54 3 0 0 6 0.679 0.570 11.4
10 3 3 0 4 0.803 0.826 3.6 55 1 1 0 6 0.521 0.363 62.3
11 1 1 1 4 0.554 0.545 24.9 56 2 1 0 6 0.614 0.449 25.3
12 2 1 1 4 0.691 0.650 9.4 57 3 1 0 6 0.678 0.570 11.3
13 3 1 1 4 0.825 0.817 3.7 58 2 2 0 6 0.580 0.470 24.1
14 2 2 1 4 0.660 0.673 9.0 59 3 2 0 6 0.661 0.578 11.2
15 3 2 1 4 0.812 0.821 3.7 60 3 3 0 6 0.647 0.589 10.9
16 3 3 1 4 0.801 0.829 3.6 61 1 1 1 6 0.521 0.377 61.0
17 2 2 2 4 0.656 0.682 8.9 62 2 1 1 6 0.609 0.455 25.2
18 3 2 2 4 0.809 0.825 3.6 63 3 1 1 6 0.679 0.570 11.3
19 3 3 2 4 0.798 0.832 3.6 64 4 1 1 6 0.752 0.701 5.5
20 3 3 3 4 0.797 0.833 3.6 65 2 2 1 6 0.579 0.474 24.0
21 0 0 0 5 0.472 0.340 139.9 66 3 2 1 6 0.662 0.577 11.1
22 1 0 0 5 0.599 0.395 45.4 67 4 2 1 6 0.746 0.702 5.5
23 2 0 0 5 0.662 0.523 16.6 68 3 3 1 6 0.649 0.588 10.9
24 3 0 0 5 0.732 0.671 7.0 69 4 3 1 6 0.736 0.708 5.4
25 1 1 0 5 0.536 0.426 42.0 70 4 4 1 6 0.731 0.712 5.4
26 2 1 0 5 0.646 0.528 16.4 71 2 2 2 6 0.578 0.479 23.8
27 3 1 0 5 0.731 0.672 7.0 72 3 2 2 6 0.660 0.580 11.1
28 2 2 0 5 0.612 0.550 15.6 73 4 2 2 6 0.745 0.704 5.5
29 3 2 0 5 0.715 0.678 6.9 74 5 2 2 6 0.845 0.838 2.8
30 3 3 0 5 0.702 0.689 6.7 75 3 3 2 6 0.647 0.591 10.9
31 1 1 1 5 0.534 0.446 40.9 76 4 3 2 6 0.734 0.709 5.4
32 2 1 1 5 0.639 0.536 16.3 77 5 3 2 6 0.839 0.840 2.8
33 3 1 1 5 0.731 0.673 7.0 78 4 4 2 6 0.730 0.713 5.4
34 4 1 1 5 0.837 0.826 3.2 79 5 4 2 6 0.835 0.842 2.8
35 2 2 1 5 0.608 0.557 15.6 80 5 5 2 6 0.834 0.843 2.8
36 3 2 1 5 0.715 0.679 6.9 81 3 3 3 6 0.647 0.591 10.9
37 4 2 1 5 0.832 0.827 3.2 82 4 3 3 6 0.734 0.710 5.4
38 3 3 1 5 0.702 0.689 6.7 83 5 3 3 6 0.839 0.840 2.8
39 4 3 1 5 0.824 0.831 3.2 84 4 4 3 6 0.730 0.714 5.4
40 4 4 1 5 0.821 0.834 3.1 85 5 4 3 6 0.835 0.842 2.8
41 2 2 2 5 0.607 0.564 15.4 86 5 5 3 6 0.834 0.843 2.8
42 3 2 2 5 0.712 0.683 6.8 87 4 4 4 6 0.730 0.714 5.4
43 4 2 2 5 0.831 0.829 3.2 88 5 4 4 6 0.835 0.842 2.8
44 3 3 2 5 0.700 0.693 6.7 89 5 5 4 6 0.834 0.843 2.8
45 4 3 2 5 0.823 0.833 3.2 90 5 5 5 6 0.834 0.843 2.8
Example 7.17 Refer to Example 7.16. Suppose that the life distribution of the
electronic module is mistakenly modeled with the lognormal, and other data
are the same as those in Example 7.16. Determine the test plan. Comment on
the sensitivity of the Yang compromise test plan to the incorrect choice of life
distribution.
SOLUTION For the preestimates in Example 7.16, Table 7.13 and linear inter-
polation yield the optimal values π1 = 0.612, ξ1 = 0.451, and V = 25.2. Then
π2 = (1 − 0.612)/3 = 0.129, ξ2 = 0.451/2 = 0.23, π3 = 1 − 0.612 − 0.129 =
0.259, and ξ3 = 0. As in Example 7.16, the standardized test plan can easily
be transformed to the actual plan.
The test plan above, based on the incorrect lognormal distribution, is evaluated
with the actual Weibull distribution. In other words, the standardized variance V0
for the Weibull test plan is calculated for π1 = 0.612 and ξ1 = 0.451 obtained
above. Then we have V0 = 31.1. The variance increase ratio is 100 × (31.1 −
30.1)/30.1 = 3.3%, where 30.1 is the standardized variance in Example 7.16.
The small increase indicates that the Yang compromise test plan is not sensitive
to the misspecification of life distribution for the given preestimates.
The Yang compromise test plans use rectangular test points, as shown in
Figure 7.21. The test plans are full factorial designs, where each accelerating
variable is a two-level factor. Such test plans are intuitively appealing and
[Figure 7.21 (rectangular arrangement of the four test points at the low and high levels of stress 1 and stress 2) appears here.]
aij = [ln(ηij) − µ22]/σ is the standardized censoring time, i = 1, 2; j = 1, 2,
b = (µ00 − µ20 )/σ,
c = (µ00 − µ02 )/σ.
Similar to the single accelerating variable case, the asymptotic variance of the
MLE of the mean, denoted x̂0.43 , at the use stresses (ξ10 = 1, ξ20 = 1) is given
by
\mathrm{Var}[\hat{x}_{0.43}(1,1)] = \frac{\sigma^2}{n}V,   (7.106)
where V is called the standardized variance. V is a function of aij , b, c, ξ1i , ξ2j ,
and πij (i = 1, 2; j = 1, 2), and independent of n and σ . The calculation of V
is given in, for example, Meeker and Escobar (1995), and G. Yang (2005). As
specified above, ξ12 = 0, ξ22 = 0, and π12 = π21 = 0.1. Given the preestimates
of aij , b, and c, the test plans choose the optimum values of ξ11 , ξ21 , and π11
by minimizing Var[x̂0.43 (1, 1)]. Because n and σ in (7.106) are constant, the
optimization model can be written as
\min_{\xi_{11},\,\xi_{21},\,\pi_{11}} V,   (7.107)
subject to ξ12 = 0, ξ22 = 0, π12 = π21 = 0.1, π22 = 1 − π11 − π12 − π21 , 0 ≤
ξ11 ≤ 1, 0 ≤ ξ21 ≤ 1, and 0 ≤ π11 ≤ 1.
Because x = ln(t), minimizing Var[x̂0.43 (1, 1)] is equivalent to minimizing the
asymptotic variance of the MLE of the mean log life of the Weibull distribution
at the use stresses.
The test plans contain six prespecified values (a11 , a12 , a21 , a22 , b, and c). To
tabulate the test plans in a manageable manner, we consider only two different
censoring times, one for the southwest and northwest points and the other for
the southeast and northeast points. Therefore, a11 = a12 ≡ a1 and a21 = a22 ≡
a2 . This special case is realistic and often encountered in practice, because as
explained earlier, two groups may be tested concurrently on the same equipment
and are subjected to censoring at the same time. When a1 = a2 , all test points
have a common censoring time. Table 7.14 presents the values of ξ11 , ξ21 , π11 ,
and V for various sets (a1 , a2 , b, c). To find a plan from the table, one looks up
the value of c first, then b, a2 , and a1 in order. Linear interpolation may be needed
for a combination (a1 , a2 , b, c) not given in the table, but extrapolation outside
the table is not valid. For a combination (a1 , a2 , b, c) outside the table, numerical
calculation of the optimization model is necessary. The Excel spreadsheet for the
calculation is available from the author. After obtaining the standardized values,
we convert them to the transformed stress levels and sample allocations by using
S1i = S12 + ξ1i (S10 − S12 ), S2j = S22 + ξ2j (S20 − S22 ), nij = πij n.
(7.108)
Then S1i and S2j are further transformed back to the actual stress levels.
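A minimal sketch of (7.108) follows; the standardized values, sample size, and stress limits are hypothetical, with stress 1 transformed as 1/T and stress 2 as a log usage rate.

```python
# Converting a standardized plan (xi, pi) to transformed stress levels and
# sample allocations per (7.108). All numerical inputs are hypothetical.
import numpy as np

n = 100                                  # total sample size
S10, S12 = 1 / 303.15, 1 / 398.15        # use and maximum levels of stress 1 (1/T)
S20, S22 = np.log(5.0), np.log(30.0)     # use and maximum levels of stress 2 (ln f)

xi1 = {1: 0.60, 2: 0.0}                  # standardized levels of stress 1
xi2 = {1: 0.63, 2: 0.0}                  # standardized levels of stress 2
pi = {(1, 1): 0.60, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.2}   # allocations

for (i, j), p in pi.items():
    S1 = S12 + xi1[i] * (S10 - S12)      # transformed level of stress 1
    S2 = S22 + xi2[j] * (S20 - S22)      # transformed level of stress 2
    # back-transform to deg C and cycles/min, and allocate units
    print(i, j, 1 / S1 - 273.15, np.exp(S2), round(p * n))
```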
TABLE 7.14 Compromise Test Plans for a Weibull Distribution with Two
Accelerating Variables
No. a1 a2 b c π11 ξ11 ξ21 V No. a1 a2 b c π11 ξ11 ξ21 V
1 2 2 2 2 0.637 0.769 0.769 10.3 46 4 3 3 4 0.605 0.635 0.610 10.4
2 3 2 2 2 0.730 0.912 0.934 4.2 47 5 3 3 4 0.658 0.754 0.732 5.7
3 2 2 3 2 0.584 0.597 0.636 18.8 48 2 2 4 4 0.511 0.389 0.389 59.8
4 3 2 3 2 0.644 0.706 0.770 8.1 49 3 2 4 4 0.545 0.441 0.455 29.0
5 4 2 3 2 0.724 0.868 0.935 3.8 50 4 2 4 4 0.579 0.529 0.545 15.1
6 2 2 4 2 0.555 0.486 0.543 29.8 51 5 3 4 4 0.614 0.642 0.648 8.4
7 3 2 4 2 0.599 0.577 0.660 13.3 52 4 4 4 4 0.572 0.547 0.547 14.9
8 4 2 4 2 0.654 0.706 0.798 6.5 53 5 4 4 4 0.612 0.648 0.648 8.4
9 2 2 5 2 0.538 0.409 0.474 43.2 54 6 4 4 4 0.661 0.761 0.762 5.0
10 3 2 5 2 0.570 0.488 0.579 19.8 55 2 2 5 4 0.500 0.339 0.353 78.6
11 4 2 5 2 0.611 0.596 0.700 10.0 56 3 2 5 4 0.529 0.387 0.414 38.4
12 2 2 2 3 0.584 0.636 0.597 18.8 57 4 2 5 4 0.556 0.464 0.494 20.4
13 3 2 2 3 0.648 0.731 0.710 8.3 58 3 3 5 4 0.524 0.401 0.419 37.9
14 4 2 2 3 0.729 0.893 0.871 3.9 59 4 3 5 4 0.552 0.472 0.494 20.3
15 2 2 3 3 0.548 0.516 0.516 29.9 60 5 3 5 4 0.584 0.561 0.585 11.6
16 3 2 3 3 0.596 0.593 0.610 13.8 61 5 4 5 4 0.583 0.563 0.585 11.6
17 4 2 3 3 0.651 0.718 0.737 6.8 62 6 4 5 4 0.622 0.661 0.685 7.1
18 3 3 3 3 0.592 0.618 0.618 13.5 63 2 2 6 4 0.493 0.300 0.322 99.9
19 4 3 3 3 0.647 0.732 0.737 6.7 64 3 2 6 4 0.517 0.345 0.381 49.1
20 5 3 3 3 0.719 0.878 0.883 3.6 65 4 2 6 4 0.540 0.414 0.454 26.4
21 2 2 4 3 0.528 0.432 0.454 43.6 66 3 3 6 4 0.513 0.354 0.385 48.6
22 3 2 4 3 0.565 0.500 0.539 20.5 67 4 3 6 4 0.537 0.419 0.454 26.3
23 4 2 4 3 0.606 0.603 0.646 10.4 68 5 3 6 4 0.563 0.498 0.535 15.3
24 3 3 4 3 0.562 0.514 0.544 20.2 69 5 4 6 4 0.563 0.499 0.535 15.3
25 4 3 4 3 0.603 0.612 0.646 10.4 70 6 4 6 4 0.594 0.586 0.625 9.4
26 5 3 4 3 0.655 0.731 0.767 5.7 71 2 2 7 4 0.488 0.269 0.297 123.7
27 2 2 5 3 0.515 0.371 0.404 59.7 72 3 2 7 4 0.508 0.311 0.353 61.0
28 3 2 5 3 0.544 0.431 0.483 28.4 73 4 2 7 4 0.527 0.374 0.421 33.0
29 4 2 5 3 0.577 0.521 0.578 14.8 74 3 3 7 4 0.505 0.317 0.355 60.5
30 3 3 5 3 0.542 0.440 0.486 28.2 75 4 3 7 4 0.525 0.378 0.421 32.9
31 4 3 5 3 0.575 0.526 0.578 14.7 76 5 3 7 4 0.547 0.448 0.495 19.4
32 5 3 5 3 0.615 0.628 0.684 8.3 77 4 4 7 4 0.524 0.378 0.421 32.9
33 2 2 6 3 0.507 0.324 0.365 78.3 78 5 4 7 4 0.547 0.449 0.495 19.4
34 3 2 6 3 0.530 0.380 0.439 37.5 79 6 4 7 4 0.574 0.527 0.577 12.1
35 4 2 6 3 0.556 0.459 0.526 19.8 80 2 2 2 5 0.538 0.474 0.409 43.2
36 3 3 6 3 0.528 0.385 0.440 37.4 81 3 2 2 5 0.581 0.533 0.484 20.7
37 4 3 6 3 0.555 0.464 0.528 19.8 82 4 2 2 5 0.625 0.641 0.592 10.4
38 5 3 6 3 0.587 0.551 0.620 11.3 83 2 2 3 5 0.515 0.404 0.371 59.7
39 2 2 2 4 0.555 0.543 0.486 29.8 84 3 2 3 5 0.552 0.453 0.432 29.2
40 3 2 2 4 0.606 0.615 0.575 13.8 85 4 2 3 5 0.586 0.543 0.523 15.2
41 4 2 2 4 0.663 0.744 0.704 6.8 86 3 3 3 5 0.542 0.486 0.440 28.2
42 2 2 3 4 0.528 0.454 0.432 43.6 87 4 3 3 5 0.578 0.562 0.522 14.9
43 3 2 3 4 0.569 0.513 0.506 20.8 88 5 3 3 5 0.620 0.663 0.627 8.4
44 4 2 3 4 0.611 0.617 0.611 10.6 89 2 2 4 5 0.500 0.353 0.339 78.6
45 3 3 3 4 0.562 0.544 0.514 20.2 90 3 2 4 5 0.532 0.395 0.393 38.9
a_1 = a_{11} = a_{12} = \frac{\ln(900) - 4.7}{0.53} = 3.97, \qquad a_2 = a_{21} = a_{22} = 2.66,

b = \frac{10 - 7.5}{0.53} = 4.72, \qquad c = \frac{10 - 6.8}{0.53} = 6.04.
Since the values of both a2 and b are not covered in Table 7.14, repeated linear
interpolations are needed. First, find the plans for (a1 , a2 , b, c) = (4, 2, 4, 6) and
(4, 3, 4, 6), and make linear interpolation to (4, 2.66, 4, 6). Next, find the plans
for (4, 2, 5, 6) and (4, 3, 5, 6), and interpolate the plans to (4, 2.66, 5, 6). Then
interpolate the plans for (4, 2.66, 4, 6) and (4, 2.66, 5, 6) to (4, 2.66, 4.72, 6), and
obtain π11 = 0.531, ξ11 = 0.399, ξ21 = 0.394, and V = 31.8. For the purpose of
comparison, we calculate the optimization model directly for (3.97, 2.66, 4.72,
6.04) and get π11 = 0.530, ξ11 = 0.399, ξ21 = 0.389, and V = 32.6. In this case,
the linear interpolation results in a good approximation.
TABLE 7.15 Actual Compromise Test Plan for the Air Pump
Group   Temperature (◦C)   RMS Acceleration (Grms)   Number of Test Units   Censoring Time (h)
1   84   5.3   37   900
2   84   12   7   900
3   120   5.3   7   450
4   120   12   19   450
[Figure (test points of the air pump plan in the (S1, S2) plane, showing the use levels S10, S20 and the maximum allowable levels S12, S22) appears here.]
Example 7.19 Refer to Example 7.18. If the life distribution of the air pump
is mistakenly modeled as lognormal and other data are the same as those in
Example 7.18, calculate the test plan. Comment on the sensitivity of the Yang
compromise test plan to the misspecification of life distribution.
SOLUTION For the preestimates in Example 7.18, Table 7.16 and linear inter-
polation yield the test plan π11 = 0.509, ξ11 = 0.422, ξ21 = 0.412, and V = 26.4.
These values are close to π11 = 0.507, ξ11 = 0.418, ξ21 = 0.407, and V = 27.0,
which are obtained from the direct calculation of the optimization model for the
set (3.97, 2.66, 4.72, 6.04). With the correct Weibull distribution, the approxi-
mate test plan yields the standardized variance of 32. The variance increase is
100 × (32 − 31.8)/31.8 = 0.6%, where 31.8 is the standardized variance derived
in Example 7.18. The small increase in variance indicates that the Yang compro-
mise test plan is robust against the incorrect choice of life distribution for the
preestimates given.
TABLE 7.16 Compromise Test Plans for a Lognormal Distribution with Two
Accelerating Variables
No. a1 a2 b c π11 ξ11 ξ21 V No. a1 a2 b c π11 ξ11 ξ21 V
1 2 2 2 2 0.565 0.708 0.708 9.3 46 4 3 3 4 0.574 0.640 0.617 9.0
2 3 2 2 2 0.666 0.847 0.861 4.0 47 5 3 3 4 0.630 0.747 0.730 5.1
3 2 2 3 2 0.527 0.564 0.602 15.9 48 2 2 4 4 0.477 0.374 0.374 47.7
4 3 2 3 2 0.595 0.685 0.737 7.2 49 3 2 4 4 0.512 0.451 0.460 23.7
5 4 2 3 2 0.681 0.826 0.885 3.5 50 4 2 4 4 0.551 0.540 0.556 12.9
6 2 2 4 2 0.506 0.469 0.525 24.4 51 5 3 4 4 0.590 0.649 0.656 7.4
7 3 2 4 2 0.556 0.574 0.646 11.4 52 4 4 4 4 0.546 0.558 0.558 12.6
8 4 2 4 2 0.618 0.695 0.776 5.8 53 5 4 4 4 0.589 0.656 0.657 7.3
9 2 2 5 2 0.493 0.401 0.467 34.5 54 6 4 4 4 0.639 0.759 0.761 4.4
10 3 2 5 2 0.532 0.494 0.576 16.5 55 2 2 5 4 0.469 0.328 0.341 62.0
11 4 2 5 2 0.580 0.599 0.693 8.7 56 3 2 5 4 0.498 0.398 0.420 31.1
12 2 2 2 3 0.527 0.602 0.564 15.9 57 4 2 5 4 0.531 0.478 0.507 17.2
13 3 2 2 3 0.598 0.713 0.690 7.3 58 3 3 5 4 0.497 0.406 0.422 30.7
14 4 2 2 3 0.685 0.847 0.837 3.6 59 4 3 5 4 0.528 0.487 0.507 17.0
15 2 2 3 3 0.503 0.491 0.491 24.7 60 5 3 5 4 0.563 0.574 0.598 10.0
16 3 2 3 3 0.555 0.591 0.602 11.7 61 5 4 5 4 0.562 0.578 0.599 10.0
17 4 2 3 3 0.616 0.708 0.727 6.0 62 6 4 5 4 0.603 0.670 0.693 6.2
18 3 3 3 3 0.553 0.607 0.607 11.5 63 2 2 6 4 0.463 0.292 0.315 78.1
19 4 3 3 3 0.612 0.724 0.728 5.9 64 3 2 6 4 0.488 0.356 0.388 39.5
20 5 3 3 3 0.687 0.850 0.858 3.2 65 4 2 6 4 0.516 0.429 0.468 22.0
21 2 2 4 3 0.488 0.415 0.437 35.2 66 3 3 6 4 0.487 0.362 0.389 39.1
22 3 2 4 3 0.529 0.504 0.536 17.0 67 4 3 6 4 0.514 0.435 0.468 21.9
23 4 2 4 3 0.576 0.607 0.647 9.0 68 5 3 6 4 0.544 0.514 0.551 13.1
24 3 3 4 3 0.528 0.514 0.540 16.8 69 5 4 6 4 0.543 0.517 0.552 13.1
25 4 3 4 3 0.573 0.617 0.647 8.9 70 6 4 6 4 0.577 0.600 0.638 8.2
26 5 3 4 3 0.628 0.728 0.762 5.0 71 2 2 7 4 0.459 0.263 0.292 95.8
27 2 2 5 3 0.478 0.360 0.394 47.4 72 3 2 7 4 0.481 0.323 0.360 48.8
28 3 2 5 3 0.512 0.440 0.485 23.3 73 4 2 7 4 0.505 0.390 0.435 27.5
29 4 2 5 3 0.550 0.531 0.585 12.6 74 3 3 7 4 0.480 0.327 0.362 48.4
30 3 3 5 3 0.511 0.446 0.488 23.1 75 4 3 7 4 0.503 0.394 0.435 27.3
31 4 3 5 3 0.548 0.538 0.585 12.5 76 5 3 7 4 0.529 0.466 0.512 16.6
32 5 3 5 3 0.591 0.636 0.688 7.2 77 4 4 7 4 0.503 0.395 0.436 27.2
33 2 2 6 3 0.471 0.318 0.360 61.3 78 5 4 7 4 0.529 0.468 0.513 16.5
34 3 2 6 3 0.499 0.390 0.444 30.5 79 6 4 7 4 0.558 0.543 0.592 10.6
35 4 2 6 3 0.531 0.472 0.535 16.7 80 2 2 2 5 0.493 0.467 0.401 34.5
36 3 3 6 3 0.499 0.394 0.446 30.3 81 3 2 2 5 0.540 0.544 0.493 17.1
37 4 3 6 3 0.530 0.477 0.535 16.6 82 4 2 2 5 0.593 0.641 0.601 9.1
38 5 3 6 3 0.565 0.565 0.629 9.8 83 2 2 3 5 0.478 0.394 0.360 47.4
39 2 2 2 4 0.506 0.525 0.469 24.4 84 3 2 3 5 0.516 0.466 0.442 23.8
40 3 2 2 4 0.562 0.616 0.575 11.7 85 4 2 3 5 0.558 0.554 0.537 12.9
41 4 2 2 4 0.627 0.730 0.700 6.0 86 3 3 3 5 0.511 0.488 0.446 23.1
42 2 2 3 4 0.488 0.437 0.415 35.2 87 4 3 3 5 0.550 0.574 0.536 12.7
43 3 2 3 4 0.531 0.521 0.509 17.2 88 5 3 3 5 0.596 0.668 0.636 7.4
44 4 2 3 4 0.581 0.621 0.618 9.2 89 2 2 4 5 0.469 0.341 0.328 62.0
45 3 3 3 4 0.528 0.540 0.514 16.8 90 3 2 4 5 0.500 0.409 0.403 31.4
the usage rate in original units. Then the two-variable test plans for the Weibull
and lognormal distributions are immediately applicable if we specify the censor-
ing usages—not the censoring times. In many applications, censoring times are
predetermined for convenient management of test resources. Then the censoring
usages depend on the respective usage rates, which are to be optimized. This
results in a small change in the optimization models (G. Yang, 2005). But the
test plans given in Tables 7.14 and 7.16 are still applicable by using aij , b, and
c calculated from
a_{ij} = \frac{1}{\sigma}[\ln(\eta_{ij} f_2) - \mu_{22}] = \frac{1}{\sigma}\left[\ln(\eta_{ij} f_2) - \mu_{20} - B\ln\frac{f_2}{f_0}\right],

b = \frac{1}{\sigma}(\mu_{00} - \mu_{20}),   (7.109)

c = \frac{1}{\sigma}\left(\mu_{00} - \mu_{02} + \ln\frac{f_2}{f_0}\right) = \frac{1}{\sigma}(1 - B)\ln\frac{f_2}{f_0},
where ηij is the censoring time, B the usage rate effect parameter in (7.21), and
f0 and f2 the usual and maximum allowable usage rates, respectively. Note that
the units of usage rate should be in accordance with those of the censoring time.
For example, if the usage rate is in cycles per hour, the censoring time should
be in hours.
Example 7.20 In Example 7.15 we presented the actual Yang compromise plan
for testing the compact electromagnetic relays at higher temperatures and switch-
ing rates. Develop the test plan for which the necessary data were given in
Example 7.15; that is, the use temperature is 30◦ C, the maximum allowable tem-
perature is 125◦ C, the usual switching rate is 5 cycles per minute, the maximum
allowable switching rate is 30 cycles per minute, the sample size is 120, the cen-
soring time at 125◦ C is 96 hours, and the censoring time at the low temperature
(to be optimized) is 480 hours.
SOLUTION The test of similar relays at 125◦ C and 5 cycles per minute showed
that the cycles to failure can be modeled with the Weibull distribution with shape
parameter 1.2 and characteristic life 56,954 cycles. These estimates approximate
the shape parameter and characteristic life of the compact relays under study. Thus
we have σ = 1/1.2 = 0.83 and µ20 = ln(56, 954) = 10.95. Using the reliability
prediction handbook MIL-HDBK-217F (U.S. DoD, 1995), we preestimate the fail-
ure rates to be 1.39 × 10−4 failures per hour or 0.46 × 10−6 failure per cycle at a
switching rate of 5 cycles per minute, and 14.77 × 10−4 failures per hour or 0.82 ×
10−6 failures per cycle at a switching rate of 30 cycles per minute. The preestimates
of the location parameters of the log life are obtained from (7.103) as µ00 = 14.66
and µ02 = 14.08. From (7.21), the preestimate of B is
B = \frac{\mu_{00} - \mu_{02}}{\ln(f_0) - \ln(f_2)} = \frac{14.66 - 14.08}{\ln(5) - \ln(30)} = -0.324.
Since B = −0.324 < 1, increasing switching rate shortens the test length.
Using the preestimates and the censoring times given, we have
a_1 = a_{11} = a_{12} = \frac{1}{0.83}[\ln(480 \times 30 \times 60) - 10.95 + 0.324\ln(30/5)] = 3.98, \qquad a_2 = a_{21} = a_{22} = 2.04,

b = \frac{1}{0.83}(14.66 - 10.95) = 4.47, \qquad c = \frac{1}{0.83}(1 + 0.324)\ln(30/5) = 2.86.
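The standardized quantities above can be reproduced as sketched below from the preestimates given in the text.

```python
# Reproducing the standardized quantities of Example 7.20 via (7.109).
import numpy as np

sigma, mu20 = 0.83, 10.95
mu00, mu02 = 14.66, 14.08
f0, f2 = 5.0, 30.0                                     # use and maximum switching rates (cycles/min)
eta_low, eta_high = 480.0, 96.0                        # censoring times (h)
f2_per_hour = f2 * 60.0                                # cycles per hour

B = (mu00 - mu02) / (np.log(f0) - np.log(f2))          # ~ -0.324
mu22 = mu20 + B * np.log(f2 / f0)
a1 = (np.log(eta_low * f2_per_hour) - mu22) / sigma    # ~3.98
a2 = (np.log(eta_high * f2_per_hour) - mu22) / sigma   # ~2.04
b = (mu00 - mu20) / sigma                              # ~4.47
c = (1 - B) * np.log(f2 / f0) / sigma                  # ~2.86
print(B, a1, a2, b, c)
```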
As in Example 7.18, Table 7.14 and repeated linear interpolations can yield the
test plan for (4, 2, 4.47, 2.86). Here we calculate the optimization model (7.107)
directly and obtain π11 = 0.604, ξ11 = 0.57, and ξ21 = 0.625. The standardized
test plan is then transformed to the actual test plan using (7.108). The actual plan
is shown in Table 7.9.
We have seen above that the stress levels used in a HALT can exceed the
operational limit, and the resulting failure modes may be different from those in
the field. Such a test may violate the fundamental assumptions for quantitative
ALT. Thus, the lifetimes from a HALT cannot be extrapolated to design stresses
for reliability estimation. Conversely, the quantitative test plans described earlier
are inept for HALT development or debugging. The differences between the two
approaches are summarized in Table 7.17.
PROBLEMS
7.1 Develop an accelerated life test plan for estimating the reliability of a prod-
uct of your choice (e.g., paper clip, light bulb, hair drier). The plan should
include the acceleration method, stress type (constant temperature, thermal
cycling, voltage, etc.), stress levels, sample size, censoring times, failure
definition, data collection method, acceleration model, and data analysis
method, which should be determined before data are collected. Justify the
test plan. Write this plan as a detailed proposal to your management.
7.2 Explain the importance of understanding the effects of a stress to be
applied when planning an accelerated test. What types
of stresses may be suitable for accelerating metal corrosion? What stresses
accelerate conductor electromigration?
7.3 List the stresses that can accelerate fatigue failure, and explain the fatigue
mechanism under each stress.
(c) Decide which life distribution fits better. Explain your choice.
(d) Estimate the parameters of the selected distribution for each test group.
(e) Suppose that the test units of group 1 were censored at 1 × 105 cycles.
Plot the life data of this group on the probability paper selected in part
(c). Does this distribution still look adequate? Estimate the distribution
parameters and compare them with those obtained in part (d).
7.9 A preliminary test on a valve was conducted to obtain a preestimate of the
life distribution, which would be used for the subsequent optimal design
of accelerated life tests. In the test, 10 units were baked at the maximum
allowable temperature and yielded the following life data (103 cycles): 67.4,
73.6∗ , 105.6, 115.3, 119.3, 127.5, 170.8, 176.2, 200.0∗ , 200.0∗ , where an
asterisk implies censored. Historical data suggest that the valve life is ade-
quately described by the Weibull distribution.
(a) Plot the life data on Weibull probability paper.
(b) Comment on the adequacy of the Weibull distribution.
(c) Estimate the Weibull parameters.
(d) Calculate the B10 life.
(e) Estimate the probability of failure at 40,000 cycles.
(f) Use an acceleration factor of 35.8 between the test temperature and the
use temperature, and estimate the characteristic life at the use tempe-
rature.
7.10 Refer to Problem 7.9.
(a) Do Problem 7.9 (c)–(f) using the maximum likelihood method.
(b) Comment on the differences between the results from part (a) and those
from Problem 7.9.
(c) Calculate the two-sided 90% confidence intervals for the Weibull param-
eters, B10 life, and the probability of failure at 40,000 cycles at the test
temperature.
7.11 Refer to Example 7.3.
(a) Plot the life data of the three groups on the same Weibull probability
paper.
(b) Plot the life data of the three groups on the same lognormal probability
paper.
(c) Does the Weibull or lognormal distribution fit better? Select the better
one to model the life.
(d) Comment on the parallelism of the three lines on the probability paper
of the distribution selected.
(e) For each test voltage, estimate the (transformed) location and scale
parameters.
(f) Estimate the common (transformed) scale parameter.
TABLE 7.19 Test Conditions and Life Data for the GaAs pHEMT Switches
Temperature Relative Number of Censoring Failure Times
(◦ C) Humidity (%) Devices Censored Time (h) (h)
8.1 INTRODUCTION
life, is unknown and not taken into account in life data analysis. In contrast,
a degradation test measures the performance characteristics of an unfailed unit
at different times, including the censoring time. The degradation process and
the distance between the last measurement and a specified threshold are known.
In degradation analysis, such information is also utilized to estimate reliability.
Certainly, degradation analysis has drawbacks and limitations. For example, it
usually requires intensive computations.
In this chapter we describe different techniques for reliability estimation from
degradation data, which may be generated from nondestructive or destructive
inspections. The principle and method for accelerated degradation test with tight-
ened thresholds are also presented. We also give a brief survey of the optimal
design of degradation test plans.
where g(tij ; β1i , β2i , . . . , βpi ) is the true degradation of y of unit i at time tij
and eij is the error term. Often, the error term is independent over i and j and
is modeled with the normal distribution with mean zero and standard deviation
σe , where σe is constant. Although measurements are taken on the same unit, the
potential autocorrelation among eij may be ignored if the readings are widely
spaced. In (8.1), β1i , β2i , . . . , βpi are unknown degradation model parameters
for unit i and should be estimated from test data, and p is the number of such
parameters.
During testing, the inspections on unit i yield the data points (ti1 , yi1 ), (ti2 , yi2 ),
. . . , (timi , yimi ). Since eij ∼ N (0, σe2 ), the log likelihood Li for the measurement
data of unit i is
L_i = -\frac{m_i}{2}\ln(2\pi) - m_i\ln(\sigma_e) - \frac{1}{2\sigma_e^2}\sum_{j=1}^{m_i}\left[y_{ij} - g(t_{ij};\beta_{1i},\beta_{2i},\ldots,\beta_{pi})\right]^2.   (8.2)
The estimates β̂1i , β̂2i , . . . , β̂pi and σ̂e are obtained by maximizing Li directly.
The parameters may also be estimated by the least squares method. This is
done by minimizing the sum of squares of the deviations of the measurements
from the true degradation path, which is given by
\mathrm{SSD}_i = \sum_{j=1}^{m_i} e_{ij}^2 = \sum_{j=1}^{m_i}\left[y_{ij} - g(t_{ij};\beta_{1i},\beta_{2i},\ldots,\beta_{pi})\right]^2,   (8.3)
where SSDi is the sum of squares of deviations for unit i. Note that the maximum
likelihood estimates are the same as the least squares estimates.
Once the estimates β̂1i , β̂2i , . . . , β̂pi are obtained, we can calculate the pseu-
dolife. If a failure occurs when y crosses a specified threshold, denoted G, the
life of unit i is given by
\hat{t}_i = g^{-1}(G;\hat{\beta}_{1i},\hat{\beta}_{2i},\ldots,\hat{\beta}_{pi}),   (8.4)

FIGURE 8.1 Relation of degradation path, pseudolife, and life distribution
where g −1 is the inverse of g. Applying (8.4) to each test unit yields the lifetime
estimates tˆ1 , tˆ2 , . . . , tˆn . Apparently, pseudolifetimes are complete exact data. In
Chapter 7 we described probability plotting for this type of data. By using the
graphical method, we can determine a life distribution that fits these life data
adequately and estimate the distribution parameters. As explained in Chapter 7,
the maximum likelihood method should be used for estimation of distribution
parameters and other quantities of interest when commercial software is avail-
able. Figure 8.1 depicts the relation of degradation path, pseudolife, and life distribution.
FIGURE 8.1 Relation of degradation path, pseudolife, and life distribution
For some products, the true degradation path is simple and can be written in
a linear form:
g(t) = β1i + β2i t, (8.5)
where g(t), t, or both may represent a log transformation. Some examples fol-
low. The wear of an automobile tire is directly proportional to mileage and β1i =
0. Tseng et al. (1995) model the log luminous flux of the fluorescent lamp as a lin-
ear function of time. K. Yang and Yang (1998) establish a log-log linear relation-
ship between the variation ratio of luminous power and the aging time for a type
of infrared light-emitting diodes. The MOS field-effect transistors have a linear
relationship between the log current and log time, according to J. Lu et al. (1997).
For (8.5), the least squares estimates of β1i and β2i are

β̂1i = ȳi − β̂2i t̄i,

β̂2i = [mi Σ_{j=1}^{mi} yij tij − Σ_{j=1}^{mi} yij Σ_{j=1}^{mi} tij] / [mi Σ_{j=1}^{mi} tij² − (Σ_{j=1}^{mi} tij)²],

where

ȳi = (1/mi) Σ_{j=1}^{mi} yij,   t̄i = (1/mi) Σ_{j=1}^{mi} tij.
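To make the calculation concrete, the following Python sketch (not from the book) applies the closed-form least squares expressions above to one unit and then computes the pseudolife from (8.4) and (8.5). The data values are hypothetical placeholders; substitute real degradation readings.

```python
import numpy as np

# Hypothetical inspection data for one unit: times t_ij and measurements y_ij.
t = np.array([100.0, 500.0, 1000.0, 5000.0, 10000.0])   # seconds (illustrative only)
y = np.array([0.8, 1.9, 2.7, 5.9, 8.2])                  # percent degradation (illustrative only)

# Least squares estimates of the linear path y = beta1 + beta2 * t in (8.5).
m = len(t)
beta2 = (m * np.sum(y * t) - np.sum(y) * np.sum(t)) / (m * np.sum(t**2) - np.sum(t)**2)
beta1 = y.mean() - beta2 * t.mean()

# Pseudolife from (8.4): time at which the fitted path crosses the threshold G.
G = 15.0                                  # assumed threshold (percent)
pseudolife = (G - beta1) / beta2
print(beta1, beta2, pseudolife)
```

If g(t) or t is log-transformed, the same formulas apply to the transformed readings, and the pseudolife is recovered by inverting the transformation.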
FIGURE 8.2 Percent transconductance degradation over time
The degradation model above is fitted to each degradation path. Simple linear
regression analysis suggests that the degradation model is adequate. The least
squares estimates for the five paths are shown in Table 8.1. After obtaining the
estimates, we calculate the pseudolifetimes. For example, for unit 5 we have
ln(t̂5) = [ln(15) + 2.217]/0.383 = 12.859   or   t̂5 = 384,285 seconds.
Similarly, the pseudolifetimes for the other four units are tˆ1 = 17,553, tˆ2 =
31,816, tˆ3 = 75,809, and tˆ4 = 138,229 seconds. Among the commonly used
life distributions, the lognormal provides the best fit to these data. Figure 8.3
shows the lognormal plot, ML fit, and two-sided 90% confidence interval for
percentiles. The ML estimates of the scale and shape parameters are µ̂ = 11.214
and σ̂ = 1.085.
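As a quick check on these values, the lognormal ML estimates are simply the mean and population standard deviation of the log pseudolifetimes. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

# Pseudolifetimes (seconds) of the five test units quoted in Example 8.1.
t_hat = np.array([17553.0, 31816.0, 75809.0, 138229.0, 384285.0])

log_t = np.log(t_hat)
mu_hat = log_t.mean()             # ML estimate of the lognormal scale parameter
sigma_hat = log_t.std(ddof=0)     # ML estimate of the shape parameter (divide by n)
print(round(mu_hat, 3), round(sigma_hat, 3))   # approximately 11.214 and 1.085
```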
· exp{−(1/2)[Σ_{j=1}^{mi} zij² + (βi − µβ)′ Σβ^−1 (βi − µβ)]} dβ1i · · · dβpi,   (8.6)
where zij = [yij − g(tij; β1i, β2i, . . . , βpi)]/σe, βi = [β1i, β2i, . . . , βpi], and |Σβ|
is the determinant of Σβ. Conceptually, the model parameters, including the
mean vector µβ, the variance–covariance matrix Σβ, and the standard deviation
of error σe, can be estimated by directly maximizing the likelihood. In practice,
however, the calculation is extremely difficult unless the true degradation path
takes a simple linear form such as in (8.5).
Here we provide a multivariate approach to estimating the model parameters
µβ and β . The approach is approximately accurate, yet very simple. First, we
fit the degradation model (8.1) to each individual degradation path and calculate
the parameter estimates β̂1i , β̂2i , . . . , β̂pi (i = 1, 2, . . . , n) by maximizing the log
likelihood (8.2) or by minimizing the sum of squares of the deviations (8.3). The
estimates of each parameter are considered as a sample of n observations. The
sample mean vector is
β = [β 1 , β 2 , . . . , β p ], (8.7)
where
1
n
βj = β̂j i , j = 1, 2, . . . , p. (8.8)
n i=1
The sample variance–covariance matrix S = [skj] (k, j = 1, 2, . . . , p) has elements

skj = (1/n) Σ_{i=1}^{n} (β̂ki − β̄k)(β̂ji − β̄j),   (8.10)

and the correlation coefficient between βk and βj is estimated by

ρkj = skj/(√skk √sjj).   (8.11)
[Figure: normal probability plot of the degradation model parameter estimates b1 and b2]
Example 8.2 In Example 8.1 we have fitted the log-log linear model to each
degradation path and obtained the estimates β̂1 and β̂2 for each path. Now β1
and β2 are considered as random variables, and we want to estimate µβ and Σβ by using the multivariate approach.
[Figure: distribution of y at time ti, threshold G, and the resulting life distribution]
ρ12 = s12/(√s11 √s22) = −0.01254/(√0.1029 √0.00387) = −0.628.
The large absolute value suggests that the correlation between the two parameters
cannot be ignored, while the negative sign indicates that β1 increases as β2
decreases, and vice versa. In other words, in this particular case, a unit with a
smaller degradation percentage early in the test time (t = 1 second) will have a
greater degradation rate.
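A minimal sketch of the multivariate approach of (8.7)–(8.11), assuming NumPy is available. The per-unit estimates below are hypothetical placeholders, not the data of Example 8.2; substitute the estimates obtained from each degradation path.

```python
import numpy as np

# Each row holds the per-unit estimates (beta1_i, beta2_i); placeholder values only.
beta_hat = np.array([
    [-2.52, 0.41],
    [-2.31, 0.35],
    [-1.98, 0.30],
    [-2.70, 0.44],
    [-2.22, 0.38],
])

n = beta_hat.shape[0]
beta_bar = beta_hat.mean(axis=0)                          # sample mean vector, (8.7)-(8.8)
S = (beta_hat - beta_bar).T @ (beta_hat - beta_bar) / n   # variance-covariance matrix, (8.10)
rho12 = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])              # correlation coefficient, (8.11)
print(beta_bar, S, rho12)
```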
most applications, however, (8.12) has to be evaluated numerically for the given
distributions of the model parameters.
As (8.12) indicates, the probability of failure depends on the distribution of the
model parameters, which, in turn, is a function of stress level. As such, an acceler-
ated test is often conducted at an elevated stress level to generate more failures or
a larger amount of degradation before the test is censored. Sufficient degradation
reduces the statistical uncertainty of the estimate of the probability of failure. On
the other hand, (8.12) also indicates that the probability of failure is influenced
by the threshold. Essentially, a threshold is subjective and may be changed in
specific applications. For a monotonically increasing performance characteristic,
the smaller the threshold, the shorter the life and the larger the probability of fail-
ure. In this sense, a threshold can be considered as a stress; tightening a threshold
accelerates the test. This acceleration method was mentioned in Chapter 7 and is
discussed in detail in this chapter.
Example 8.3 In Example 8.2 we computed the mean vector β̄ and the variance–covariance matrix S for an MOS field-effect transistor. Now we want to evaluate the probability of failure at 1000, 2000, 3000, . . ., 900,000 seconds by Monte Carlo simulation.
SOLUTION Using Minitab we generated 65,000 sets of β1 and β2 from the bivariate normal distribution with mean vector β̄ and variance–covariance matrix S calculated in Example 8.2. At a given time (e.g., t = 40,000), we computed the percent transconductance degradation for each set (β1, β2) and counted the number of degradation percentages greater than 15%. For t = 40,000, the number is r = 21,418. The probability of failure is F(40,000) ≈
21,418/65,000 = 0.3295. Repeating the calculation for the other times and plotting the probabilities of failure gives Figure 8.6. In the next subsection this plot is compared with others obtained using different approaches.
FIGURE 8.6 Failure probabilities computed from the Monte Carlo simulation data
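A sketch of this Monte Carlo evaluation in Python, assuming NumPy and the log-log linear path ln(y) = β1 + β2 ln(t). The covariance entries are those visible in Example 8.2; the mean vector below is a hypothetical placeholder, so substitute the β̄ computed in Example 8.2 before comparing with the text.

```python
import numpy as np

rng = np.random.default_rng(1)

beta_bar = np.array([-2.2, 0.38])          # placeholder mean vector (use the values from Example 8.2)
S = np.array([[0.1029, -0.01254],          # s11, s12 from Example 8.2
              [-0.01254, 0.00387]])        # s12, s22 from Example 8.2

# Draw 65,000 (beta1, beta2) pairs from the bivariate normal random-effect model.
beta = rng.multivariate_normal(beta_bar, S, size=65_000)

def prob_failure(t, threshold=15.0):
    """Fraction of simulated paths whose degradation exceeds the threshold at time t."""
    y = np.exp(beta[:, 0] + beta[:, 1] * np.log(t))   # percent transconductance degradation
    return np.mean(y >= threshold)

print(prob_failure(40_000))   # compare with F(40,000) ≈ 0.3295 quoted in the text
```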
Example 8.4 In Example 8.2 we computed the mean vector β̄ and the variance–covariance matrix S for the MOS field-effect transistor. Now we use (8.13) to evaluate the probabilities of failure at 1000, 2000, 3000, . . ., 900,000 seconds.
FIGURE 8.7 Probabilities of failure calculated using different methods
The probabilities of failure at other times are computed similarly. Figure 8.7
shows probabilities at various times calculated from (8.13). For comparison,
probabilities using Monte Carlo simulation and pseudolife calculation are also
shown in Figure 8.7. The probability plots generated from (8.13) and the Monte
Carlo simulation cannot be differentiated visually, indicating that estimates from
the two approaches are considerably close. In contrast, the pseudolife calcula-
tion gives significantly different results, especially when the time is greater than
150,000 seconds. Let’s look at the numerical differences at the censoring time t =
40,000 seconds. Using (8.13), the probability of failure is F (40,000) = 0.3266.
The Monte Carlo simulation gave the probability as F (40,000) = 0.3295, as
shown in Example 8.3. The percentage difference is only 0.9%. Using the pseu-
dolife approach, the probability is
F(40,000) = Φ[(ln(40,000) − 11.214)/1.085] = 0.2847.
It deviates from F (40,000) = 0.3295 (the Monte Carlo simulation result) by
13.6%. In general, compared with the other two methods, the pseudolife method
provides less accurate results. But its simplicity is an obvious appeal.
be written as

F(t) = Pr[g(t) ≥ G] = Pr(β1 + β2 t ≥ G) = Pr[β2 ≥ (G − β1)/t]
     = Φ{[µβ2/(G − β1) − 1/t] / [σβ2/(G − β1)]}.   (8.14)
Now let’s consider the case where ln(β2 ) can be modeled with a normal dis-
tribution with mean µβ2 and standard deviation σβ2 ; that is, β2 has a lognormal
distribution with scale parameter µβ2 and shape parameter σβ2 . For a monotoni-
cally increasing characteristic, the probability of failure can be expressed as
F(t) = Pr[g(t) ≥ G] = Φ{[ln(t) − (ln(G − β1) − µβ2)]/σβ2}.   (8.15)
This indicates that the time to failure also has a lognormal distribution; the
scale parameter is ln(G − β1 ) − µβ2 and the shape parameter is σβ2 . Substituting
the estimates of the degradation model parameters and G into (8.15) gives an
estimate of the probability of failure.
Example 8.5 A solenoid valve is used to control the airflow at a desired rate.
As the valve ages, the actual airflow rate deviates from the rate desired. The
deviation represents the performance degradation of the valve. A sample of 11
valves was tested, and the percent deviation of the airflow rate from that desired
was measured at different numbers of cycles. Figure 8.8 plots the degradation
paths of the 11 units. Assuming that the valve fails when the percent deviation is
greater than or equal to 20%, estimate the reliability of the valve at 50,000 cycles.
FIGURE 8.8 Degradation paths of the solenoid valves
FIGURE 8.9 Lognormal plot, ML fits, and percentile confidence intervals for the estimates of β2
SOLUTION Because the initial percent deviations are small, deviations at time zero are negligible. Thus, the true degradation path can
be modeled by g(t) = β2 t, where t is in thousands of cycles. The simple linear
regression analyses yielded estimates of β2 for the 11 test units. The estimates are
0.2892, 0.2809, 0.1994, 0.2303, 0.3755, 0.3441, 0.3043, 0.4726, 0.3467, 0.2624,
and 0.3134. Figure 8.9 shows the lognormal plot of these estimates with ML fit
and two-sided 90% confidence intervals for percentiles. It is seen that β2 can be
approximated adequately by a lognormal distribution with µ̂β2 = −1.19406 and
σ̂β2 = 0.22526.
From (8.15), the cycles to failure of the valve also have a lognormal distribu-
tion. The probability of failure at 50,000 cycles is
F(50) = Φ{[ln(50) − (ln(20) + 1.19406)]/0.22526} = 0.1088.
The reliability at this time is 0.8912, indicating that about 89% of the valves
will survive 50,000 cycles of operation.
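The calculation in this example can be reproduced with a few lines of Python; the sketch below assumes NumPy and SciPy are available and uses the slope estimates quoted above.

```python
import numpy as np
from scipy.stats import norm

# Least squares slopes beta2 for the 11 valves (percent deviation per 1000 cycles).
b2 = np.array([0.2892, 0.2809, 0.1994, 0.2303, 0.3755, 0.3441,
               0.3043, 0.4726, 0.3467, 0.2624, 0.3134])

mu_b2 = np.log(b2).mean()          # lognormal scale parameter of beta2
sig_b2 = np.log(b2).std(ddof=0)    # lognormal shape parameter of beta2 (ML, divide by n)

# From (8.15) with beta1 = 0, threshold G = 20%, and t in thousands of cycles.
G, t = 20.0, 50.0
F = norm.cdf((np.log(t) - (np.log(G) - mu_b2)) / sig_b2)
print(round(mu_b2, 5), round(sig_b2, 5), round(F, 4), round(1 - F, 4))
# approximately -1.194, 0.2253, 0.1088, and 0.8912
```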
described earlier in the chapter are not applicable. In this section we present
the random-process method for degradation testing and data analysis. It is worth
noting that this method is equally applicable to nondestructive products and is
especially suitable for cases when degradation models are complicated. Examples
of such applications include K. Yang and Xue (1996), K. Yang and Yang (1998),
and W. Wang and Dragomir-Daescu (2002).
FIGURE 8.10 Three shapes of degradation paths and decreasing performance dispersion
• For the convex decreasing degradation path in Figure 8.10, the degradation
rate becomes smaller as time increases. For the degradation amount between
two consecutive inspection times to be noticeable, more units should be
allocated to high time inspections regardless of the performance dispersion.
The effect of unit-to-unit variability is usually less important than the aging
effect.
• For the concave decreasing degradation path in Figure 8.10, the degradation
rate is flat at a low time. More units should be assigned to low time inspec-
tions. This principle applies to both constant and decreasing performance
dispersion.
Maximizing (8.17) directly gives the estimates of the model parameters β and θ .
If σy is constant and µy (t; β) is a linear function of (log) time as given in (8.16),
then (8.17) will be greatly simplified. In this special case, commercial software
packages such as Minitab and Reliasoft ALTA for accelerated life test analysis
can be used to estimate the model parameters β1, β2, and σy. This is done by
treating (8.16) as an acceleration relationship, where tj is considered a stress
level. Once the estimates are computed, the conditional cdf Fy (y; t) is readily
available. Now let’s consider the following cases.
Case 1: Weibull Performance Suppose that the performance characteristic y has
a Weibull distribution with shape parameter βy and characteristic life αy , where
βy is constant and ln(αy ) = β1 + β2 t. Since the measurements are complete exact
data, from (7.59) and (8.17) the total sample log likelihood is
L(β1, β2, βy) = Σ_{j=1}^{m} Σ_{i=1}^{nj} {ln(βy) − βy(β1 + β2 tj) + (βy − 1) ln(yij) − [yij/exp(β1 + β2 tj)]^βy}.   (8.18)
The estimates β̂1 , β̂2 , and β̂y can be calculated by maximizing (8.18) directly.
As in accelerated life test data analysis, commercial software can be employed
to obtain these estimates. In computation, we treat the performance characteristic
y as life, the linear relationship ln(αy ) = β1 + β2 t as an acceleration model, the
inspection time t as a stress, m as the number of stress levels, and nj as the
number of units tested at “stress level” tj . If y is a monotonically decreasing
characteristic such as strength, the effect of time on y is analogous to that of
stress on life. If y is a monotonically increasing characteristic (mostly in non-
destructive cases), the effects are exactly opposite. Such a difference does not
impair the applicability of the software to this type of characteristic. In this case,
the parameter β2 is positive.
The conditional cdf for y can be written as
Fy(y; t) = 1 − exp{−[y/exp(β̂1 + β̂2 t)]^β̂y}.   (8.19)
Case 2: Lognormal Performance If the performance characteristic y has a log-
normal distribution with scale parameter µy and shape parameter σy , ln(y) has
the normal distribution with mean µy and standard deviation σy . If σy is constant
and µy = β1 + β2 t is used, the total sample log likelihood can be obtained easily
from (7.73) and (8.17). As in the Weibull case, the model parameters may be cal-
culated by maximizing the likelihood directly or by using the existing commercial
software.
The conditional cdf for y can be expressed as
Fy(y; t) = Φ[(ln(y) − β̂1 − β̂2 t)/σ̂y].   (8.20)
The existing commercial software packages do not handle nonconstant σy; in this case the estimates β̂1, β̂2, θ̂1, and θ̂2 are calculated by maximizing (8.22) directly. This will be illustrated in Example 8.6.
The conditional cdf for y is

Fy(y; t) = Φ{[y − exp(β̂1 + β̂2 t)]/exp(θ̂1 + θ̂2 t)}.   (8.23)
The three cases above illustrate how to determine the conditional cdf for the
performance characteristic. Now we want to relate the performance distribution
to a life distribution. Similar to the nondestructive inspection case described in
Section 8.4.2, the probability of failure at a given time is equal to the probability
of the performance characteristic crossing a threshold at that time. In particular,
if a failure is defined in terms of y ≤ G, the probability of failure at time t equals
the probability of y(t) ≤ G: namely,
F (t) = Pr(T ≤ t) = Pr[y(t) ≤ G] = Fy (G; t). (8.24)
In some simple cases, it is possible to express F (t) in a closed form. For example,
for case 2, the probability of failure is given by
F(t) = Φ[(ln(G) − β̂1 − β̂2 t)/σ̂y] = Φ{[t − (ln(G) − β̂1)/β̂2]/(−σ̂y/β̂2)}.   (8.25)
This indicates that the time to failure has a normal distribution with mean
[ln(G) − β̂1 ]/β̂2 and standard deviation −σ̂y /β̂2 . Note that β2 is negative for
a monotonically decreasing characteristic, and thus −σ̂y /β̂2 is positive.
Example 8.6 In Section 5.13 we presented a case study on the robust reliabil-
ity design of IC wire bonds. The purpose of the study was to select a setting
of bonding parameters that maximizes robustness and reliability. In the experi-
ment, wire bonds were generated with different settings of bonding parameters
according to the experimental design. The bonds generated with the same setting
were divided into two groups each with 140 bonds. One group underwent level
1 thermal cycling and the other group was subjected to level 2. For each group
a sample of 20 bonds was sheared at each of 0, 50, 100, 200, 300, 500, and 800 cycles to measure the bonding strength. In this example, we want
to estimate the reliability of the wire bonds after 1000 cycles of level 2 thermal
cycling, where the bonds were generated at the optimal setting of the bonding
parameters. The optimal setting is a stage temperature of 150◦ C, ultrasonic power
of 7 units, bonding force of 60 gf, and bonding time of 40 ms. As described in
Section 5.13, the minimum acceptable bonding strength is 18 grams.
SOLUTION The strength measurements [in grams (g)] at each inspection time
can be modeled with a normal distribution. Figure 8.11 shows normal fits to the
strength data at, for example, 0, 300, and 800 cycles. In Section 5.13 we show
that the normal mean and standard deviation decrease with the number of ther-
mal cycles. Their relationships can be modeled using (8.21). Figure 8.12 plots the
relationships for the wire bonds that underwent level 2 thermal cycling. Simple
linear regression analysis gives β̂1 = 4.3743, β̂2 = −0.000716, θ̂1 = 2.7638, and
θ̂2 = −0.000501. Substituting these estimates into (8.23) and (8.24) can yield an
estimate of the probability of failure at a given time. To improve the accuracy of
the estimate, we use the maximum likelihood method. The parameters are esti-
mated by maximizing (8.22), where yij are the strength measurements, m = 7,
and nj = 20 for all j . The estimates obtained from linear regression analysis serve
as the initial values. The maximum likelihood estimates are β̂1 = 4.3744, β̂2 =
−0.000712, θ̂1 = 2.7623, and θ̂2 = −0.000495. From (8.23) and (8.24), the esti-
mate of reliability at 1000 cycles is
R(1000) = 1 − Φ{[18 − exp(4.3744 − 0.000712 × 1000)]/exp(2.7623 − 0.000495 × 1000)} = 0.985.
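For reference, this reliability can be evaluated directly from (8.23) and (8.24); a brief sketch assuming SciPy is available and using the ML estimates quoted in the example:

```python
from math import exp
from scipy.stats import norm

b1, b2 = 4.3744, -0.000712     # ML estimates for ln(mu_y) = b1 + b2 * t
th1, th2 = 2.7623, -0.000495   # ML estimates for ln(sigma_y) = th1 + th2 * t
G, t = 18.0, 1000.0            # minimum acceptable strength (grams) and cycles of interest

# Normal performance: F(t) = Phi{[G - exp(b1 + b2 t)] / exp(th1 + th2 t)}.
F = norm.cdf((G - exp(b1 + b2 * t)) / exp(th1 + th2 * t))
print(round(1 - F, 3))         # reliability at 1000 cycles, about 0.985
```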
FIGURE 8.11 Normal fits to the strength data at different cycles
FIGURE 8.12 ln(µy) and ln(σy) versus the number of cycles
[Figure: stress relaxation ∆s/s0 (%) versus time at 65, 85, and 100°C]
where ∆s is the stress loss by time t, s0 the initial stress, the ratio ∆s/s0 the stress relaxation (in percent), A and B are unknowns, and other notation is as in (7.4). Here A usually varies from unit to unit, and B is a fixed-effect parameter. At a given temperature, (8.26) can be written as

ln(∆s/s0) = β1 + β2 ln(t),   (8.27)
SOLUTION First we fit (8.27) to each degradation path, and estimate the para-
meters β1 and β2 for each unit using the least squares method. Figure 8.14 plots
the fits of the degradation model to the data and indicates that the model is
adequate. Then we calculate the approximate lifetime of each test unit using
tˆ = exp{[ln(30) − β̂1 ]/β̂2 }. The resulting lifetimes are 15,710, 20,247, 21,416,
29,690, 41,167, and 42,666 hours at 65◦ C; 3676, 5524, 7077, 7142, 10,846, and
10,871 hours at 85◦ C; and 1702, 1985, 2434, 2893, 3343, and 3800 hours at
100◦ C. The life data are plotted on the lognormal probability paper, as shown in
Figure 8.15. It is seen that the lifetimes at the three temperatures are reasonably
lognormal with a common shape parameter σ .
FIGURE 8.14 Degradation model fitted to the measurement data of stress relaxation
[FIGURE 8.15 Lognormal probability plot of the approximate lifetimes at 65, 85, and 100°C]
Since a failure occurs when ∆s/s0 ≥ 30%, from (8.26) the nominal life can be written as

t = (30/A)^(1/B) exp[Ea/(kB T)].
By using the maximum likelihood method for accelerated life data analysis as
described in Section 7.7, we obtain the ML estimates as γ̂0 = −14.56, γ̂1 =
8373.35, and σ̂ = 0.347. The estimate of the scale parameter at 40◦ C is µ̂ =
−14.56 + 8373.35/313.15 = 12.179. Then the probability of failure of the con-
nectors operating at 40◦ C for 15 years (131,400 hours) is
F(131,400) = Φ[(ln(131,400) − 12.179)/0.347] = 0.129   or   12.9%.
That is, an estimated 12.9% of the connectors will fail by 15 years when used
at 40◦ C.
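A short sketch of this final step, assuming SciPy is available and using the Arrhenius–lognormal estimates quoted above:

```python
from math import log
from scipy.stats import norm

g0, g1, sigma = -14.56, 8373.35, 0.347   # ML estimates of the acceleration model
T_use = 40.0 + 273.15                    # use temperature in kelvin

mu_use = g0 + g1 / T_use                 # lognormal scale parameter at 40 C
t_design = 15 * 8760.0                   # 15 years in hours (131,400 h)
F = norm.cdf((log(t_design) - mu_use) / sigma)
print(round(mu_use, 3), round(F, 3))     # about 12.179 and 0.129
```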
In some simple cases, F (t) can be expressed in closed form. Let’s consider
the true degradation path
g(t, S) = β1 + γ S + β2 t,
µy (t, S; β) = β1 + β2 t + β3 S. (8.30)
total sample size. Let fy (y; t, S) denote the pdf of y distribution conditional on
t and S. Similar to (8.17), the total sample log likelihood can be expressed as
L(β, θ) = Σ_{k=1}^{q} Σ_{j=1}^{mk} Σ_{i=1}^{njk} ln[fy(yijk; tjk, Sk)].   (8.31)
For nondestructive inspections, the notation above is slightly different: yijk denotes the measurement at time tijk on unit i of stress level Sk, where i = 1, 2, . . . , nk, j = 1, 2, . . . , mik, and k = 1, 2, . . . , q; nk is the number of units at Sk; and mik is the number of inspections on unit i of stress level Sk. Clearly, Σ_{k=1}^{q} nk = n. The total sample log likelihood is similar to (8.31), but the notation changes accordingly. Note that the likelihood may be only approximately correct for nondestructive inspections because of potential autocorrelation among the measurements. To reduce the autocorrelation, inspections should be widely spaced.
Maximizing the log likelihood directly yields estimates of the model parame-
ters β and θ . If σy is constant and µy (t, S; β) is a linear function of (log) time
and (transformed) stress given by (8.30), then (8.31) will be greatly simplified. In
this special case, commercial software packages for accelerated life test analysis
can be used to estimate the model parameters β1 , β2 , β3 , and σy . In calculation,
we treat (8.30) as a two-variable acceleration relationship, where the time t is
considered as a stress.
Evaluating the Probability of Failure After obtaining the estimates β̂ and θ̂,
we can calculate the conditional cdf for y, denoted by Fy (y; t, S). If a failure
is defined in terms of y ≤ G, the probability of failure at time t and use stress
level S0 is given by
F (t, S0 ) = Fy (G; t, S0 ). (8.32)
For example, if y has the lognormal distribution with µy modeled by (8.30) and
constant σy , the estimate of the probability of failure at t and S0 is
F(t, S0) = Φ[(ln(G) − β̂1 − β̂2 t − β̂3 S0)/σ̂y] = Φ{[t − (ln(G) − β̂1 − β̂3 S0)/β̂2]/(−σ̂y/β̂2)}.   (8.33)
Example 8.8 Refer to Example 8.7. Using the random-process method, estimate
the probability of failure of connectors operating at 40◦ C (use temperature) for
15 years (design life).
FIGURE 8.16 Lognormal fits to stress relaxation measurements at (a) 65°C, (b) 85°C, and (c) 100°C
Directly maximizing the likelihood yields the estimates β̂1 = 9.5744, β̂2 = 0.4519,
β̂3 = −3637.75, and σ̂y = 0.1532. The calculation was performed using the
Solver feature of Microsoft Excel. Alternatively, as described earlier, we may
treat the measurement data as if they came from an accelerated life test. The pseu-
dotest involves two stresses (temperature and time), and the acceleration model
combines the Arrhenius relationship and the inverse power relationship. Minitab
gave the estimates β̂1 = 9.5810, β̂2 = 0.4520, β̂3 = −3640.39, and σ̂y = 0.1532,
which are close to those from the Excel calculation and are used for subsequent
analysis.
Because stress relaxation is a monotonically increasing characteristic, the prob-
ability of failure at t and S0 is the complement of the probability given in (8.33)
and can be written as
F(t) = Φ{[ln(t) − (ln(G) − β̂1 − β̂3 S0)/β̂2]/(σ̂y/β̂2)} = Φ[(ln(t) − 12.047)/0.3389].   (8.34)
This shows that the time to failure has a lognormal distribution with scale param-
eter 12.047 and shape parameter 0.3389. Note that, in Example 8.7, the pseudo
accelerated life test method resulted in a lognormal distribution with scale and
shape parameters equal to 12.179 and 0.347. At the design life of 15 years
(131,400 hours), from (8.34) the probability of failure is 0.2208. This estimate
should be more accurate than that in Example 8.7. For comparison, the probabilities at different times calculated from the two methods are plotted in Figure 8.17. It is seen that the random-process method always gives a higher probability of failure than the other method in this case. In general, the random-process method results in more accurate estimates.
FIGURE 8.17 Probabilities of failure calculated from two methods
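The random-process result can be reproduced as follows. This is a sketch assuming SciPy; S0 is taken as the reciprocal absolute temperature at 40°C, which is consistent with the Arrhenius term in the fitted relationship.

```python
from math import log
from scipy.stats import norm

b1, b2, b3, sig_y = 9.5810, 0.4520, -3640.39, 0.1532   # Minitab estimates from Example 8.8
G = 30.0                      # usual threshold on stress relaxation (%)
S0 = 1.0 / (40.0 + 273.15)    # use stress: 1/T with T in kelvin (assumed transformation)

mu_t = (log(G) - b1 - b3 * S0) / b2     # lognormal scale parameter, about 12.047
shape = sig_y / b2                      # lognormal shape parameter, about 0.3389

t_design = 15 * 8760.0                  # 15 years in hours
F = norm.cdf((log(t_design) - mu_t) / shape)
print(round(mu_t, 3), round(shape, 4), round(F, 3))   # F is about 0.22 at 15 years
```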
FIGURE 8.18 Relationship between life and threshold for an increasing characteristic
Example 8.9 In Example 8.7 the usual threshold for stress relaxation is 30%.
Reducing the threshold shortens the time to failure. Determine the relationship
between the time to failure and the threshold, and evaluate the effect of threshold
on the probability of failure.
SOLUTION From (8.34), the estimate of the mean log life (location parameter
µt ) is
µ̂t = [ln(G) − β̂1 − β̂3 S0]/β̂2 = 4.5223 + 2.2124 ln(G),
where β̂1 , β̂2 , and β̂3 are as obtained in Example 8.8. It is seen that the mean log
life is a linear function of the log threshold. The influence of the threshold on
life is significant because of the large slope. Figure 8.19 plots the probabilities of
failure with thresholds 30%, 25%, and 20%. It is seen that reducing the threshold
greatly increases the probability of failure. For instance, at a design life of 15
years, the probabilities of failure at the three thresholds are 0.2208, 0.6627, and
0.9697, respectively.
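The threshold effect can be tabulated with a few lines of Python (a sketch assuming SciPy, reusing the estimates from Example 8.8):

```python
from math import log
from scipy.stats import norm

shape = 0.3389                 # lognormal shape parameter from (8.34)
t_design = 15 * 8760.0         # design life of 15 years, in hours

for G in (30.0, 25.0, 20.0):   # usual and tightened thresholds (%)
    mu_t = 4.5223 + 2.2124 * log(G)          # mean log life as a function of threshold
    F = norm.cdf((log(t_design) - mu_t) / shape)
    print(G, round(F, 4))      # approximately 0.22, 0.66, and 0.97
```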
FIGURE 8.19 Probabilities of failure with different thresholds
can have m “life” observations. Let tij k denote the “failure” time of unit i at Sk
and at Gj , where i = 1, 2, . . . , nk , j = 1, 2, . . . , m, and k = 1, 2, . . . , q. The life
distribution at the use stress level S0 and the usual threshold G0 is estimated by
utilizing these “life” data.
The most severe threshold Gm should be as tight as possible to maximize the
threshold range and the number of failures at Gm , but it must not fall in the
fluctuating degradation stage caused by the burn-in effects at the beginning of a
test. The space between two thresholds should be as wide as possible to reduce
the potential autocorrelation among the failure times. For this purpose, we usually
use m ≤ 4. In Section 8.8 we describe optimal design of the test plans.
µ(S, G) = β1 + β2 S + β3 G. (8.35)
The purpose of the life data analysis is to estimate the life distribution at the use stress level and the usual threshold. This can be done by using the graphical or maximum likelihood method described in Section 7.7. The analysis is illustrated in Example 8.10.
1 15 320 19 635
2 15 320 10 635
3 25 170 19 2550
4 25 170 10 2550
[Figure: power degradation (%) of the IRLEDs over time]
the values of the variation ratio at each inspection time. The data are shown in
Tables 8.9 and 8.10 of Problem 8.8.
The variation ratio of luminous power is a function of time and current, and
can be written as
y = (A/I^B) t^C,   (8.36)
unit failed between the two consecutive inspections. Therefore, the failure time
is within the interval [1536,1905]. In contrast, the exact time from interpolation
is 1751.
Figure 8.21 plots the lognormal fits to the failure times listed in Table 8.3. It
is shown that the shape parameter σ is approximately independent of current and
threshold. In addition, from (8.36), the scale parameter µ is a linear function of
the log current and the log threshold: namely,
µ(I, G) = β1 + β2 ln(I ) + β3 ln(G). (8.37)
The estimates of the model parameters were computed using Minitab as β̂1 =
28.913, β̂2 = −4.902, β̂3 = 1.601, and σ̂ = 0.668. The mean log life at the
operating current of 50 mA and the usual threshold of 30% is µ̂ = 15.182. The
probability of failure in 10 years (87,600 hours) is negligible, so the device has
ultrahigh reliability.
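The scale parameter at the use condition and the 10-year probability of failure follow directly from (8.37); a brief sketch assuming SciPy is available:

```python
from math import log
from scipy.stats import norm

b1, b2, b3, sigma = 28.913, -4.902, 1.601, 0.668   # estimates of (8.37)
I_use, G_use = 50.0, 30.0                          # operating current (mA) and usual threshold (%)

mu = b1 + b2 * log(I_use) + b3 * log(G_use)        # about 15.182
t = 10 * 8760.0                                    # 10 years in hours (87,600 h)
F = norm.cdf((log(t) - mu) / sigma)
print(round(mu, 3), F)                             # F is below 1e-7, i.e., negligible
```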
FIGURE 8.21 Lognormal fits to the failure times at different levels of current and threshold
for directly (2.03, 2.03, 3.24, 2.89) and got π11 = 0.477, ξ11 = 0.758, ξ21 =
0.800, and V = 16.1. In this case, the approximation and interpolation yield
fairly accurate results. The standardized plan is then converted to the actual
plan, as shown in Table 8.6. In implementing this plan, the six tires used for the
preliminary test should continue being tested until 13,000 miles as part of the
group at 2,835 pounds. Thus, this group requires only an additional 12 units.
• Minimize the asymptotic variance (or mean square error) of the estimate of a life percentile or other quantity at a use stress level, subject to a prespecified cost budget.
• Minimize the total test cost, subject to the allowable statistical error.
• Minimize the asymptotic variance (or mean square error) and the total test cost simultaneously.
and Yum (1999) compare numerically plans they developed for accelerated life
and degradation tests. Unsurprisingly, they conclude that accelerated degradation
test plans provide more accurate estimates of life percentiles, especially when the
probabilities of failure are small. Boulanger and Escobar (1994) design optimum
accelerated degradation test plans for a particular degradation model that may be
suitable to describe a degradation process in which the amount of degradation
over time levels off toward a stress-dependent plateau (maximum degradation).
The design consists of three steps, the first of which is to determine the stress
levels and corresponding proportions of test units by minimizing the variance
of the weighted least squares estimate of the mean of the log plateaus at the
use condition. The second step is to optimize the times at which to measure the
units at a selected stress level, then the results of the two steps are combined to
determine the total number of test units.
For degradation tests at a single constant-stress level, S. Wu and Chang (2002)
propose an approach to the determination of sample size, inspection frequency,
and test termination time. The optimization criterion is to minimize the variance
of a life percentile estimate subject to total test cost. Marseguerra et al. (2003)
develop test plans similar to those of S. Wu and Chang (2002), and consider
additionally simultaneous minimization of the variance and total test cost. The
latter part deals with the third optimization problem described above.
Few publications deal with the second optimization problem. Yu and Chiao
(2002) design optimal plans for fractional factorial degradation experiments with
the aim of improving product reliability. The plans select the inspection fre-
quency, sample size, and test termination time at each run by minimizing the
total test cost subject to a prespecified correct decision probability. Tang et al.
(2004) conduct the optimal design of step-stress accelerated degradation tests.
The objective of the design is to minimize the total test cost subject to a vari-
ance constraint. The minimization yields the optimal sample size, number of
inspections at each intermediate stress level, and number of total inspections.
Generally speaking, the optimal design of accelerated degradation tests is
considerably more difficult than that of accelerated life tests, mainly because the
former involves complicated degradation models and more decision variables. It
is not surprising that there is scant literature on this subject. As we may have
observed, degradation testing and analysis is a promising and rewarding tech-
nique. Increasingly wide applications of this technique require more practically
useful test plans.
PROBLEMS
8.1 A product usually has more than one performance characteristic. Describe
the general approaches to determination of the critical characteristic. Explain
why the critical characteristic selected must be monotone.
8.2 Discuss the advantages and disadvantages of pseudolife analysis. Compared
with the traditional life data analysis, do you expect a pseudolife analysis
to give a more accurate estimate? Why?
TABLE 8.7 Valve Recession Data (inch) at Different Inspection Times
Time (h)   Valve 1    Valve 2    Valve 3    Valve 4    Valve 5    Valve 6    Valve 7
0          0          0          0          0          0          0          0
15         0.001472   0.001839   0.001472   0.001839   0.001839   0.001839   0.002575
45         0.002943   0.004047   0.003311   0.002943   0.003311   0.002943   0.003679
120        0.005886   0.00699    0.005886   0.004415   0.005518   0.005886   0.005886
150        0.006254   0.008093   0.006622   0.005150   0.006254   0.006990   0.007726
180        0.008461   0.009933   0.008093   0.006622   0.007726   0.008461   0.010301
8.3 K. Yang and Xue (1996) performed degradation analysis of the exhaust
valves installed in a certain internal combustion engine. The degradation
is the valve recession, representing the amount of wear in a valve over
time; the valve recession data at different inspection times are shown in
Table 8.7. Develop a model to describe the degradation paths. If the valve
is said to have failed when the recession reaches 0.025 inch, estimate the
probability of failure at 500 hours through pseudolife analysis.
8.4 Refer to Problem 8.3. Suppose that the degradation model parameters for
the valve recession have random effects and have a bivariate normal dis-
tribution. Calculate the mean vector and variance–covariance matrix using
the multivariate approach. Estimate the probability of failure of the valve
at 500 hours through Monte Carlo simulation.
8.5 Refer to Problem 8.4. Calculate the probability of failure of the valve at
500 hours by using (8.13). Compare the result with those in Problems 8.3
and 8.4.
8.6 In Section 8.5.1 we describe methods for sample allocation to destructive
inspections. Explain how the methods improve the statistical accuracy of a
reliability estimate.
8.7 A type of new polymer was exposed to the alkaline environment at ele-
vated temperatures to evaluate the long-term reliability of the material. The
experiment tested standard bars of the material at 50, 65, and 80◦ C, each
with 25 units. Five units were inspected destructively for tensile strength
at each inspection time during testing. The degradation performance is the
ratio of the tensile strength to the original standard strength. The material
is said to have failed when the ratio is less than 60%. Table 8.8 shows the
values of the ratio at different inspection times (in days) and temperatures.
(a) For each combination of temperatures and inspection times, plot the
data and ML fits on lognormal paper. Comment on the adequacy of the
lognormal distribution.
(b) Does the shape parameter change with time?
TABLE 8.8 Tensile Strength Ratio (%) at Different Inspection Times and Temperatures
Time (days)   50°C (5 units)                 65°C (5 units)                 80°C (5 units)
8             98.3 94.2 96.5 98.1 96.0       87.5 85.2 93.3 90.0 88.4       80.8 82.3 83.7 86.6 81.1
25            92.4 88.1 90.5 93.4 90.2       83.2 80.5 85.7 86.3 84.2       73.3 72.3 71.9 74.5 76.8
75            86.2 82.7 84.2 86.1 85.5       77.0 73.2 79.8 75.4 76.2       67.4 65.4 64.3 65.3 64.5
130           82.3 78.5 79.4 81.8 82.3       73.9 70.1 75.8 72.3 71.7       64.3 60.4 58.6 58.9 59.7
180           77.7 74.6 76.1 77.9 79.2       68.7 65.3 69.8 67.4 66.6       60.4 55.3 56.7 57.3 55.7
1 0.1 0.3 0.7 1.2 3.0 6.6 12.1 16.0 22.5 25.3 30.0
2 2.0 2.3 4.7 5.9 8.2 9.3 12.6 12.9 17.5 16.4 16.3
3 0.3 0.5 0.9 1.3 2.2 3.8 5.5 5.7 8.5 9.8 10.7
4 0.3 0.5 0.8 1.1 1.5 2.4 3.2 5.1 4.7 6.5 6.0
5 0.2 0.4 0.9 1.6 3.9 8.2 11.8 19.5 26.1 29.5 32.0
6 0.6 1.0 1.6 2.2 4.6 6.2 10.5 10.2 11.2 11.6 14.6
7 0.2 0.4 0.7 1.1 2.4 4.9 7.1 10.4 10.8 13.7 18.0
8 0.5 0.9 1.8 2.7 6.5 10.2 13.4 22.4 23.0 32.2 25.0
9 1.4 1.9 2.6 3.4 6.1 7.9 9.9 10.2 11.1 12.2 13.1
10 0.7 0.8 1.4 1.8 2.6 5.2 5.7 7.1 7.6 9.0 9.6
11 0.2 0.5 0.8 1.1 2.5 5.6 7.0 9.8 11.5 12.2 14.2
12 0.2 0.3 0.6 0.9 1.6 2.9 3.5 5.3 6.4 6.6 9.2
13 2.1 3.4 4.1 4.9 7.2 8.6 10.8 13.7 13.2 17.0 13.9
14 0.1 0.2 0.5 0.7 1.2 2.3 3.0 4.3 5.4 5.5 6.1
15 0.7 0.9 1.5 1.9 4.0 4.7 7.1 7.4 10.1 11.0 10.5
16 1.8 2.3 3.7 4.7 6.1 9.4 11.4 14.4 16.2 15.6 16.6
17 0.1 0.2 0.5 0.8 1.6 3.2 3.7 5.9 7.2 6.1 8.8
18 0.1 0.1 0.2 0.3 0.7 1.7 2.2 3.0 3.5 4.2 4.6
19 0.5 0.7 1.3 1.9 4.8 7.7 9.1 12.8 12.9 15.5 19.3
20 1.9 2.3 3.3 4.1 5.2 8.9 11.8 13.8 14.1 16.2 17.1
21 3.7 4.8 7.3 8.3 9.0 10.9 11.5 12.2 13.5 12.4 13.8
22 1.5 2.2 3.0 3.7 5.1 5.9 8.1 7.8 9.2 8.8 11.1
23 1.2 1.7 2.0 2.5 4.5 6.9 7.5 9.2 8.5 12.7 11.6
24 3.2 4.2 5.1 6.2 8.3 10.6 14.9 17.5 16.6 18.4 15.8
25 1.0 1.6 3.4 4.7 7.4 10.7 15.9 16.7 17.4 28.7 25.9
(e) For each temperature, estimate the probability of failure at the test ter-
mination time.
(f) Repeat part (e) for a design life of 10 years.
8.8 Example 8.10 describes the degradation analysis for the IRLEDs. The de-
gradation data of the device at different measurement times (in hours) and
current levels (in mA) are shown in Tables 8.9 and 8.10.
(a) Fit the degradation model (8.36) to each degradation path, and estimate
the model parameters using the least squares method.
(b) Compute the pseudolife of each unit.
(c) For each current, plot the pseudolife data and the ML fits on lognor-
mal probability paper. Comment on the adequacy of the lognormal
distribution.
(d) Is it evident that the shape parameter depends on current?
(e) Use the inverse power relationship for current, and estimate the log-
normal scale parameter and the probability of failure at a use cur-
rent of 50 mA. Comment on the difference in results from those in
Example 8.10.
1 4.3 5.8 9.5 10.2 13.8 20.6 19.7 25.3 33.4 27.9
2 0.5 0.9 1.4 3.3 5.0 6.1 9.9 13.2 17.0 20.7
3 2.6 3.6 4.6 6.9 9.5 13.0 15.3 13.5 19.0 19.5
4 0.2 0.4 0.9 2.4 4.5 7.1 13.4 21.2 30.7 41.7
5 3.7 5.6 8.0 12.8 16.0 23.7 26.7 38.4 49.2 47.2
6 3.2 4.3 5.8 9.9 15.2 20.3 26.2 33.6 39.5 53.2
7 0.8 1.7 2.8 4.6 7.9 12.4 20.2 24.8 32.5 45.4
8 4.3 6.5 7.8 13.0 21.7 33.0 42.1 49.9 59.9 78.6
9 1.4 2.7 5.0 7.8 14.5 23.3 29.0 43.3 59.8 77.4
10 3.4 4.6 7.8 13.0 16.8 26.8 34.1 41.5 67.0 65.5
11 3.6 4.7 6.2 9.1 11.7 13.8 14.5 15.5 23.1 24.0
12 2.3 3.7 5.6 8.8 13.7 17.2 24.8 29.1 42.9 45.3
13 0.5 0.9 1.9 3.5 5.9 10.0 14.4 22.0 26.0 31.8
14 2.6 4.4 6.0 8.7 14.6 16.8 17.9 23.2 27.0 31.3
15 0.1 0.4 0.7 2.0 3.5 6.6 12.2 18.8 32.3 47.0
8.10 Refer to Example 8.1. The percent transconductance degradation data taken
at different times (in seconds) for five units of an MOS field-effect transistor
are shown in Table 8.11.
(a) Without fitting a degradation model, determine the failure time intervals
for each unit at thresholds 3%, 8%, and 13%, respectively.
(b) Estimate the life distributions at each threshold.
(c) Does the scale parameter depend on threshold?
(d) Develop a relationship between the distribution location parameter and
the threshold.
(e) Estimate the distribution location parameter at the usual threshold of
15%. Compare the result with that in Example 8.1.
8.11 To estimate the reliability of a resistor at a use temperature of 50◦ C, the
manufacturer plans to sample 45 units and divide them into two groups,
each tested at an elevated temperature. The failure of the resistor is defined
in terms of a resistance drift greater than 500 ppm (parts per million). The
tightest failure criterion is 100 ppm, and the maximum allowable tempera-
ture is 175◦ C. The time to failure of the resistor has a Weibull distribution
with shape parameter 1.63. Preestimates of the log characteristic life are
µ00 = 12.1, µ20 = 8.3, and µ22 = 5.9. Each group is tested for 2350 hours
or until all units fail, whichever is sooner. Develop a compromise test plan
for the experiment.
9 RELIABILITY VERIFICATION TESTING
9.1 INTRODUCTION
In the design and development phase of a product life cycle, reliability can
be designed into products proactively using the techniques presented in earlier
chapters. The next task is to verify that the design meets the functional, environ-
mental, reliability, and legal requirements specified in the product planning phase.
This task is often referred to in industry as design verification (DV). Reliability
verification testing is an integral part of DV testing and is aimed particularly
at verifying design reliability. If during testing a design fails to demonstrate
the reliability required, it must be revised following rigorous failure analysis.
Then the redesigned product is resubjected to verification testing. The process of
test–fix–test is continued until the reliability required is achieved. The repetitive
process jeopardizes the competitiveness of the product in the marketplace, due
to the increased cost and time to market. Nowadays, most products are designed
with the aim of passing the first DV testing. Therefore, it is vital to design-in
reliability and to eliminate potential failure modes even before prototypes are
built.
A design is released to production if it passes DV testing successfully. The pro-
duction process is then set up to manufacture products that meet all requirements
with minimum variation. As we know, the designed-in or inherent reliability level
is always degraded by process variation. This is also true for product functionality
and other performances. Therefore, prior to full production, the process must pass
a qualification test, usually called process validation (PV) in industry. Its purpose
Test-to-failure testing, testing samples until they fail, often takes longer; how-
ever, it requires fewer samples and generates considerably more information.
The actual reliability level can also be estimated from the test. The test is usually
conducted under accelerating conditions, and thus needs an appropriate accelera-
tion relationship. For some products, failure is defined in terms of a performance
characteristic crossing a threshold. As described in Chapter 8, degradation mea-
surements of products can be used to estimate reliability. Thus, it is not necessary
to test such products until failure. This advantage may make a degradation test
suitable for highly reliable products.
FIGURE 9.1 Real-world usage profile
on the reliability specifications. The test time is, however, usually too long to
be affordable. For example, testing an automotive component to a design life of 100,000 miles is prohibitively expensive and time consuming. As shown
later, the test duration is further prolonged if a reduction in sample size is essen-
tial. Because the test time has a direct influence on the total cost and time to
market, it is a major concern in planning a reliability verification test. Test plan-
ners often seek ways to shorten the test time, and consider accelerated testing
a natural choice. As stated earlier, elevated stress levels must not cause failure
modes that differ from those in the field.
If an increase in sample size is acceptable, the test time may be compressed
by testing a larger sample. The reduction in test time, however, has implications
that test planners must consider. If the failure modes have an increasing hazard
rate, failures are caused by wear-out mechanisms, which progress over time.
Thus, sufficient test duration is required to induce a significant amount of wear
out. Most mechanical and some electronic components belong to this category.
If the failure modes display a nonincreasing (constant or decreasing) hazard rate,
testing more samples for a shorter time is effective in precipitating failures. Most
electronic components and systems have such a characteristic; their test time can
safely be shortened by increasing the sample size.
The probability that the number of failures r is less than or equal to the critical
value c is
Pr(r ≤ c) = Σ_{i=0}^{c} C_n^i p^i (1 − p)^(n−i),   (9.1)
Setting p = 1 − RL and requiring that this acceptance probability not exceed 1 − C gives

Σ_{i=0}^{c} C_n^i (1 − RL)^i RL^(n−i) ≤ 1 − C.   (9.3)
If c, RL , and C are given, (9.3) can be solved for the minimum sample size.
When c = 0, which is the case in bogey testing, (9.3) reduces to
RL^n ≤ 1 − C.   (9.4)
from which the minimum sample size is

n = ln(1 − C)/ln(RL).   (9.5)
n = ln(1 − 0.9)/ln(0.9) = 22.
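In code, the minimum bogey sample size of (9.5) is a one-liner; a sketch using only the Python standard library:

```python
import math

def bogey_sample_size(RL, C):
    """Minimum zero-failure sample size demonstrating reliability RL at confidence C, per (9.5)."""
    return math.ceil(math.log(1.0 - C) / math.log(RL))

print(bogey_sample_size(0.90, 0.90))   # 22, as in the calculation above
```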
[Figure: minimum sample size n versus RL for C = 0.8, 0.85, 0.9, and 0.95]
Conversely, if a sample of n units is tested with no failures, the lower-bound reliability demonstrated at a 100C% confidence level is

RL = (1 − C)^(1/n).   (9.6)
Example 9.2 A random sample of 30 units was tested for 15,000 cycles and
produced no failures. Calculate the lower 90% confidence bound on reliability.
SOLUTION From (9.6), RL = (1 − 0.9)^(1/30) = 0.926.
Note that this reliability is at 15,000 cycles under the test conditions.
The probability of the sample of size n0 yielding zero failures is obtained from (9.1) as

Pr(r = 0) = exp[−n0 (t0/α)^β].   (9.8)
n0 = ln(1 − C)/[π^β ln(RL)],   (9.11)

where π = t0/tL is the bogey ratio.
Equation (9.11) collapses to (9.5) when the bogey ratio equals 1 and indicates
that the sample size can be reduced by increasing the bogey ratio (i.e., extending
the test time). The magnitude of reduction depends on the value of β. The larger
the value, the greater the reduction. Table 9.1 shows the sample sizes for different
values of RL , C, π, and β.
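Equation (9.11) is equally simple to evaluate. In the sketch below the Weibull shape parameter β and bogey ratio π are illustrative assumptions, not values taken from Table 9.1.

```python
import math

def extended_bogey_sample_size(RL, C, beta, pi):
    """Zero-failure sample size when each unit is tested to pi times the bogey, per (9.11)."""
    return math.ceil(math.log(1.0 - C) / (pi**beta * math.log(RL)))

# Example with assumed values: RL = 0.9, C = 0.9, Weibull shape beta = 2, bogey ratio pi = 1.5.
print(extended_bogey_sample_size(0.90, 0.90, 2.0, 1.5))   # 10 units instead of 22
```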
Equation (9.11) can be derived through another approach due partially to C.
Wang (1991). Suppose that a sample of size n0 is tested for time t0 without
failures. According to Nelson (1985), the lower 100C% confidence bound on the
Weibull scale parameter α is
αL = [2 n0 t0^β / χ²_{C,2}]^(1/β),   (9.12)
where χ²_{C,2} is the 100Cth percentile of the χ² distribution with 2 degrees of freedom. The lower bound on reliability at tL is

RL = exp[−(tL/αL)^β] = exp[−tL^β χ²_{C,2}/(2 t0^β n0)].   (9.13)
Example 9.4 Refer to Example 9.3. Suppose that the maximum allowable sam-
ple size is 10. Calculate the test time required.
As shown in Examples 9.3 and 9.4, reduction in sample size is at the expense
of increased test time. In many situations it is impossible to prolong a test.
Instead, elevation of test stress levels is feasible and practical. If the acceleration
factor Af is known between the elevated and use stress levels, the actual test
time ta is
ta = t0/Af.   (9.15)
FIGURE 9.3 Transfer function yielding (a) a dampened life distribution, (b) an amplified life distribution, and (c) an unchanged life distribution
9. Draw a sample of size nq from the population; each unit must fall in the
tail area defined by yq .
10. Test the nq units until tL . If no failures occur, RL is demonstrated at a
100C% confidence level.
In this approach, the sample size nq and tail fraction q are two important
quantities, which are discussed in the following subsections.
R(tL|tq) = Pr(t ≥ tL | t ≤ tq) = Pr(tL ≤ t ≤ tq)/Pr(t ≤ tq) = 1 − [1 − R(tL)]/q,   (9.16)
where R(tL ) is the reliability of a unit randomly selected from the entire pop-
ulation. Equation (9.16) shows that R(tL |tq ) decreases with the value of q, and
R(tL |tq ) < R(tL ) when q < 1. When q = 1, that is, the test units are randomly
drawn from the entire population, (9.16) is reduced to R(tL |tq ) = R(tL ).
FIGURE 9.4 Ratios of nq to n for different values of RL and q
FIGURE 9.5 Life variation due to variability of transfer function
nonlinear transfer functions and the resulting life distributions. In this figure, the
middle transfer function is assumed to be the correct one, and the other two con-
tain deviations. The correct one results in middle life distribution on the t-axis.
The upper transfer function produces longer lives, whereas the lower transfer
function yields shorter failure times. In the case of overestimation, the 100qth
percentile of the erroneous life distribution would require sampling below yH ,
which is larger than the correct y0 , as shown in Figure 9.5. Obviously, the test
result is optimistic; that is, the population reliability may not achieve the required reliability at the confidence level specified even if no failures occur in
the test. In the case of underestimation, the 100qth percentile of the incorrect
life distribution is transferred to yL , which is lower than the correct y0 . Con-
sequently, the test result is pessimistic. Indeed, there is a possibility that the
population reliability meets the reliability requirement even if the sample fails to
pass the test.
The stated risks vanish when the characteristic has a normal distribution and
the transfer function is linear. Suppose that y has a normal distribution N(µy , σy2 ),
and the transfer function is
t = ay + b, (9.19)
where a and b are constants. Then the life is also normal with mean µt = aµy + b
and standard deviation σt = aσy. The 100qth percentile of the life distribution is

tq = µt + zq σt = a(µy + zq σy) + b,   (9.20)

where zq is the 100qth percentile of the standard normal distribution. Converting tq back through the transfer function gives the corresponding percentile of the y distribution,

yq′ = (tq − b)/a = µy + zq σy,   (9.21)

which indicates that the (100q′)th percentile of the y distribution does not depend on the values of a and b. Therefore, variation in the estimates of a and b does not impose risks to tail testing. Furthermore, (9.21) also shows that q′ = q.
Example 9.5 A bogey test is designed to verify that a shaft meets a lower 90%
confidence bound reliability of 95% at 5 × 105 cycles under the cyclic loading
specified. From (9.5) the test requires 45 samples, which is too large in this
case, due to the cost and test time restrictions. So the tail-testing technique is
considered here. Determine the sample size for the test.
SOLUTION Calculation of the tail-testing sample size follows the steps descri-
bed in Section 9.4.1.
1. Choose the shaft characteristic to describe the fatigue life. It is known that
the fatigue life is influenced dramatically by material properties, surface
finish, and diameter. Since the variability in the first two factors is well
under control, diameter is the predominant characteristic and thus is selected
to characterize the fatigue life.
2. Develop the transfer function that relates the fatigue life to shaft diameter.
From the theory of material strength and the S –N curve (a curve plotting
the relationship between mechanical stress and the number of cycles to
failure), we derived the transfer function as
L = a y^b,   (9.22)
where L is the fatigue life, y is the diameter (in millimeters), and a and b
are constants depending on the material properties. To estimate a and b, the
historical data of a similar part made of the same material were analyzed.
Figure 9.6 shows the fatigue life of the part at various values of diameter,
and the fit of (9.22) to the data. Simple linear regression analysis gives the
estimates â = 3 × 10^−27 and b̂ = 24.764.
3. Draw 45 samples randomly from the entire population and take the mea-
surements of the diameter. Probability plot indicates that the diameter has
the normal distribution N(21.21, 0.412²).
4. Calculate the lives of the 45 units by using (9.22) with the estimates â
and b̂. The life can be modeled adequately with the lognormal distribution
LN(14.56, 0.479²).
5. Choose q = 0.3. The 30th percentile is t0.3 = 1.64 × 106 cycles, which is
obtained from the lognormal distribution.
6. Convert t0.3 to the (100q′)th percentile of the diameter distribution and get yq = 20.989 mm. From the normal distribution, we obtain the lower tail fraction of the diameter as q′ = 0.296, which is close to q = 0.3.
FIGURE 9.6 Fatigue life of a similar part at various values of diameter
Now the bogey test plan is to draw 13 units from the lower tail of the
diameter distribution such that the measurements of the diameter are less than
yq = 20.989 mm, and test the 13 units until 5 × 105 cycles under the speci-
fied cyclic loading condition. If no failures occur, we conclude that at a 90%
confidence level, the shaft population achieves 95% reliability at 5 × 105 cycles.
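The percentile conversion in steps 4 to 6 can be scripted as follows; this is a sketch assuming SciPy is available and uses the transfer-function and distribution estimates given in the example.

```python
from math import exp
from scipy.stats import norm, lognorm

a, b = 3e-27, 24.764              # transfer function L = a * y**b (estimated)
mu_y, sd_y = 21.21, 0.412         # diameter distribution N(21.21, 0.412^2), mm
mu_L, sd_L = 14.56, 0.479         # fitted lognormal life distribution LN(14.56, 0.479^2)

q = 0.3
t_q = lognorm.ppf(q, s=sd_L, scale=exp(mu_L))     # 30th percentile of life, about 1.64e6 cycles
y_q = (t_q / a) ** (1.0 / b)                      # corresponding diameter, about 20.99 mm
q_prime = norm.cdf((y_q - mu_y) / sd_y)           # lower tail fraction of diameter, about 0.296
print(round(t_q), round(y_q, 3), round(q_prime, 3))
```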
Sequential life testing is to test one unit at a time until it fails or until a pre-
determined period of time has elapsed. As soon as a new observation becomes
available, an evaluation is made to determine if (1) the required reliability is
demonstrated, (2) the required reliability is not demonstrated, or (3) the test
should be continued. Statistically speaking, sequential life testing is a hypothesis-
testing situation in which the test statistic is reevaluated as a new observation is
available and then compared against the decision rules. When rejection or accep-
tance rules are satisfied, the test is discontinued and the conclusion is arrived at.
Otherwise, the test should continue. It can be seen that the sample size required
to reach a conclusion is a random number and cannot be predetermined. Because
of the sequential nature, the test method needs fewer samples than a bogey test.
H0 : θ = θ0 , H1 : θ = θ1 ,
For n observations x1, x2, . . . , xn, the sample likelihood is

L(x1, x2, . . . , xn; θ) = Π_{i=1}^{n} f(xi; θ).   (9.24)
• Accept H0 if LRn ≤ A.
• Reject H0 if LRn ≥ B.
• Draw one more unit and continue the test if A < LRn < B.
By following the decision rules above and the definitions of type I and type
II errors, we can determine the bounds as
A = β/(1 − α),   (9.26)

B = (1 − β)/α,   (9.27)
where α is the type I error (producer’s risk) and β is the type II error (con-
sumer’s risk).
In many applications it is computationally more convenient to use the log
likelihood ratio: namely,
ln(LRn) = Σ_{i=1}^{n} ln[f(xi; θ1)/f(xi; θ0)].   (9.28)
α′ ≤ 1/B and β′ ≤ A,
where α′ and β′ denote the true values of α and β, respectively. For example, if a test specifies α = 0.1 and β = 0.05, the true errors are bounded by α′ ≤ 0.105 and β′ ≤ 0.056. It can be seen that the upper bounds are slightly higher than the specified values. Generally, the maximum relative error of α′ to α is

(α′ − α)/α = (1/B − α)/α = β/(1 − β).

The maximum relative error of β′ to β is

(β′ − β)/β = (A − β)/β = α/(1 − α).
The operating characteristic (O.C.) curve is useful in hypothesis testing. It plots the probability of accepting H0 against the true value of θ. This probability, denoted by Pa(θ), can be written as

Pa(θ) = (B^h − 1)/(B^h − A^h),   h ≠ 0,   (9.30)

where h is a constant related to the value of θ. The relationship between h and θ is defined by

∫_{−∞}^{∞} [f(x; θ1)/f(x; θ0)]^h f(x; θ) dx = 1.   (9.31)
Solving (9.31) gives θ (h). Then we can use the following steps to generate the
O.C. curve:
1. Set a series of arbitrary numbers for h which may be between, for example,
−3 and 3.
In particular, when θ = θ0 (h = 1), the probability of acceptance is

Pa(θ0) = (B − 1)/(B − A) = 1 − α.   (9.32)

When θ = θ1 (h = −1),

Pa(θ1) = (B^−1 − 1)/(B^−1 − A^−1) = β.   (9.33)
Example 9.6 Consider a sequential life test for the exponential distribution.
Suppose that θ0 = 2000, θ1 = 1000, α = 0.1, and β = 0.1. Develop the decision
bounds and O.C. curve for the test.
SOLUTION From (9.26) and (9.27), A = 0.1/(1 − 0.1) = 0.111 and B = (1 − 0.1)/0.1 = 9. Thus, if a sequential test of n units results in LRn ≤ 0.111, the null hypothesis
θ0 = 2000 is accepted. If LRn ≥ 9, the null hypothesis is rejected. If 0.111 <
LRn < 9, take one more unit and continue the test. To construct the O.C. curve
for the test, we first solve (9.31) for the exponential distribution, where
f(x; θ) = (1/θ) exp(−x/θ),   x ≥ 0.

The solution is

θ = [(θ0/θ1)^h − 1]/[h(1/θ1 − 1/θ0)].   (9.34)
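The decision bounds and O.C. curve of this example can be generated as in the sketch below (Python, standard library only; h = 0 is skipped because (9.30) is undefined there).

```python
theta0, theta1 = 2000.0, 1000.0
alpha, beta = 0.10, 0.10

A = beta / (1.0 - alpha)          # 0.111: accept H0 if LR_n <= A
B = (1.0 - beta) / alpha          # 9:     reject H0 if LR_n >= B

# O.C. curve: pick h values, map each to a true theta via (9.34) and to Pa(theta) via (9.30).
for h in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0):
    theta = ((theta0 / theta1) ** h - 1.0) / (h * (1.0 / theta1 - 1.0 / theta0))
    Pa = (B ** h - 1.0) / (B ** h - A ** h)
    print(round(theta, 1), round(Pa, 3))
# The output includes (theta1, beta) = (1000, 0.1) and (theta0, 1 - alpha) = (2000, 0.9).
```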
FIGURE 9.7 O.C. curve for the sequential test plan of Example 9.6
H0 : p = p0 , H1 : p = p1 .
For n observations, the log likelihood ratio given by (9.28) can be written as
ln(LRn) = r ln[p1(1 − p0)/(p0(1 − p1))] − n ln[(1 − p0)/(1 − p1)],   (9.36)
[Figure 9.8: reject H0, continue-test, and accept H0 regions for r versus n, bounded by the lines An and Bn]
where r is the total number of failures in n trials, r = Σ_{i=1}^{n} xi.
The continue-test region can be obtained by substituting (9.36) into (9.29).
Further simplification gives
An < r < Bn,   (9.37)

where

An = C ln[β/(1 − α)] + nC ln[(1 − p0)/(1 − p1)],
Bn = C ln[(1 − β)/α] + nC ln[(1 − p0)/(1 − p1)],
C = 1/ln[p1(1 − p0)/(p0(1 − p1))].
An and Bn are the bounds of the test. According to the decision rules, we
accept H0 if r ≤ An , reject H0 if r ≥ Bn , and draw one more unit and continue
the test if An < r < Bn . An and Bn are two parallel straight lines, as shown in
Figure 9.8. The cumulative number of failures can be plotted on the graph to
show the current decision and track the test progress.
To construct the O.C. curve for this test, we first solve (9.31) for the binomial
distribution defined by (9.35) and obtain
p = {1 − [(1 − p1)/(1 − p0)]^h} / {(p1/p0)^h − [(1 − p1)/(1 − p0)]^h}.   (9.38)

Similar to (9.30), the probability of accepting H0 is

Pa(p) = (B^h − 1)/(B^h − A^h),   h ≠ 0.   (9.39)
Then the O.C. curve can be generated by following the steps described earlier.
Example 9.7 A supplier must demonstrate by sequential life testing that the probability of failure of an airbag at the specified time and test condition does not exceed p0 = 0.001. The minimum acceptable reliability corresponds to a failure probability p1 = 0.01, and the agreed-upon risks are α = 0.05 and β = 0.1. Develop the test plan.

SOLUTION Substituting the given data into (9.37), we obtain the continue-test region 0.0039n − 0.9739 < r < 0.0039n + 1.2504. Following our decision rules, we accept H0 (the probability of failure is less than or equal to 0.001 at the specified time and test condition) if r ≤ 0.0039n − 0.9739, reject H0 if r ≥ 0.0039n + 1.2504, and take an additional unit for test if 0.0039n − 0.9739 < r < 0.0039n + 1.2504.
The minimum number of trials that lead to acceptance of H0 is determined
from (9.40) as na = 249. The minimum number of trials resulting in rejection of
H0 is calculated from (9.41) as nr = 2.
Now we compute the expected number of trials for the test. The supplier was
confident that the airbag achieves the required reliability, based on accelerated test data of a similar product that gave p = 0.0008. Substituting the value of p
into (9.38) gives h = 1.141. With the given α and β values, we obtain A =
0.1053 and B = 18. From (9.39), P a(p) = 0.9658. Then the expected number
of trials is calculated from (9.42) as E(n|p) = 289. The test plan is plotted in
Figure 9.9. The minimum numbers can also be read from the graph.
To construct an O.C. curve for the test, set h to various numbers between −3
and 3. Then calculate the corresponding values of p from (9.38) and of P a(p)
from (9.39). The plot of P a(p) versus p is the O.C. curve, shown in Figure 9.10.
It is seen that the probability of accepting H0 decreases sharply as the true p
increases when it is less than 0.005. That is, the test plan is sensitive to the
change in p in the region.
To compare the sequential life test with the bogey test, we determine the
sample size for the bogey test that demonstrates 99.9% reliability at a 90%
confidence level, which is equivalent to p0 = 0.001 and β = 0.1 in this example.
From (9.5) we obtain n = 2302. The sample size is substantially larger than 289
(the expected number of trials in the sequential life test).
FIGURE 9.9 Sequential life test plan for Example 9.7
FIGURE 9.10 O.C. curve for the sequential life test of Example 9.7
For the exponential distribution, the pdf of the time to failure is
f(t; θ) = (1/θ) exp(−t/θ),   t ≥ 0,
where t is the lifetime and θ is the mean time to failure. Sequential life testing is used to test the hypotheses
H0: θ = θ0,   H1: θ = θ1,
for which the log likelihood ratio (9.28) becomes
ln(LRn) = Σ(i=1 to n) ln{[(1/θ1) exp(−ti/θ1)] / [(1/θ0) exp(−ti/θ0)]} = n ln(θ0/θ1) − T(1/θ1 − 1/θ0),   (9.43)
where n is the total number of trials and T is the total time to failure of the n units, T = Σ(i=1 to n) ti.
From (9.29) and (9.43), the continue-test region is
An < T < Bn,   (9.44)
where
An = C ln[α/(1 − β)] + nC ln(θ0/θ1),
Bn = C ln[(1 − α)/β] + nC ln(θ0/θ1),
C = θ0 θ1/(θ0 − θ1).
Note that the observation in the test is the time to failure. The decision variable
is the total time to failure, not the total number of failures. Thus, the decision rules
are that we accept H0 if T ≥ Bn , reject H0 if T ≤ An , and take an additional unit
and continue the test if An < T < Bn . The shortest route to the reject decision
is testing
n = ln[(1 − β)/α] / ln(θ0/θ1)
units which fail at time zero. The shortest route to the accept decision is testing
one unit that survives at least the time given by
B1 = C ln[(1 − α)/β] + C ln(θ0/θ1).
The O.C. curve for the test plan can be developed using (9.30) and (9.34). The
procedure was described in Example 9.6.
The use of (9.44) requires testing units individually to failure. Compared with
a truncation test, the test method reduces sample size and increases test time.
This is recommended when accelerated testing is applicable. Sometimes we may
be interested in simultaneous testing of a sample of sufficient size. The decision
rules and test plans are described in, for example, Kececioglu (1994) and MIL-
HDBK-781 (U.S. DoD, 1996).
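A minimal Python sketch of the exponential decision rules is given below, using θ0 = 2000 and θ1 = 1000 hours as in Example 9.6; the observed total time to failure and number of failures in the example call are hypothetical.

import math

def exponential_sequential_decision(T, n, theta0, theta1, alpha, beta):
    """Decision based on the total time to failure T of n failed units, per the continue-test region (9.44)."""
    C = theta0 * theta1 / (theta0 - theta1)
    An = C * math.log(alpha / (1.0 - beta)) + n * C * math.log(theta0 / theta1)
    Bn = C * math.log((1.0 - alpha) / beta) + n * C * math.log(theta0 / theta1)
    if T >= Bn:
        return "accept H0"
    if T <= An:
        return "reject H0"
    return "continue test"

# Hypothetical state: 3 failures with a total time to failure of 5000 hours
print(exponential_sequential_decision(T=5000.0, n=3, theta0=2000.0, theta1=1000.0, alpha=0.1, beta=0.1))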
Since T > B5 , we conclude that the product achieves the MTTF of 5000 hours.
The sequential test results and decision process are plotted in Figure 9.11. It
is seen that the accumulated test time crosses the Bn bound after a test of
5 units.
FIGURE 9.11 Sequential life test plan and results of Example 9.8
For the Weibull distribution with known shape parameter m, the continue-test region is
An < T < Bn,   (9.45)
where
T = Σ(i=1 to n) ti^m,
An = C ln[α/(1 − β)] + nmC ln(η0/η1),
Bn = C ln[(1 − α)/β] + nmC ln(η0/η1),
C = (η0 η1)^m / (η0^m − η1^m).
When the shape parameter is not known in advance, it may be estimated from the test data as testing proceeds, using the following steps (a sketch of the procedure is given after this list):
1. Test at least three units, one at a time, until all have failed.
2. Estimate the shape and scale parameters from the test data.
3. Calculate An and Bn using the estimate of the shape parameter.
4. Apply the decision rules to the failure times in the order in which they were
observed. If a reject or accept decision is made, stop the test. Otherwise,
go to step 5.
5. Take one more unit and continue the test until it fails or until a decision to
accept is reached. If it fails, go to step 2.
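A rough Python sketch of this adaptive procedure is shown below. It re-estimates the shape parameter with scipy.stats.weibull_min after each failure and applies bounds of the same form as the Weibull continue-test region above (accepting when T ≥ Bn, by analogy with the exponential case). All numerical inputs in the example call are assumptions for illustration.

import math
from scipy.stats import weibull_min

def weibull_bounds(n, m, eta0, eta1, alpha, beta):
    """Bounds An, Bn of the continue-test region for T = sum(t_i**m), per (9.45)."""
    C = (eta0 * eta1) ** m / (eta0 ** m - eta1 ** m)
    slope = n * m * C * math.log(eta0 / eta1)
    An = C * math.log(alpha / (1.0 - beta)) + slope
    Bn = C * math.log((1.0 - alpha) / beta) + slope
    return An, Bn

def adaptive_weibull_sequential(failure_times, eta0, eta1, alpha, beta):
    """Steps 1-5: re-estimate the shape parameter after each failure and re-apply the decision rules."""
    for n in range(3, len(failure_times) + 1):           # step 1: start with at least three failures
        data = failure_times[:n]
        m, _, _ = weibull_min.fit(data, floc=0)           # step 2: estimate the shape (location fixed at 0)
        An, Bn = weibull_bounds(n, m, eta0, eta1, alpha, beta)   # step 3: recompute An and Bn
        T = sum(t ** m for t in data)                     # transformed total time to failure
        if T >= Bn:                                       # step 4: apply the decision rules
            return "accept H0"
        if T <= An:
            return "reject H0"
    return "continue test"                                # step 5: otherwise take one more unit

# Hypothetical inputs: eta0 = 5000, eta1 = 3500 cycles, alpha = 0.05, beta = 0.10
times = [4200.0, 6100.0, 5300.0, 4800.0]
print(adaptive_weibull_sequential(times, 5000.0, 3500.0, 0.05, 0.10))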
Although the test data provide a better estimate of the shape parameter, the
estimate may still have a large deviation from the true value. The deviation, of
course, affects the actual type I and type II errors. Therefore, it is recommended that the sensitivity of the test plan to the uncertainty of the estimate be assessed. Sharma and Rana (1993) and Hauck and Keats (1997) present formulas
for examining the response of P a(η) to misspecification of the shape parame-
ter and conclude that the test plan is not robust against a change in the shape
parameter.
SOLUTION Substituting the given data into (9.45), we obtain the continue-test
region as
−106 × 10^6 + 11.1 × 10^6 n < Σ(i=1 to n) ti^1.5 < 82.7 × 10^6 + 11.1 × 10^6 n.
The test plan is plotted in Figure 9.12. Note that the vertical axis T is the total
transformed failure time. The O.C. curve is constructed by using (9.30) and (9.34)
and the transformation θi = ηim (i = 0, 1). Figure 9.13 shows an O.C. curve that
plots the probability of acceptance at different true values of the Weibull scale
parameter, where η = θ 1/m .
FIGURE 9.12 Sequential life test plan of Example 9.9
FIGURE 9.13 O.C. curve for the sequential life test of Example 9.9
3. Calculate the likelihood (the conditional joint pdf) of the n independent observations given θ:
L(x1, x2, . . . , xn|θ) = Π(i=1 to n) h(xi|θ),
where h(x|θ) is the pdf of a single observation.
4. Calculate the joint pdf of the n independent observations from the test and of parameter θ. This is done by multiplying the conditional joint pdf and the prior pdf: namely,
f(x1, x2, . . . , xn; θ) = L(x1, x2, . . . , xn|θ)ρ(θ).
5. Calculate the marginal pdf of the observations:
k(x1, x2, . . . , xn) = ∫ L(x1, x2, . . . , xn|θ)ρ(θ) dθ.
6. Calculate the posterior pdf of θ given the observations:
g(θ|x1, x2, . . . , xn) = L(x1, x2, . . . , xn|θ)ρ(θ) / ∫ L(x1, x2, . . . , xn|θ)ρ(θ) dθ = f(x1, x2, . . . , xn; θ)/k(x1, x2, . . . , xn).
7. Devise a test plan by using the posterior pdf of parameter θ and the type
I and type II errors specified.
The procedures above are illustrated below through application to the devel-
opment of a bogey test. Although complicated mathematically, sequential life
tests using the Bayesian method have been reported in the literature. Interested
readers may consult Sharma and Rana (1993), Deely and Keats (1994), B. Lee
(2004), and F. Wang and Keats (2004).
As described in Section 9.3.1, binomial bogey testing is intended to demonstrate RL at a 100C% confidence level. The reliability of the sampled product population, say R, is treated as a random
variable, and its prior distribution is assumed known. It is well accepted by,
for example, Kececioglu (1994), Kleyner et al. (1997), and Guida and Pulcini
(2002), that the prior information on R can be modeled with a beta distribution
given by
ρ(R) = R^(a−1) (1 − R)^(b−1) / β(a, b),   0 ≤ R ≤ 1,
where β(a, b) = Γ(a)Γ(b)/Γ(a + b), Γ(·) is the gamma function, and a and b are unknown parameters to be estimated from past data. Martz and Waller (1982)
provide methods for estimating a and b.
Because a bogey test generates a binary result (either success or failure), the test outcome is described by the binomial distribution with a given R. If the n units tested produce no failures (x1 = 0, x2 = 0, . . . , xn = 0), the likelihood is R^n, and combining it with the beta prior gives the posterior pdf
g(R|x1 = 0, x2 = 0, . . . , xn = 0) = R^(a+n−1) (1 − R)^(b−1) / β(a + n, b).   (9.46)
Note that the posterior distribution is also the beta distribution, but the parameters
are a + n and b.
The bogey test plan with no failures allowed is to determine the sample size
required to demonstrate RL at a 100C% confidence level. This is equivalent to
selecting n such that the probability of R not less than RL is equal to C. Then
we have
∫(RL to 1) g(R|x1 = 0, x2 = 0, . . . , xn = 0) dR = C,
or
∫(RL to 1) [R^(a+n−1) (1 − R)^(b−1) / β(a + n, b)] dR = C.   (9.47)
Equation (9.47) is solved numerically for n. The sample size is smaller than that
given by (9.5).
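Since the posterior of R is a beta distribution with parameters a + n and b, (9.47) can be solved for n by a simple search over the beta survival function. The Python sketch below does this with scipy.stats.beta; the prior parameters a and b in the example call are assumed values.

from scipy.stats import beta

def bayesian_bogey_sample_size(RL, C, a, b, n_max=10000):
    """Smallest zero-failure sample size n such that Pr(R >= RL | data) >= C, per (9.47)."""
    for n in range(0, n_max + 1):
        # Posterior of R is Beta(a + n, b); beta.sf gives Pr(R >= RL)
        if beta.sf(RL, a + n, b) >= C:
            return n
    return None

# Illustrative prior (assumed): a = 5, b = 1.2; demonstrate R_L = 0.9 at C = 0.9
print(bayesian_bogey_sample_size(RL=0.9, C=0.9, a=5.0, b=1.2))
# For comparison, the classical bogey test (9.5) needs n = ln(1 - C)/ln(RL), about 22 units here.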
If the lower bound is greater than RL , we conclude that the product meets the
reliability requirement at a 100C% confidence level.
The calculation above uses the known specific forms of µy (t; β) and σy (t; θ ).
In practice, they are often unknown but can be determined from test data. First
we estimate the location and scale parameters at each inspection time. Then
linear or nonlinear regression analysis of the estimates yields the specific functions. Nonlinear regression analysis is described in, for example, Seber and
Wild (2003). In many applications, the scale parameter is constant. This greatly
simplifies subsequent analysis.
The approach described above is computationally intensive. Here we present
an approximate yet simple method. Suppose that a sample of size n is tested
until t0 , where t0 < tL . If the test were terminated at tL , the sample would yield r
failures. Then p̂ = r/n estimates the probability of failure p at tL . The number of
failures r is unknown; it may be calculated from the pseudolife method described
in Chapter 8. In particular, a degradation model is fitted to each degradation
H0 : p ≤ 1 − RL , H1 : p > 1 − RL .
When the sample size is relatively large and p is not extremely close to zero or
one, the test statistic
Z0 = [r − n(1 − RL)] / √(nRL(1 − RL))   (9.48)
can be approximated with the standard normal distribution. The decision rule
is that we accept H0 at a 100C% confidence level if Z0 ≤ zC , where zC is the
100Cth percentile of the standard normal distribution.
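A minimal Python sketch of this decision rule, using the normal approximation of (9.48), is shown below; the sample size, number of projected failures, RL, and C in the example call are assumed values.

from statistics import NormalDist

def verify_reliability(r, n, RL, C):
    """Accept H0 (p <= 1 - RL) at the 100C% confidence level if Z0 <= z_C, per (9.48)."""
    z0 = (r - n * (1.0 - RL)) / (n * RL * (1.0 - RL)) ** 0.5
    zC = NormalDist().inv_cdf(C)          # 100Cth percentile of the standard normal distribution
    decision = "accept H0" if z0 <= zC else "reject H0"
    return decision, z0, zC

# Assumed illustration: 60 units tested, 4 pseudo-failures projected at t_L, R_L = 0.90, C = 0.90
print(verify_reliability(r=4, n=60, RL=0.90, C=0.90))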
PROBLEMS
9.1 Describe the pros and cons of the bogey test, sequential life test, and degradation test for reliability verification.
9.2 Find the minimum sample size to demonstrate R95/C95 by bogey testing.
If the sample size is reduced to 20, what is the confidence level? If a test
uses a sample of 25 units and generates no failures, what is the lower-bound
reliability demonstrated at a 90% confidence level?
9.3 A manufacturer wants to demonstrate that a new micro relay achieves a
lower 90% confidence bound reliability of 93.5% at 25,000 cycles. The relay
has a Weibull distribution with shape parameter 1.8. How many units shall
be tested for 25,000 cycles? If the test schedule can accommodate 35,000
cycles, what is the resulting sample size? If only 12 units are available for
the test, how many cycles should be run?
9.4 Redo Problem 9.3 for cases in which the shape parameter has a ±20%
deviation from 1.8. Compare the results with those of Problem 9.3.
9.5 Explain the rationale of tail testing. Discuss the benefits, risks, and limita-
tions of the test method.
9.6 A manufacturer wants to demonstrate the reliability of a new product by
sequential life testing. The required reliability is 0.98, and the minimum
acceptable reliability is 0.95. Develop a binomial sequential test plan to
verify the reliability at α = 0.05 and β = 0.1. How many units on aver-
age does the test need to reach a reject or accept decision? What is the
probability of accepting the product when the true reliability is 0.97?
9.7 The life of an electronic system has the exponential distribution. The system
is designed to have an MTBF of 2500 hours with a minimum acceptable
MTBF of 1500 hours. The agreed-upon producer and consumer risks are
10%. Develop and plot the sequential life test plan and the O.C. curve.
Suppose that the test has yielded two failures in 600 and 2300 hours. What
is the decision at this point?
9.8 A mechanical part has a Weibull distribution with shape parameter 2.2. The
manufacturer is required to demonstrate the characteristic life of 5200 hours
with a minimum acceptable limit of 3800 hours. The probability is 0.95
of accepting the part that achieves the specified characteristic life, while
the probability is 0.9 of rejecting the part that has 3800 hours. What are
the decision rules for the test? Develop an O.C. curve for the test plan.
What are the probabilities of accepting the part when the true values of the
characteristic life are 5500 and 3500 hours?
9.9 Redo Problem 9.8 for cases in which the shape parameter has a ±20%
deviation from 2.2. Comment on the differences due to the changes in
shape parameter.
9.10 Derive the formulas for evaluating the sensitivity of Pa to misspecification
of the shape parameter of the Weibull distribution.
9.11 To demonstrate that a product achieved the lower 90% confidence bound
reliability of 90% at 15,000 hours, a sample of 55 units was subjected
to degradation testing. The test lasted 2500 hours and yielded no failures.
Degradation analysis gave the reliability estimate as
R(t) = 1 − Φ[(ln(t) − 10.9)/1.05].
Does the product meet the reliability requirement specified?
10
STRESS SCREENING
10.1 INTRODUCTION
For a product whose performance degrades over time, a failure is said to have
occurred if a performance characteristic (say, y) crosses a specified threshold.
The faster the degradation, the shorter the life. Thus, the life is determined
by the degradation rate of y. A population of the products usually contains a
fraction of both good and substandard units. In practice, good products over-
whelmingly outnumber substandard ones. Stressed at an elevated level during screening, substandard units degrade much faster than good ones, producing the bimodal behavior illustrated in Figure 10.1.
FIGURE 10.1 Difference between degradation rates causing a bimodal distribution
FIGURE 10.2 Relationship between the bimodal distributions of y and life
In practice, one often conducts two-level screening; that is, part- and module-level
screening, where a module may be a board, subsystem, or system. The purpose
of part-level screening is to weed out the substandard parts by subjecting the
part population to an elevated stress level. The screened parts are then assembled
into a module. Because the assembly process may introduce defects, the module
is then screened for a specified duration. In this section we focus on part-level
degradation screening.
Part-level degradation screening stresses products at an elevated level. Let t′p denote the screen duration at this stress level. Then the equivalent time tp at the use stress level is
tp = Ap t′p,   (10.2)
where Ap is the acceleration factor and can be estimated using the theory of
accelerated testing discussed in Chapter 7. For example, if temperature is the
screening stress, the Arrhenius relationship may be applied to determine the
value of Ap .
A part, good or bad, subjected to degradation screening may fail to pass or survive screening. For a part having a monotonically increasing performance characteristic, the probability p0 of the part passing the screen can be written as
p0 = α1 Pr[y1(tp) ≤ G*] + α2 Pr[y2(tp) ≤ G*],   (10.3)
where α1 and α2 are the fractions of the substandard and good subpopulations, y1 and y2 are the performance characteristics of the substandard and good parts, and G* is the screening threshold.
The probability p1 that a part passing the screen is from the substandard subpopulation is
p1 = (α1/p0) Pr[y1(tp) ≤ G*].   (10.4)
The probability p2 that a part passing the screen is from the good subpopulation is
p2 = (α2/p0) Pr[y2(tp) ≤ G*].   (10.5)
The field reliability at time t of a part from the screened population is
Rp(t) = p1 Rp1(t|tp) + p2 Rp2(t|tp),   (10.6)
where Rp1(t|tp) is the field reliability at t of a substandard part from the screened population and Rp2(t|tp) is the field reliability at t of a good part from the screened population. Because
Rpi(t|tp) = Rpi(t + tp)/Rpi(tp) = Pr[yi(t + tp) ≤ G0] / Pr[yi(tp) ≤ G0],   i = 1, 2,
Rp(t) = p1 Pr[y1(t + tp) ≤ G0]/Pr[y1(tp) ≤ G0] + p2 Pr[y2(t + tp) ≤ G0]/Pr[y2(tp) ≤ G0].   (10.7)
Substituting (10.4) and (10.5) into (10.7) gives
Rp(t) = θ1 Pr[y1(t + tp) ≤ G0] + θ2 Pr[y2(t + tp) ≤ G0],   (10.8)
where
θi = (αi/p0) Pr[yi(tp) ≤ G*] / Pr[yi(tp) ≤ G0],   i = 1, 2.   (10.9)
The pdf of the time to failure of a part from the screened population is obtained by differentiating (10.8):
fp(t) = θ1 fp1(t + tp) + θ2 fp2(t + tp),   (10.10)
where
fpi(t + tp) = −d Pr[yi(t + tp) ≤ G0]/dt,   i = 1, 2.
Suppose that yi(t) is normally distributed with constant standard deviation σyi and mean µyi(t) = β1i + β2i t, where β1i is the mean of the initial values of yi before screening and β2i is the degradation rate of yi. These parameters can be estimated from preliminary test data.
From (10.11) and (10.12), the pdf of a part from the screened population is
fp(t) = (θ1/σt1) φ[(t − µt1)/σt1] + (θ2/σt2) φ[(t − µt2)/σt2],   (10.13)
where φ(·) is the pdf of the standard normal distribution and
µti = (G0 − β1i − β2i tp)/β2i,   σti = σyi/β2i,   i = 1, 2.   (10.14)
The corresponding probability of failure is
Fp(t) = θ1 Φ[(t − µt1)/σt1] + θ2 Φ[(t − µt2)/σt2],   (10.15)
where Φ(·) is the cdf of the standard normal distribution. Equation (10.15) indicates that the life distribution of a substandard or good part from the screened population has a normal distribution with mean µti and standard deviation σti.
For the components of Example 10.1, the mean degradation at the 35°C use condition is estimated as
µy1 = (0.23/22)t = 0.0105t,   µy2 = (0.018/22)t = 0.8182 × 10^-3 t.
The equivalent screen time at 35◦ C is tp = 22 × 120 = 2640. From the
data given, the fractions of the substandard and good subpopulations are α̂1 =
12/180 = 0.0667 and α̂2 = 1 − 0.0667 = 0.9333.
The estimate of the probability of a defective component escaping the screen is
Pr[y1(tp) ≤ G*] = Φ[(G* − µy1(tp))/σy1] = Φ[(12 − 0.0105 × 2640)/4.5] = 0.0002.
Similarly, Pr[y2(tp) ≤ G*] = 0.9989, Pr[y1(tp) ≤ G0] = 0.2728, and Pr[y2(tp) ≤ G0] ≈ 1. From (10.3), p̂0 = 0.0667 × 0.0002 + 0.9333 × 0.9989 = 0.9323. Then (10.9) gives
θ1 = (0.0667 × 0.0002)/(0.9323 × 0.2728) = 0.5245 × 10^-4,
θ2 = (0.9333 × 0.9989)/(0.9323 × 1) = 0.99997.
From (10.14), after screening, the estimates of the mean and standard deviation of the life distribution of the defective components are µ̂t1 = (25 − 0.0105 × 2640)/0.0105 = −259 hours and σ̂t1 = 4.5/0.0105 = 429 hours; for the good components, µ̂t2 = (25 − 0.8182 × 10^-3 × 2640)/(0.8182 × 10^-3) = 27,915 hours and σ̂t2 = 3.2/(0.8182 × 10^-3) = 3911 hours.
After operating 20,000 hours at 35◦ C, the screened components are estimated
using (10.15) to have a probability of failure
F̂p(20,000) = 0.5245 × 10^-4 × Φ[(20,000 + 259)/429] + 0.99997 × Φ[(20,000 − 27,915)/3911] = 0.0215.
3911
From (10.1), if the component population were not screened, the probability
of failure at 20,000 hours would be
Pr(y ≥ 25) = 0.0667 × {1 − Φ[(25 − 0.0105 × 20,000)/4.5]} + 0.9333 × {1 − Φ[(25 − 0.8182 × 10^-3 × 20,000)/3.2]} = 0.07.
Therefore, the screen reduces the probability of failure at the time of 20,000 hours
by 0.07 − 0.0215 = 0.0485. Figure 10.3 plots the probabilities of failure at
different times for both the screened and unscreened populations. It can be seen that the improvement is retained until about 22,500 hours, after which the probability of failure is actually worsened by the screening. This is understandable.
Nearly all defective components would fail before 22,500 hours. After this time,
the failure is dominated by the good components. Because of the screen stress
effects, a screened good component has a greater degradation percentage than an
unscreened good component, causing the higher probability of failure.
FIGURE 10.3 Probabilities of failure of the screened and unscreened populations at different times
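The numerical results above can be reproduced with the short Python sketch below, which evaluates the screened-population probability of failure (10.15) and the unscreened probability of failure with the degradation models of this example.

from statistics import NormalDist

Phi = NormalDist().cdf

# Estimates from the example: theta_i from (10.9), mu_ti and sigma_ti from (10.14)
theta1, mu1, sigma1 = 0.5245e-4, -259.0, 429.0      # substandard subpopulation
theta2, mu2, sigma2 = 0.99997, 27915.0, 3911.0      # good subpopulation

def Fp(t):
    """Probability of failure at field time t for a screened part, per (10.15)."""
    return theta1 * Phi((t - mu1) / sigma1) + theta2 * Phi((t - mu2) / sigma2)

def F_unscreened(t, G0=25.0):
    """Probability of failure of an unscreened part, with the use-level degradation models above."""
    return (0.0667 * (1.0 - Phi((G0 - 0.0105 * t) / 4.5))
            + 0.9333 * (1.0 - Phi((G0 - 0.8182e-3 * t) / 3.2)))

t = 20000.0
print(Fp(t), F_unscreened(t))    # approximately 0.0215 and 0.07, as in the example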
The screened parts are assembled into a module according to the design configu-
ration, where a module may refer to a board, subsystem, or system, as described
earlier. The assembly process usually consists of multiple steps, each of which
may introduce different defects, including, for example, weak solder joints, loose
connections, and cracked wire bonds. Most of the defects are latent and cannot be
detected in final production tests. When stressed in the field, they will fail in early
life. Therefore, it is often desirable to precipitate and remove such defects before
customer delivery. This can be accomplished by performing module-level screen-
ing. During the screening, defective connections dominate the failure. Meanwhile,
the already-screened parts may also fail. So in this section we model failure of
both parts and connections.
Then the expected number of renewals Np (t) within time interval [0, t] is
Np(t) = ∫(0 to t) hp(t) dt.   (10.17)
The next step is to transform hp (s) inversely to the renewal density function in the
time domain [i.e., hp (t)]. Then Np (t) is calculated from (10.17). Unfortunately,
the Laplace transform for most distributions (e.g., the Weibull) is intractable. In
most situations, it is more convenient to use the following renewal equation to
calculate Np (t). The renewal equation is
Np(t) = Fp(t) + ∫(0 to t) Np(t − x) fp(x) dx.   (10.19)
This renewal equation for Np (t) is a special case of a Volterra integral equation
of the second kind, which lies in the field of numerical analysis. Many numeri-
cal methods have been proposed to solve the equation. However, these methods
426 STRESS SCREENING
typically suffer from an accumulation of round-off error when t gets large. Using
the basic concepts in the theory of Riemann–Stieltjes integration (Nielsen, 1997),
Xie (1989) proposes a simple and direct solution method with good convergence
properties. As introduced below, this method discretizes the time and computes
recursively the renewal function on a grid of points.
For a fixed t ≥ 0, let the time interval [0, t] be partitioned to 0 = t0 < t1 <
t2 < · · · < tn = t, where ti = id for a given grid size d > 0. For computational
simplification set Ni = Np (id), Fi = Fp [(i − 0.5)d], and Ai = Fp (id), 1 ≤ i ≤ n.
The recursion scheme for computing Ni is
Ni = [Ai + Σ(j=1 to i−1) (Nj − Nj−1) Fi−j+1 − Ni−1 F1] / (1 − F1),   1 ≤ i ≤ n,   (10.20)
starting with N0 = 0. The recursion scheme is remarkable in resisting the accu-
mulation of round-off error as t gets larger and gives surprisingly accurate results
(Tijms, 1994). Implementation of the recursion algorithm needs a computer pro-
gram, which is easy to code. In computation, the grid size d has a strong influence
on the accuracy of the result. The selection depends on the accuracy required,
the shape of Fp (t), and the length of the time interval. A good way to determine
whether the results are accurate enough is to do the computation for both grid
sizes d and d/2. The accuracy is satisfactory if the difference between the two
results is tolerable.
When t is remarkably longer than the mean of the life distribution, the expected
number of renewals can be simply approximated by
Np(t) ≈ t/µt + (σt² − µt²)/(2µt²),   (10.21)
where µt and σt are the mean and standard deviation of the life distribution fp (t).
Note that (10.21) gives an exact result when µt and σt are equal. This is the case
for the exponential distribution. In practice, the approximation has adequate accuracy for a moderate value of t provided that c²x = σt²/µt² is not too large or too close to zero. Numerical investigations indicate that for practical purposes, (10.21) can be used for t ≥ tx (Tijms, 1994), where
tx = (3/2) c²x µt   if c²x > 1,
tx = µt   if 0.2 < c²x ≤ 1,   (10.22)
tx = µt/(2c²x)   if 0 < c²x ≤ 0.2.
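The rule of thumb (10.22) can be checked with a few lines of Python; the sketch below computes the mean, variance, and squared coefficient of variation of the screened-population life distribution of the example above and the resulting threshold tx, which comes out close to the value quoted in the example that follows.

# Mixture parameters of the screened-population life distribution (10.13)
theta1, mu1, sigma1 = 0.5245e-4, -259.0, 429.0
theta2, mu2, sigma2 = 0.99997, 27915.0, 3911.0

# Mean and variance of the mixed normal distribution
mu_t = theta1 * mu1 + theta2 * mu2
var_t = theta1 * (sigma1**2 + mu1**2) + theta2 * (sigma2**2 + mu2**2) - mu_t**2
cx2 = var_t / mu_t**2                      # squared coefficient of variation

# Threshold tx of (10.22); here cx2 is about 0.02, so the third case applies
if cx2 > 1:
    tx = 1.5 * cx2 * mu_t
elif cx2 > 0.2:
    tx = mu_t
else:
    tx = mu_t / (2.0 * cx2)

print(cx2, tx)    # cx2 is about 0.02 and tx is about 7.1e5 hours, so (10.21) cannot be used at t = 20,000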
SOLUTION Using the data in Example 10.1, we have cx2 ≈ 0.02, and tx =
711,062. Because t = 20,000 is considerably smaller than tx = 711,062, (10.21)
cannot approximate the expected number of renewals. In this case, (10.20) is
used. The recursion scheme is coded in Visual Basic running on Excel. The
source codes are given in Table 10.2 and can readily be modified for other distri-
butions. The recursive calculation yields Np (20,000) = 0.0216, which is nearly
equal to the probability of failure at 20,000 hours obtained in Example 10.1. This
is understandable. As shown in Figure 10.3, the component has an extremely low
probability of failure within 20,000 hours, allowing (10.19) to be approximated
by Np (t) ≈ Fp (t). For a service life of 50,000 hours, the expected number of
renewals is Np (50,000) = 1.146, which is calculated by setting T0 = 50,000
in the computer program. In contrast, the probability of failure at this time is
approximately 1. Np(t) is plotted in Figure 10.4 to illustrate how the expected number of renewals increases with time. It is seen that Np(t) reaches a plateau between 35,000 and 45,000 hours. In Problem 10.10 we ask for an explanation.
TABLE 10.2  Visual Basic Code for Computing Np(t)

Sub Np()
    Dim N(5000), A(5000), F(5000)
    ' Time horizon T0 (hours) and grid size d of the recursion (10.20)
    T0 = 20000
    D = 10
    M = T0 / D                        ' number of grid points
    N(0) = 0                          ' Np(0) = 0
    ' Parameters of the screened-population life distribution (10.13)
    Mean1 = -259                      ' mean life of the defective components
    Sigma1 = 429
    Mean2 = 27915                     ' mean life of the good components
    Sigma2 = 3911
    Theta1 = 0.00005245
    Theta2 = 0.99997
    ' F(i) = Fp[(i - 0.5)d] and A(i) = Fp(id), evaluated from the mixed normal cdf
    For i = 0 To M
        ZF1 = ((i - 0.5) * D - Mean1) / Sigma1
        ZF2 = ((i - 0.5) * D - Mean2) / Sigma2
        ZA1 = (i * D - Mean1) / Sigma1
        ZA2 = (i * D - Mean2) / Sigma2
        FP1 = Application.WorksheetFunction.NormSDist(ZF1)
        FP2 = Application.WorksheetFunction.NormSDist(ZF2)
        AP1 = Application.WorksheetFunction.NormSDist(ZA1)
        AP2 = Application.WorksheetFunction.NormSDist(ZA2)
        F(i) = Theta1 * FP1 + Theta2 * FP2
        A(i) = Theta1 * AP1 + Theta2 * AP2
    Next i
    ' Recursion scheme (10.20)
    For i = 1 To M
        Sum = 0
        For j = 1 To i - 1
            Sum = Sum + (N(j) - N(j - 1)) * F(i - j + 1)
        Next j
        N(i) = (A(i) + Sum - N(i - 1) * F(1)) / (1 - F(1))
    Next i
    Cells(1, 1) = N(M)                ' write Np(T0) to cell A1
End Sub
Let t′c denote the module-level screen duration. The equivalent screen time tc at the use stress level is
tc = Ac t′c,
where Ac is the acceleration factor between the module-level screen stress and
the use stress for the connection. The value of Ac may vary with the type of
connection, and can be estimated with the theory of accelerated testing described
in Chapter 7.
FIGURE 10.4 Plot of Np (t)
When a module undergoes screening, all parts within the module are aged at
the same time. The aging effects depend on the type of part. Some parts may
be more sensitive than others to screening stress. Nevertheless, all parts suffer
performance degradation during screening, causing permanent damage. For a
particular part, the amount of degradation is determined by the screen stress level
and duration. The equivalent aging time tcp for a part at the use stress level is
tcp = Acp t′c,
where Acp is the acceleration factor between the module-level screen stress and
the use stress for the part.
Using the mixed Weibull distribution, the reliability of an unscreened connec-
tion of a certain type can be written as
Rc(t) = ρ1 Rc1(t) + ρ2 Rc2(t) = ρ1 exp[−(t/η1)^m1] + ρ2 exp[−(t/η2)^m2],   (10.23)
where
Rc (t) = reliability of an unscreened connection,
Rc1 (t) = reliability of an unscreened substandard connection,
Rc2 (t) = reliability of an unscreened good connection,
ρ1 = fraction of the substandard connections,
ρ2 = fraction of the good connections,
m1 = Weibull shape parameter of the substandard connections,
m2 = Weibull shape parameter of the good connections,
η1 = Weibull characteristic life of the substandard connections,
η2 = Weibull characteristic life of the good connections.
Nf(τ) = Nc(τ + tc) − Nc(tc)
      = −ln { [ρ1 exp(−((τ + tc)/η1)^m1) + ρ2 exp(−((τ + tc)/η2)^m2)] / [ρ1 exp(−(tc/η1)^m1) + ρ2 exp(−(tc/η2)^m2)] }.   (10.25)
For the solder joints considered here, the expected number of repairs during screening is N̂c(372) = −ln[0.04 × exp(−(372/238)^0.63) + 0.96 × exp(−(372/12,537)^2.85)] = 0.0299.
From (10.25), the expected number of repairs to a solder joint by the end of
warranty time (τ = 1500) is
N̂f(1500) = −ln { [0.04 × exp(−((1500 + 372)/238)^0.63) + 0.96 × exp(−((1500 + 372)/12,537)^2.85)] / [0.04 × exp(−(372/238)^0.63) + 0.96 × exp(−(372/12,537)^2.85)] } = 0.0143.
For an unscreened solder joint, the reliability at the end of warranty time is
calculated from (10.23) as
R̂c(1500) = 0.04 × exp[−(1500/238)^0.63] + 0.96 × exp[−(1500/12,537)^2.85] = 0.9594.
From (10.24), if the board were not screened, the expected number of repairs to
a solder joint by the end of warranty time would be
N̂c (1500) = − ln[Rc (1500)] = − ln(0.9594) = 0.0414.
The benefit from the module-level screening is evident from comparing the values of N̂f(1500) and N̂c(1500).
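The solder-joint calculations above can be reproduced with the Python sketch below, which evaluates (10.23) and (10.25) with the mixed Weibull parameters of this example.

import math

# Mixed Weibull parameters of the solder joints (from the example)
rho1, m1, eta1 = 0.04, 0.63, 238.0        # substandard connections
rho2, m2, eta2 = 0.96, 2.85, 12537.0      # good connections
tc = 372.0                                # equivalent module-level screen time at the use stress
tau = 1500.0                              # warranty time

def Rc(t):
    """Reliability of an unscreened connection, per (10.23)."""
    return rho1 * math.exp(-(t / eta1) ** m1) + rho2 * math.exp(-(t / eta2) ** m2)

def Nf(tau, tc):
    """Expected number of repairs in (tc, tc + tau] for a screened connection, per (10.25)."""
    return -math.log(Rc(tau + tc) / Rc(tc))

print(Nf(tau, tc))              # about 0.0143 for the screened joint
print(-math.log(Rc(tau)))       # about 0.0414 if the board were not screened (the text rounds Rc first)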
Suppose that a module ceases to function when any part or connection fails;
that is, the parts and connections are in series. As presented in Chapter 4, if the
parts and connections are independent of each other, the reliability of the module
can be written as
Rm(τ) = Π(i=1 to nP) [Rpi(τ)]^Li × Π(j=1 to nC) [Rcj(τ)]^Kj,   (10.28)
where nP and nC are the numbers of part types and connection types in the module, and Li and Kj are the quantities of type i parts and type j connections, respectively. The part and connection data are summarized in the following table.
Part or Connection     Type (i or j)   Quantity (Li or Kj)   Reliability at τ
Resistor, 10 kΩ              1                 3                 0.9995
Resistor, 390 Ω              2                 1                 0.9998
Resistor, 27 kΩ              3                 2                 0.9991
Capacitor                    4                 2                 0.9986
LED                          5                 1                 0.9995
Transistor                   6                 1                 0.9961
SM connection                1                16                 0.9999
PTH connection               2                 7                 0.9998
Rpi(τ) and Rcj(τ) are calculated from (10.26) and (10.27), respectively. Substituting the data into (10.28) gives
R̂m(τ) = 0.9995^3 × 0.9998 × 0.9991^2 × 0.9986^2 × 0.9995 × 0.9961 × 0.9999^16 × 0.9998^7 = 0.9864.
FIGURE 10.5 Costs as a function of screen duration
screen stress levels. Figure 10.5 shows in-house screen cost, field repair cost, and
total cost as a function of screen duration. The in-house screen cost includes the
part- and module-level screen costs, and the field repair cost involves the part
replacement cost and connection repair cost.
The in-house screen and field repair costs incurred by parts consist of the
following elements:
1. Cost of screen setup
2. Cost of screen for a specified duration
3. Cost of good parts being screened out
4. Cost of repair at the module-level screen and in the field
The part-cost model can be written as
TP = Csp + Σ(i=1 to nP) Cpi Li tpi + Σ(i=1 to nP) (Cgpi + Cpi tpi) Li α2 Pr[y2(tpi) ≥ G*i]
   + Σ(i=1 to nP) Cphi Li Npi(tcpi) + Σ(i=1 to nP) Cpfi Li [Npi(τ + tcpi) − Npi(tcpi)],   (10.29)
Similarly, the in-house screen and field repair costs incurred by connections
are comprised of cost elements 1, 2, and 4 given above. Then the connection-cost
model can be written as
TC = Csc + Cc tc + Σ(j=1 to nC) Cchj Kj Nc(tcj) + Σ(j=1 to nC) Ccfj Kj [Nc(τ + tcj) − Nc(tcj)],   (10.30)
where j denotes a type j connection. Combining the two cost models gives
TM = TP + TC,   (10.31)
where TM is the total cost incurred by a module (parts and connection) due to
the screen and repair. It represents an important segment of the life cycle cost of
the module.
Min(TM), (10.32a)
subject to
Rm (τ ) ≥ R0 , (10.32b)
G∗i ≤ G0i , (10.32c)
G∗i ≥ yai , (10.32d)
tpi , tc , G∗i ≥ 0, (10.32e)
where tc , tpi , and G∗i (i = 1, 2, . . . , nP ) are decision variables and yai is the min-
imum allowable threshold for a part of type i. Constraint (10.32c) is imposed to
accelerate the screening process and to reduce the damage to good parts; (10.32d)
is required for some parts whose degradation is not stable until yai is reached.
The implications of other constraints are straightforward.
The actual number of decision variables in (10.32) depends on nP , which may
be large for a medium-scale module. In these situations, it is important to lump all
similar parts and reduce the size of nP . For example, the three types of resistors
in Example 10.4 may be grouped into one type because their reliabilities and
cost factors are close and the degradation thresholds (defined as the resistance
drift percentage) are the same. The optimization model can be solved using a nonlinear programming technique such as the Lagrangian approach or the penalization method. Bertsekas (1996), for example, provides a good description of these approaches.
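The structure of such a penalized optimization can be sketched in Python as follows. The cost function TM and reliability function Rm here are toy stand-ins (the actual (10.28)-(10.31) require the module's cost factors and degradation parameters), so the sketch only illustrates how the penalization method turns the constrained problem (10.32) into an unconstrained one; only constraints (10.32b), (10.32c), and (10.32e) are shown.

from scipy.optimize import minimize

R0 = 0.99            # required module reliability (assumed)
G0 = 100.0           # usual failure threshold (assumed)

def TM(x):
    # Toy stand-in for the total cost (10.31) as a function of (tp, tc, G*)
    tp, tc, Gs = x
    return 500.0 + 0.8 * tp + 2.0 * tc + 0.02 * (G0 - Gs) ** 2 + 5000.0 / (tp + tc + 1.0)

def Rm(x):
    # Toy stand-in for the module reliability (10.28) after screening for (tp, tc)
    tp, tc, Gs = x
    return 1.0 - 0.05 / (1.0 + 0.02 * tp + 0.05 * tc)

def penalized(x, w=1.0e6):
    tp, tc, Gs = x
    penalty = w * max(0.0, R0 - Rm(x)) ** 2            # constraint (10.32b)
    penalty += w * max(0.0, Gs - G0) ** 2              # constraint (10.32c)
    penalty += w * sum(max(0.0, -v) ** 2 for v in x)   # constraint (10.32e): nonnegativity
    return TM(x) + penalty

result = minimize(penalized, x0=[50.0, 10.0, 50.0], method="Nelder-Mead")
print(result.x, TM(result.x))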
Choose optimal values of tp , tc , and G∗ that minimize TM and meet the reliability
requirement.
SOLUTION The optimization model for the problem is calculated using the
penalization method and yields tp = 51 hours, tc = 13.4 hours, G∗ = 23.1, and
TM = $631.68.
Now let’s discuss the significance and implication of the optimal screen plan.
First, reducing the threshold from the usual one (G0 = 100) to the optimal value
(G∗ = 23.1) lowers the life cycle cost. To show this, Figure 10.6 plots TM for
various values of G∗ . TM is calculated by choosing optimal tp and tc at a given
G∗ . The minimum TM ($631.68) is achieved at G∗ = 23.1. If the usual threshold
(G0 = 100) were used in screening, the TM would be $750. The saving due to
use of the optimal G∗ is (750 − 631.68)/750 = 15.8%.
The optimal G∗ also alleviates the aging effect of screen stress on good parts.
Figure 10.7 shows the mean values µy2 of y2 immediately after the module-level
screening for various values of G∗ . The µy2 decreases with G∗ , indicating that
FIGURE 10.6 TM versus G∗
FIGURE 10.7 µy2 versus G∗
the degradation of a good part caused by the screen stress can be mitigated by
use of a smaller G∗ . If the usual threshold (G0 = 100) were used in screening,
µy2 would be 22.4. Use of the optimal tightened threshold (G∗ = 23.1) reduces
the degradation by (22.4 − 13.2)/22.4 = 41.1%.
Now let’s look at how the value of tp affects the cost elements. Figure 10.8
plots the following costs for various values of tp :
Cost 1 decreases sharply as t′p increases toward 51 hours. However, cost 1 increases with t′p once it goes beyond 300 hours, because excessive screening appreciably degrades good parts. This differs from the classical cost models, which ignore the aging effects of the screen stress on good parts. Cost 2
FIGURE 10.8 Cost 1, cost 2, cost 3, and TP for various values of t′p
FIGURE 10.9 Expected number of field failures per 1000 parts versus t′p
increases with tp because the degradation of good parts increases the proba-
bility of screening out good parts. TP has an optimum value which achieves the
best compromise among costs 1, 2, and 3.
The expected number of field failures per 1000 parts by the end of τ = 43,800 hours is plotted in Figure 10.9 for various values of t′p. As t′p increases, the number of field failures decreases. Once t′p reaches 51 hours, the number of field failures remains nearly constant. But as t′p increases further beyond 300 hours, the number of field failures increases considerably, due to the degradation of good parts caused by the screen stress.
PROBLEMS
10.2 Explain the advantages and disadvantages of the commonly used screening
techniques, including burn-in, ESS, HASS, discriminator screening, and
degradation screening.
10.3 For a product that is said to have failed when its monotonically decreasing
performance characteristic crosses a specified threshold, formulate and
depict the relationship between the bimodal distributions of the life and
characteristic.
10.4 A degradation screening requires products to be aged at an elevated stress
level for a certain length of time. Explain why.
10.5 A type of part whose failure is defined in terms of y ≤ G0 is subjected to degradation screening at an elevated stress level for a length of time tp.
Develop formulas for calculating the following:
(a) The probability of a part, substandard or good, passing the screen.
(b) The probability that a part passing the screen is from the substandard
subpopulation.
(c) The probability that a part passing the screen is from the good sub-
population.
(d) The reliability of a part from the screened population.
(e) The pdf of a part from the screened population.
10.6 Refer to Problem 10.5. Suppose that the performance characteristic y can
be modeled with the lognormal distribution. Calculate parts (a) through (e).
10.7 Revisit Example 10.1.
(a) Explain why, after screening, the mean life of the defective compo-
nents is negative.
(b) Work out the pdf of the components before screening.
(c) Calculate the pdf of the components after screening.
(d) Plot on the same chart the pdfs of the components before and after
screening. Comment on the shape of the pdf curves.
10.8 An electronic component is said to have failed if its performance char-
acteristic exceeds 85. The component population contains 8% defective
units and is subjected to degradation screening for 110 hours at an ele-
vated stress level. The acceleration factor between the screen and use
stress levels is 18. A unit is considered defective and is weeded out if
the performance reaches 25 at the end of screening. Suppose that the per-
formance is modeled using the normal distribution and that at the use
stress level the degradation models are µy1 = 4.8 + 0.021t, σy1 = 5.5,
µy2 = 4.8 + 0.0018t, and σy2 = 3.7.
(a) Determine the equivalent screen time at the use stress level.
(b) Calculate the probability of a defective component escaping the screen.
(c) Compute the probability of a defect-free component surviving the
screen.
11
WARRANTY ANALYSIS
11.1 INTRODUCTION
In the context of the product life cycle, warranty analysis is performed in the
field deployment phase. In the earlier phases, including product planning, design
and development, verification and validation, and production, a product team
should have accomplished various well-orchestrated reliability tasks to achieve
the reliability requirements in a cost-effective manner. However, it does not mean
that the products would not fail in the field. In fact, some products would fail
sooner than others for various reasons, such as improper operation, production
process variation, and inadequate design. The failures not only incur costs to
customers but often result in reputation and potential sales losses to manufactur-
ers. Facing intense global competition, today most manufacturers offer warranty
packages to customers to gain competitive advantage. The role of warranty in
marketing a product is described in, for example, Murthy and Blischke (2005).
In the marketplace, lengthy warranty coverage has become a bright sales point
for many commercial products, especially for those that may incur high repair
costs. Furthermore, it is often employed by manufacturers as a weapon to crack
a new market. A recent example is that of South Korean automobiles, which
entered North American markets with an unprecedented warranty plan covering
the powertrain system for five years or 50,000 miles. This contrasts with the
three-year, 36,000-mile plan offered by most domestic automakers. In addition
to “voluntary” offers, government agencies may also mandate extended warranty
coverage for certain products whose failures can result in severe consequences,
such as permanent damage to the environment and loss of life. For instance, U.S.
federal regulations require that automobile catalytic converters be warranted for
eight years or 80,000 miles, since failure of the subsystem increases toxic emis-
sions to the environment. In short, warranty offers have been popular in modern
times, so warranty analysis has become increasingly important.
When products fail under warranty coverage, customers return their products
for repair or replacement. The failure data, such as the failure time, failure mode,
and use condition, are made known to manufacturers. Often, manufacturers main-
tain warranty databases to record and track these data. Such data contain precious
and credible information about how well products perform in the field, and thus
should be fully analyzed to serve different purposes. In general, warranty analyses
are performed to:
warranty on the corrosion of automobile parts typically covers five years and
unlimited mileage, because corrosion is closely related to calendar time, not to
mileage. In general, the time scale may be the calendar time, usage (e.g., mileage
or cycles), or others. For many products used intermittently, the failure process
is often more closely related to usage than to calendar time. Thus, usage should
serve as one of the scales for defining a warranty period. However, due to the dif-
ficulty in tracking usage for warranty purposes, calendar time is often employed
instead. Washing machines, for instance, are warranted for a period of time and
not for the cycles of use, although most failures result from use. When accu-
mulated use is traceable, it is often used in conjunction with calendar time. A
common example is the automobile bumper-to-bumper warranty policy described
above, which specifies both the calendar time in service and mileage.
Among the three elements of a warranty policy, the warranty period is proba-
bly most influential on warranty costs. Lengthy warranty coverage erodes a large
portion of revenues and deeply shrinks profit margins; however, it increases cus-
tomer satisfaction and potential sales. Manufacturers often determine an optimal
period by considering the effects of various factors, including, for example, prod-
uct reliability, cost per repair, sales volume, unit price, legal requirements, and
market competition. If the failure of a product can result in a substantial loss
to society, the manufacturer may have little discretion in setting the warranty terms.
Instead, governmental regulations usually mandate an extended warranty period
for the product. Another important element of a warranty policy is failure cov-
erage. Normally, a warranty covers all failures due to defective materials or
workmanship. However, damage caused by conditions other than normal use,
such as accident, abuse, or improper maintenance, is usually excluded. Fail-
ure coverage is often an industry standard; individual sellers would not like to
override it. In contrast, sellers have greater room to manipulate the seller’s and
buyer’s financial responsibility for warranty services. This results in different
warranty policies. The most common ones are as follows:
1. Free replacement policy. When a product fails within the warranty period
and failure coverage, it is repaired or replaced by the seller free of charge to
the buyer. For a failure to be eligible for the policy, it must meet the warranty
period and failure coverage requirements. Under this policy, the seller has to pay
all costs incurred by the warranty service, including fees for materials, labor,
tax, disposal, and others. Because of the substantial expenditure, sellers usually
limit this policy to a short warranty length unless a longer period is stipulated
by regulations.
2. Pro-rata replacement policy. When a product fails within the warranty
period and failure coverage, it is repaired or replaced by the seller at a fraction
of the repair or replacement cost to the buyer. The cost to the buyer is proportional
to the age of the product at failure. The longer the product has been used, the
more the buyer has to pay; this is reasonable. The cost is a function of the age relative to the warranty length. Under this policy, customers may be responsible
for tax and service charges. Let’s consider the tire example given earlier. If a
tire is blown out 15 months after the original purchase and the remaining tread depth is 6/32 inch at failure, it is subject to the pro-rata replacement policy. Suppose that a comparable new tire has a tread depth of 11/32 inch and sells at $70. Then the cost to the customer is (11/32 − 6/32)/(9/32) × 70 = $38.89 plus applicable tax, where 9/32 inch is the usable life.
3. Combination free and pro-rata replacement policy. This policy specifies
two warranty periods, say t1 and t0 (t1 < t0 ). If a product fails before t1 expires
and the failure is covered, it is repaired or replaced by the seller free of charge
to the buyer. When a failure occurs in the interval between t1 and t0 and is under
the failure coverage, the product is repaired or replaced by the seller at a fraction
of the repair or replacement cost to the buyer.
The warranty policies described above are nonrenewing; that is, the repair or
replacement of a failed product does not renew the warranty period. The repaired
or replaced product assumes the remaining length of the original warranty period.
The warranty policies for repairable products are often nonrenewing. In contrast,
under renewing policies, the repaired or replaced products begin with a new
warranty period. The policies cover mostly nonrepairable products.
A warranty period may be expressed in two dimensions. For most commercial
products, the two dimensions represent calendar time and use. As soon as one
of the two dimensions reaches its warranty limit, the warranty expires, regard-
less of the magnitude of the other dimension. If a product is operated heavily,
the warranty will expire well before the warranty time limit is reached. On the
other hand, if a product is subjected to light use, the warranty will expire well
before the use reaches the warranty usage limit. The two-dimensional warranty
policy greatly reduces the seller’s warranty expenditure and shifts costs
to customers. This policy is depicted in Figure 11.1, where t and u are, respec-
tively, the calendar time and usage, and the subscript 0 implies a warranty limit.
Figure 11.1 shows that the failures occurring inside the window are covered and
those outside are not. Let’s revisit the automobile and tire warranty examples
given earlier. The automobile bumper-to-bumper warranty is a two-dimensional
(time in service and mileage) free replacement policy, where the warranty time
and mileage limits are 36 months and 36,000 miles. The General Tire warranty
on passenger tires is a two-dimensional combination free and pro-rata replace-
ment policy, where the warranty periods t1 and t0 are two-dimensional vectors
FIGURE 11.1 Two-dimensional warranty coverage
with t1 equal to 12 months and the first 2/32 inch, and t0 equal to 72 months and 2/32 inch of tread remaining.
Product Data The data often include product serial number, production date,
plant identification, sales date, sales region, price, accumulated use, warranty
repair history, and others. Some of these data may be read directly from the
failed products, whereas others need to be extracted from serial numbers. The
data may be analyzed for different purposes. For example, the data are useful in
identification of unusual failure patterns in certain production lots, evaluation of
relationship between field reliability and sales region (use environment), study of
customer use, and determination of the time from production to sales. Manufac-
turers often utilize product data, along with failure data and repair data (discussed
below), to perform buyback analysis, which supports a decision as to whether it
is profitable for manufacturers to buy back from customers certain products that
have generated numerous warranty claims.
Failure Data When a failure is claimed, the repair service provider should
record the data associated with the failure, such as the customer complaint symp-
toms, use conditions at failure, and accumulated use. After the failure is fixed,
the diagnosis findings, failure modes, failed part numbers, causes, and postfix test
results must be documented. It is worth noting that the failure modes observed by
repair technicians usually are not the same as the customer complaint symptoms,
since customers often lack product knowledge and express what they observed in
nontechnical terms. However, the symptom description is helpful in diagnosing
and isolating a failure correctly and efficiently.
Repair Data Such data should contain the labor time and cost, part numbers
serviced, costs of parts replaced, technician work identification and affiliation,
date of repair, and others. In warranty repair practice, an unfailed part close to
a failure state may be adjusted, repaired, or replaced. Thus, it is possible that
the parts serviced may outnumber the parts that failed. The repair data should be
analyzed on a regular basis to track warranty spending, identify the opportunity
for improving warranty repair procedures, and increase customer satisfaction. In
addition, manufacturers often utilize the data to estimate cost per repair, warranty
cost per unit, and total warranty cost.
1. Define the objective of the warranty data analysis. The objective includes,
but is not limited to, determination of monetary reserves for warranty, projection
of warranty repairs or replacements to the end of warranty period, estimation
of field reliability, identification of critical failure modes, manufacturing process
improvement, and evaluation of fix effectiveness. This step is critical, because the
types of data to be retrieved vary with the objective. For example, the estimation
of field reliability uses first failure data, whereas the warranty repair projection
includes repeat repairs.
2. Determine the data scope. A warranty database usually contains three
categories of data, including the product data, failure data, and repair data, as
described earlier. In this step, one should clearly define what specific warranty
data in each category are needed to achieve the objective. For example, if the
objective is to evaluate the effectiveness of a part design change, the products
must be grouped into two subpopulations, one before and one after the time
when the design change is implemented in production, for the purpose of com-
parison. The grouping may be done by specifying the production dates of the
two subpopulations.
3. Create data search filters and launch the search. In this step, one has to
interrogate the warranty database by creating search filters. In the context of a warranty database, a filter is a characteristic of a product, failure, or repair.
The filters are established such that only the data defined in the data scope are
extracted from the database. Upon establishment of the filters, a data search may
be initiated. The time a search takes can vary considerably, depending on the size
of the database, the complexity of the filters, and the speed of the computers.
4. Format the data representation. When the data search is completed, one may
download the data sets and orchestrate them in a format with which subsequent
data analyses are efficient. Some comprehensive databases are equipped with
basic statistical tools to generate graphical charts, descriptive statistics, probability
plots, and others. A preliminary analysis using such tools is good preparation for
a more advanced study.
R(t) = 1 − Fα(t) − Fβ(t),   (11.1)
where Fα(t) is the probability that no failure occurs and the OBD system detects
a failure, and Fβ (t) is the probability that a failure occurs and the OBD system
does not detect the failure. The objective of warranty data analysis is to estimate
the reliability of the OBD system installed in Ford Motor Company’s vehicle A
of model year B. Determine the data mining strategy.
SOLUTION To calculate R(t) using (11.1), we first need to work out Fα (t)
and Fβ (t) from the warranty data. The probabilities are estimated from the times
to first failure of the vehicles that generate the α and β errors. The life data can
be obtained by searching Ford’s warranty database, called the Analytic Warranty
System. The data search strategy is as follows.
1. By definition, the warranty claims for α error are those that show a trou-
ble light and result in no part repair or replacement. Hence, the filters
for retrieving such claims specify: Part Quantity = 0, Material Cost = 0,
product and do not tolerate much performance degradation. Warranty claims are
frequently made against products that have degraded significantly but have not
failed technically. In today’s tough business climate, many such products will
be repaired or replaced to increase customer satisfaction. Even if there is nei-
ther repair nor replacement, premature claims still incur diagnostic costs. Hence,
such claims result in pessimistic estimates of reliability, warranty cost, and other
quantities of interest.
5. The population of products in service decreases with time, and the number of units removed from service is often unknown to manufacturers. In warranty analysis, the sales
volume is assumed to be the working population and clearly overestimates the
number of units actually in service. For example, automakers do not have accurate
knowledge of the number of vehicles that are under warranty and have been
salvaged due to devastating accidents. Such vehicles are still counted in many
warranty analyses.
service. Let ni be the number of products sold in time period i and rij be the
number of failures occurring in time period j to the units sold in time period i,
where i = 1, 2, . . . , k, j = 1, 2, . . . , k, and k is the maximum time in service. In
the automotive industry, k is often referred to as the maturity. Note that products
sold in the first time period have the maximum time in service. In general, the data
can be tabulated as in Table 11.1, where TTF stands for time to failure and TIS for time in service, and ri· = Σ(j=1 to i) rij, r·j = Σ(i=j to k) rij, and r·· = Σ(i=1 to k) ri· = Σ(j=1 to k) r·j.
Here ri· is the total number of failures among ni units, r·j the total number of
failures in j periods, and r·· the total number of failures among all products sold.
Since the failure time of a product is less than or equal to its time in service, the
failure data are populated diagonally in Table 11.1.
TABLE 11.1  Layout of warranty data
TIS                       TTF (j)
(i)         1      2      3     ···     k       ri·      ni
1          r11                                  r1·      n1
2          r21    r22                           r2·      n2
3          r31    r32    r33                    r3·      n3
·          ·      ·      ·      ···             ·        ·
k          rk1    rk2    rk3    ···    rkk      rk·      nk
Total      r·1    r·2    r·3    ···    r·k      r··
TABLE 11.2  Life data derived from the warranty data
TTF or Censoring Time     Number of Failures     Number of Survivals
1                          r·1                    n1 − r1·
2                          r·2                    n2 − r2·
3                          r·3                    n3 − r3·
·                          ·                      ·
k                          r·k                    nk − rk·
SOLUTION Because the products are warranted for 12 months, the failure data
for the 765 units in service for 13 months are available only up to 12 months.
These units are treated as having 12 months in service and are combined with
the 1358 units having 12 months in service to calculate the number of survivals.
The life data are shown in Table 11.4.
The life data in Table 11.4 were analyzed using Minitab. Graphical analysis
indicates that the lognormal distribution fits the data adequately. Figure 11.2
shows the lognormal probability plot, least squares fits, and the two-sided 90%
percentile confidence intervals. Further analysis using the maximum likelihood
method yields estimates of the scale and shape parameters as µ̂ = 6.03 and
σ̂ = 1.63. The estimate of the probability of failure at the end of warranty time is
F̂(12) = Φ[(ln(12) − 6.03)/1.63] = 0.0148.
TABLE 11.3  Warranty data of the washing machines
TIS                            TTF (months)
(months)    1  2  3  4  5  6  7  8  9 10 11 12    ri·       ni
 1          0                                       0       568
 2          0  1                                    1       638
 3          0  1  1                                 2       823
 4          0  0  1  1                              2     1,231
 5          0  0  1  1  0                           2     1,863
 6          1  0  1  0  1  2                        5     2,037
 7          1  1  3  2  1  4  8                    20     2,788
 8          2  3  2  6  2  1  6  4                 26     2,953
 9          1  2  0  3  4  2  2  3  6              23     3,052
10          1  3  2  2  3  4  3  5  4  8           35     2,238
11          0  1  2  0  4  2  1  2  0  4  3        19     1,853
12          1  0  3  1  3  2  1  3  3  2  1  2     22     1,358
13          2  0  0  2  1  0  0  2  1  0  2  1     11       765
Total       9 12 16 18 19 17 21 19 14 14  6  3    168    22,167
TABLE 11.4  Life data of the washing machines
TTF or Censoring       Number of      Number of
Time (months)          Failures       Survivals
 1                         9               568
 2                        12               637
 3                        16               821
 4                        18             1,229
 5                        19             1,861
 6                        17             2,032
 7                        21             2,768
 8                        19             2,927
 9                        14             3,029
10                        14             2,203
11                         6             1,834
12                         3             2,090
FIGURE 11.2 Lognormal plot, least squares fits, and 90% confidence intervals for the
washing machine data
The number of units that would fail by the end of the warranty period is
0.0148 × 22,167 = 328. Since 168 units have failed up to the current month,
there would be an additional 328 − 168 = 160 warranty claims.
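The projection can be reproduced with a few lines of Python using the lognormal estimates µ̂ = 6.03 and σ̂ = 1.63 obtained above.

from statistics import NormalDist
import math

mu_hat, sigma_hat = 6.03, 1.63
n_total, n_failed = 22167, 168
warranty_months = 12

# Probability of failure by the end of the warranty period under the fitted lognormal distribution
F12 = NormalDist().cdf((math.log(warranty_months) - mu_hat) / sigma_hat)

expected_claims = F12 * n_total                  # expected failures by 12 months
additional_claims = expected_claims - n_failed   # claims still to come

print(round(F12, 4), round(expected_claims), round(additional_claims))   # about 0.0148, 328, and 160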
sooner) in the United States. The failure of such products is time and usage
dependent; in other words, the reliability of products is a function of time and
use. Modeling two-dimensional reliability provides more realistic estimates. Such
models are needed by manufacturers to evaluate reliability, to predict warranty
claims and costs, and to assess customer satisfaction.
In this section we describe a practical approach to modeling and estimating
two-dimensional reliability from warranty data. More statistical methods for this
topic are given in, for example, Blischke and Murthy (1994, 1996), Lawless et al.
(1995), Eliashberg et al. (1997), S. Yang et al. (2000), H. Kim and Rao (2000),
G. Yang and Zaghati (2002), and Jung and Bai (2006).
The probability that a product fails by usage m and time t is
Pr(M ≤ m, T ≤ t) = ∫(0 to t) ∫(0 to m) fM,T(m, t) dm dt,   (11.2)
where M denotes the usage to failure, T the time to failure, and fM,T(m, t) the joint probability density function (pdf) of M and T.
The reliability is the probability that a product survives both usage m and time
t, and can be written as
R(m, t) = Pr(M ≥ m, T ≥ t) = ∫(t to ∞) ∫(m to ∞) fM,T(m, t) dm dt.   (11.3)
FIGURE 11.3 Time–usage plane partitioned into four regions
Note that this two-dimensional reliability is not the complement of the probability
of failure. This is because failures may occur in regions II and III. The probability
of failure in region II is
Pr(M ≥ m, T ≤ t) = ∫(0 to t) ∫(m to ∞) fM,T(m, t) dm dt.   (11.4)
Apparently, the probabilities of failure in the four regions add to 1. Then the two-dimensional reliability can be written as
R(m, t) = 1 − FT(t) − FM(m) + Pr(M ≤ m, T ≤ t),
where FT(t) and FM(m) are, respectively, the marginal probabilities of failure of T and M. The joint pdf can be decomposed as
fM,T(m, t) = fM|T(m) fT(t),
where fM|T(m) is the conditional pdf of M at a given time T and fT(t) is the marginal pdf of T. In the following subsections we present methods for estimating the two pdf’s from warranty data.
FIGURE 11.4 Usage distributions at various times
distributions at different times in service, where the shaded areas represent the
fractions of the population falling outside the warranty usage limit. In the con-
text of life testing, the two-dimensional warranty policy is equivalent to a dual
censoring. Such censoring biases the estimation of fT (t) and fM|T (m) because
failures occurring at a usage greater than u0 are unknown to manufacturers. The
bias is exacerbated as T increases toward t0 and more products exceed the war-
ranty usage limit. In reliability analysis, it is important to correct the bias. This
can be accomplished by using a usage accumulation model, which describes
the relationship between the usage and time. Here we present two approaches to
modeling usage accumulation: the linear accumulation method and the sequential
regression analysis.
FIGURE 11.5 Fractions of vehicles exceeding the warranty mileage limit at various months in service
As the time in service increases, a growing fraction of the population accumulates usage exceeding u0. The number of products that have dropped out of warranty coverage and failed, denoted r′_{t_2+1}, is estimated by

r'_{t_2+1} = \frac{1 - p_0}{p_0}\, r_{\cdot(t_2+1)},   (11.15)

where r_{\cdot(t_2+1)} is the number of warranted products that fail at time period t_2 + 1, and

p_0 = \Pr[U \le u_0 \mid \mu_u(t_2 + 1), \sigma_u(t_2 + 1)].
Then the usage data at time period t_2 + 1 can be viewed as censored data: the usage values of the r_{\cdot(t_2+1)} recorded failures are known, and the unrecorded r′_{t_2+1} failed units are considered to be right censored at u0. The same type of distribution selected earlier is fitted to these censored data, and we obtain the new estimates
µ̂u (t2 + 1) and σ̂u (t2 + 1). The distribution with these new estimates has a better
fit than the projected distribution, and thus µ̂u (t2 + 1) and σ̂u (t2 + 1) are added to
the existing data series µ̂u (t) and σ̂u (t) (t = t1 , t1 + 1, . . . , t2 ) to update estimates
of the regression model parameters θ1 and θ2 . This projection and regression
process repeats until the maximum time period k is reached. Equations (11.13)
and (11.14) of the last update constitute the final usage accumulation model
and are used to calculate Pr(U ≥ u0 |t). The projection and regression process is
explained pictorially in Figure 11.6.
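To make the projection-and-regression loop concrete, the following sketch outlines one possible implementation. It assumes a lognormal usage distribution whose log-mean follows µu(t) = θ1 + θ2 ln(t) with a roughly constant log-standard deviation, and a usage limit of 36,000 miles as in the vehicle examples; the regression form, the data structures, and the function names are our own assumptions, not the book's.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

U0 = 36_000  # assumed warranty usage limit (miles)

def fit_censored_lognormal(usage_obs, n_censored, mu0, sigma0):
    """MLE of a lognormal usage distribution when n_censored unrecorded
    failures are treated as right censored at U0."""
    logs = np.log(np.asarray(usage_obs, dtype=float))
    def nll(p):
        mu, sigma = p
        ll = norm.logpdf(logs, mu, sigma).sum()               # recorded usages
        ll += n_censored * norm.logsf(np.log(U0), mu, sigma)  # dropouts censored at u0
        return -ll
    res = minimize(nll, x0=[mu0, sigma0], method="Nelder-Mead")
    return res.x  # (mu_u, sigma_u)

def sequential_regression(mu_hist, sig_hist, months_hist, usage_by_month, k):
    """Extend usage-accumulation estimates from t2 = months_hist[-1] up to month k.
    usage_by_month[t] holds the recorded usages to failure of warranted claims at t."""
    mu_hist, sig_hist, months = list(mu_hist), list(sig_hist), list(months_hist)
    for t in range(months[-1] + 1, k + 1):
        # Regress the existing estimates on ln(t) and project month t.
        slope, intercept = np.polyfit(np.log(months), mu_hist, 1)
        mu_proj = intercept + slope * np.log(t)
        sig_proj = float(np.mean(sig_hist))        # simplest choice: constant sigma
        # Fraction still under the usage limit and the implied dropout failures (11.15).
        p0 = norm.cdf((np.log(U0) - mu_proj) / sig_proj)
        r_obs = len(usage_by_month[t])
        r_dropout = r_obs * (1.0 - p0) / p0
        # Refit with the dropouts right censored at u0, then update the series.
        mu_t, sig_t = fit_censored_lognormal(usage_by_month[t], r_dropout,
                                             mu0=mu_proj, sigma0=sig_proj)
        mu_hist.append(mu_t); sig_hist.append(sig_t); months.append(t)
    return months, mu_hist, sig_hist
```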
The cdf of the Weibull distribution with shape parameter β and scale parameter
α is
F(t) = 1 - \exp\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right], \quad t > 0.   (11.17)
The corresponding cumulative hazard rate is H(t) = (t/\alpha)^{\beta}, and taking logarithms gives \ln[H(t)] = \beta\ln(t) - \beta\ln(\alpha), which indicates that the Weibull cumulative hazard rate is a linear function of time t on a log-log scale. If a data set plotted on this log-log scale is close to a straight line, the life can be modeled with a Weibull distribution. The Weibull
shape parameter equals the slope of the straight line; the scale parameter is
calculated from the slope and intercept.
Similarly, the exponential distribution with hazard rate λ has the cumulative hazard rate H(t) = \lambda t, which is linear in t. For the normal distribution with mean \mu and standard deviation \sigma,

\Phi^{-1}[1 - e^{-H(t)}] = -\frac{\mu}{\sigma} + \frac{1}{\sigma}\,t.   (11.21)
For the lognormal distribution with scale parameter \mu and shape parameter \sigma,

\Phi^{-1}[1 - e^{-H(t)}] = -\frac{\mu}{\sigma} + \frac{1}{\sigma}\ln(t).   (11.22)
Nelson (1972, 1982) describes hazard plotting in detail.
To utilize the transformed linear relationships (11.19) through (11.22) to esti-
mate the life distribution, one first must calculate the hazard rate. For a continuous
nonnegative random variable T representing time to failure, the hazard function
h(t) is defined by
h(t) = \lim_{\Delta t \to 0}\frac{N(t) - N(t + \Delta t)}{N(t)\,\Delta t},   (11.23)
Accounting for the products that have exceeded the warranty usage limit, the hazard rate can be estimated from warranty data as

h(t) = \frac{r(t)}{\Pr(U \le u_0 \mid t)\,N(t)},   (11.25)

where r(t) is the number of first warranty repairs between times t and t + 1.
Applying (11.25) to the warranty data in Table 11.1, we obtain the estimate of
the hazard rate in time period j as
\hat{h}_j = \frac{r_{\cdot j}}{\Pr(U \le u_0 \mid j)\,N_j},   (11.26)
where

N_j = \sum_{i=j}^{k}\left(n_i - \sum_{l=1}^{j-1} r_{il}\right).   (11.27)
The cumulative hazard rate up to time period j is then estimated by

\hat{H}_j = \sum_{i=0}^{j} \hat{h}_i.   (11.28)
SOLUTION The total number of first failures in each month is given in Table 11.5
and repeated in Table 11.6 for the sake of calculation. The number of surviving vehi-
cles at month j is calculated from (11.27). For example, at month 3, the number is
N_3 = \sum_{i=3}^{11}\left(n_i - \sum_{l=1}^{2} r_{il}\right) = [20{,}806 - (1 + 3)] + [18{,}165 - (3 + 2)] + \cdots
Months in   Number of First Failures in Service Month      Total     Sales
Service, i   1   2   3   4   5   6   7   8   9  10  11    Failures   Volume
 1           2                                                  2     12,571
 2           2   0                                              2     13,057
 3           1   3   3                                          7     20,806
 4           3   2   5   4                                     14     18,165
 5           3   5   4   3   6                                 21     16,462
 6           1   3   5   3   7   4                             23     13,430
 7           1   1   3   5   4   3   5                         22     16,165
 8           2   0   1   2   5   4   5   6                     25     15,191
 9           0   2   1   2   2   3   5   4   4                 23     11,971
10           2   0   3   4   4   5   3   4   2   3             30      5,958
11           1   0   1   1   2   2   1   0   1   0   0          9      2,868
Total       18  16  26  24  30  21  19  14   7   3   0        178    146,645
The estimate of the hazard rate at month j is calculated from (11.26). For
example, the hazard rate at month 3 is estimated as
\hat{h}_3 = \frac{26}{120{,}961} = 0.000215 failures per month.
The estimates of the hazard rate for j = 1, 2, . . . , 11 are given in Table 11.6.
Then the cumulative hazard rate up to j months is computed using (11.28).
For example, the cumulative hazard rate up to three months is estimated as
\hat{H}_3 = \sum_{i=0}^{3}\hat{h}_i = 0 + 0.000123 + 0.000119 + 0.000215 = 0.000457.
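The hazard and cumulative hazard estimates can be reproduced with a short script. The sketch below uses the claims table above and assumes, as implied by the mean-usage expression in Example 11.5, a lognormal mileage accumulation distribution with log-mean 6.85 + ln(t) and log-standard deviation 0.72; it is illustrative, not the author's code.

```python
import numpy as np
from scipy.stats import norm

U0 = 36_000  # warranty mileage limit

# Sales volume n_i for cohorts with i = 1, ..., 11 months in service.
n = np.array([12571, 13057, 20806, 18165, 16462, 13430,
              16165, 15191, 11971, 5958, 2868])
# r[i, j] = first failures of cohort i+1 in service month j+1 (from the table above).
rows = [[2], [2, 0], [1, 3, 3], [3, 2, 5, 4], [3, 5, 4, 3, 6],
        [1, 3, 5, 3, 7, 4], [1, 1, 3, 5, 4, 3, 5], [2, 0, 1, 2, 5, 4, 5, 6],
        [0, 2, 1, 2, 2, 3, 5, 4, 4], [2, 0, 3, 4, 4, 5, 3, 4, 2, 3],
        [1, 0, 1, 1, 2, 2, 1, 0, 1, 0, 0]]
r = np.zeros((11, 11))
for i, row in enumerate(rows):
    r[i, :len(row)] = row

h = np.zeros(12)   # h[j] = hazard estimate for month j (index 0 unused)
H = np.zeros(12)
for j in range(1, 12):
    # N_j: units with at least j months in service, less their earlier failures (11.27).
    Nj = sum(n[i] - r[i, :j - 1].sum() for i in range(j - 1, 11))
    # Fraction of the population still within the mileage limit at month j (assumed model).
    p_under = norm.cdf((np.log(U0) - (6.85 + np.log(j))) / 0.72)
    h[j] = r[:, j - 1].sum() / (p_under * Nj)     # equation (11.26)
    H[j] = H[j - 1] + h[j]                        # equation (11.28)

print(f"h_3 = {h[3]:.6f}, H_3 = {H[3]:.6f}")      # approximately 0.000215 and 0.000457
```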
The pdf will be used in Example 11.6. The reliability of the mechanical assembly
at the end of the warranty time (36 months) is estimated as
\hat{R}(36) = \exp\left[-\left(\frac{36}{645.9}\right)^{1.415}\right] = 0.9833.
FIGURE 11.7 Weibull fit to estimates of the cumulative hazard rate (ln[H(t)] plotted against ln(t))
FIGURE 11.8 Usage to failure versus mean usage at various times
FIGURE 11.9 Mileage-to-failure data at different months (miles versus months in service, with the warranty mileage limit and the mean usage shown)
When k > t2 , the warranty dropout fraction is no longer negligible, and esti-
mation of fM|T (m) should take into account all failed products, including those
exceeding the warranty usage limit. The estimation is relatively complicated but
can be done by using the sequential regression method described earlier for
modeling usage accumulation. When applying this method, the usage data are
replaced by the usage-to-failure data, and the calculation process remains the
same. The analysis yields µm (t) and σm (t), in contrast to µu (t) and σu (t) from
the usage accumulation modeling.
Example 11.5 Refer to Examples 11.3 and 11.4. When the sport utility vehicles failed and warranty repairs were claimed, the mileages and times at failure were recorded. Figure 11.9 plots the mileages to failure at each month in service up
to 10 months. Table 11.6 shows the number of claims in each month. Estimate
the pdf of the mileage to failure M conditional on the month in service T .
SOLUTION Because the vehicles had only 11 months in service at the time of
data analysis, the warranty dropout fraction is negligible. It is reasonable to con-
sider that Figure 11.9 includes all failed vehicles. To detect the tendency of failure
occurrence, we plot the mean usage of the vehicle population in Figure 11.9.
Here the mean usage at month t is exp[6.85 + ln(t) + 0.5 × 0.72²] = 1223.2t.
Figure 11.9 shows that the failures tended to occur on high mileage accumula-
tors. Hence, fM|T (m) is not equal to fU |T (u). In this case, fM|T (m) is calculated
directly from the warranty data.
Since there are no warranty repairs in month 11, and only 3 repairs in month
10, the calculation of fM|T (m) uses the data for the first nine months. The
mileage-to-failure data in each of the nine months are adequately fitted by the
Weibull distribution, as shown in Figure 11.10. The Weibull fits are approximately parallel, indicating a shape parameter common to all months. The maximum
likelihood estimates of the Weibull characteristic life α̂m for the nine months are
calculated and plotted versus time in Figure 11.11. The plot suggests a linear
relationship: αm = bt, where b is a slope. The slope can be estimated using the
FIGURE 11.10 Weibull fits to the mileage-to-failure data for months 1 through 9 in service
FIGURE 11.11 Estimates of the Weibull characteristic life α̂m (miles) at different months
least squares method. Here the maximum likelihood method is used to get better
estimates. From (7.59), the log likelihood function can be written as
L(b, \beta) = \sum_{j=1}^{9}\sum_{i=1}^{r_{\cdot j}}\left[\ln(\beta) - \beta\ln(bt_j) + (\beta - 1)\ln(m_{ij}) - \left(\frac{m_{ij}}{bt_j}\right)^{\beta}\right],
where β is the common shape parameter, r·j the total number of first failures at
month j (see Table 11.6 for values), mij the ith mileage to failure at month j ,
and tj = j . Substituting the data into the equation above and then maximizing
the log likelihood gives β̂ = 2.269 and b̂ = 2337. Hence, the conditional pdf of
M at a given month T is

f_{M|T}(m) = \frac{m^{1.269}}{1.939\times 10^{7}\, t^{2.269}}\exp\left[-\left(\frac{m}{2337\,t}\right)^{2.269}\right].
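The joint maximization over b and β can be carried out numerically. The sketch below illustrates the approach with made-up mileage-to-failure data (the data behind Figure 11.10 are not reproduced here), so the resulting estimates will not match β̂ = 2.269 and b̂ = 2337; the data-generation step and variable names are our own.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
# Hypothetical mileage-to-failure data: m_by_month[j] holds the m_ij for month j+1.
m_by_month = [rng.weibull(2.3, size=20) * 2300 * (j + 1) for j in range(9)]

def neg_log_lik(params):
    """Negative of the log likelihood L(b, beta) with alpha_j = b * t_j."""
    log_b, log_beta = params            # optimize on the log scale to keep b, beta > 0
    b, beta = np.exp(log_b), np.exp(log_beta)
    nll = 0.0
    for j, m in enumerate(m_by_month, start=1):
        m = np.asarray(m)
        alpha = b * j
        nll -= np.sum(np.log(beta) - beta * np.log(alpha)
                      + (beta - 1.0) * np.log(m) - (m / alpha) ** beta)
    return nll

res = minimize(neg_log_lik, x0=[np.log(2000.0), np.log(2.0)], method="Nelder-Mead")
b_hat, beta_hat = np.exp(res.x)
print(f"b_hat = {b_hat:.0f}, beta_hat = {beta_hat:.3f}")
```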
The number of first failures occurring within the warranty time limit t0 but beyond the usage limit u0 is

N_2(u_0, t_0) = \Pr(M > u_0, T \le t_0)\sum_{i=1}^{k} n_i,   (11.30)
where the probability is computed from (11.4). These failures occur in region II
of Figure 11.3 and are not reimbursed by the manufacturers.
The number of first failures occurring outside the warranty time limit t0 and
within the usage limit u0 is
N_3(u_0, t_0) = \Pr(M \le u_0, T > t_0)\sum_{i=1}^{k} n_i,   (11.31)
where the probability is calculated from (11.5). These failures fall in region III
of Figure 11.3 and are not eligible for warranty coverage.
The number of first failures occurring outside both the warranty usage and
time limits (u0 and t0 ) is
N_4(u_0, t_0) = \Pr(M > u_0, T > t_0)\sum_{i=1}^{k} n_i,   (11.32)
where the probability equals R(u0 , t0 ) and is calculated from (11.3). These fail-
ures occur in region IV of Figure 11.3, where both the usage and time exceed
the warranty limits.
Example 11.6 For the mechanical assembly of the sport utility vehicles dealt
with in Examples 11.4 and 11.5, estimate the two-dimensional reliability at the
warranty mileage and time limits (u0 = 36,000 miles and t0 = 36 months), and
calculate N1 (u0 , t0 ), N2 (u0 , t0 ), N3 (u0 , t0 ), and N4 (u0 , t0 ).
SOLUTION For the mechanical assembly, Examples 11.4 and 11.5 have cal-
culated fT (t) and fM|T (m), respectively. From (11.10) the joint pdf can be
written as
f_{M,T}(m, t) = 7.706\times 10^{-12}\,\frac{m^{1.269}}{t^{1.854}}\exp\left[-\left(\frac{m}{2337\,t}\right)^{2.269} - \left(\frac{t}{645.9}\right)^{1.415}\right].
The two-dimensional reliability at the warranty mileage and time limits is
obtained from (11.3) as
R(36{,}000, 36) = \int_{36}^{\infty}\!\int_{36{,}000}^{\infty} f_{M,T}(m, t)\,dm\,dt.
Numerical integration of the equation above gives R(36,000, 36) = 0.9809. From
Example 11.4 the total number of vehicles is 146,645. Then the number of vehi-
cles surviving 36 months and 36,000 miles is given by (11.32) as N4 (36,000, 36)
= 0.9809 × 146,645 = 143,850.
Similarly, the probability of failure in region I is FM,T (36,000, 36) = 0.00792.
Then the number of first failures under warranty coverage is N1 (36,000, 36) =
0.00792 × 146,645 = 1162.
The probability of failure in region II is estimated as Pr(M > 36,000,
T ≤ 36) = 0.00884. Then the number of first failures occurring in 36 months
and beyond 36,000 miles is N2 (36,000, 36) = 0.00884 × 146,645 = 1297.
The probability of failure in region III is estimated as Pr(M ≤ 36,000,
T > 36) = 0.00342. This gives the number of first failures occurring beyond 36
months and within 36,000 miles as N3 (36,000, 36) = 0.00342 × 146,645 = 502.
From (11.8) the estimate of the marginal reliability of T at 36 months is
RT (36) = 1 − FT (36) = 1 − 0.00792 − 0.00884 = 0.9832.
It is approximately equal to that in Example 11.4.
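The numerical integration is straightforward with standard tools. In the sketch below the inner integral over m is evaluated in closed form (the conditional distribution of M given T is Weibull), which reduces the reliability and the region probabilities to one-dimensional integrals; this is an illustrative calculation, not the author's code.

```python
import numpy as np
from scipy.integrate import quad

BETA_T, ALPHA_T = 1.415, 645.9      # marginal Weibull of time to failure (months)
BETA_M, B = 2.269, 2337.0           # conditional Weibull of mileage, alpha_m = B * t
U0, T0 = 36_000.0, 36.0             # warranty mileage and time limits

def f_T(t):                          # marginal pdf of T
    return (BETA_T / ALPHA_T) * (t / ALPHA_T) ** (BETA_T - 1) * np.exp(-(t / ALPHA_T) ** BETA_T)

def S_M_given_T(m, t):               # Pr(M > m | T = t), Weibull survival function
    return np.exp(-(m / (B * t)) ** BETA_M)

# Region probabilities of Figure 11.3 evaluated at (u0, t0).
R, _ = quad(lambda t: S_M_given_T(U0, t) * f_T(t), T0, np.inf)            # region IV
p2, _ = quad(lambda t: S_M_given_T(U0, t) * f_T(t), 0.0, T0)              # region II
p3, _ = quad(lambda t: (1 - S_M_given_T(U0, t)) * f_T(t), T0, np.inf)     # region III
p1 = 1.0 - R - p2 - p3                                                    # region I

n_total = 146_645
print(f"R(36000, 36) = {R:.4f}")                                          # about 0.981
print(f"N1..N4 = {[round(p * n_total) for p in (p1, p2, p3, R)]}")
```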
So far we have assumed that second and subsequent failures in the warranty period are negligible. This situation arises when the product in question is highly reliable or the warranty period is
relatively short compared to the product mean life. In practice, however, a prod-
uct may generate multiple failures from the same problem within the warranty period, and thus the approximation underestimates the true number of warranty repairs. In this section we consider the good-as-new repair, the same-as-old repair, and the generalized renewal process, and present warranty repair models that allow for the possibility of multiple failures within the warranty period.
For good-as-new repairs, the expected number of repairs within the warranty period t0 is given by the renewal function

W(t_0) = F(t_0) + \int_0^{t_0} W(t_0 - x)\,f(x)\,dx,   (11.33)

where F(t) and f(t) are the cdf and pdf of the product life, respectively. The cdf and pdf can be estimated from accelerated life test or warranty data; the estimation from warranty data was presented in Section 11.4. Equation (11.33) was discussed in Chapter 10; it can be solved using the recursive algorithm described by (10.20).
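The recursive algorithm itself is not reproduced here, but the idea can be sketched with a simple discretization of the renewal equation: divide the warranty period into small steps and build up W(t) step by step. The function below is our own discretization, not the book's (10.20); the washing machine parameters used to exercise it come from Example 11.8.

```python
import numpy as np
from scipy.stats import lognorm

def renewal_function(F, f, t0, n_steps=500):
    """Approximate W(t0) from W(t) = F(t) + int_0^t W(t - x) f(x) dx
    using a simple Riemann-sum recursion on a uniform grid."""
    h = t0 / n_steps
    grid = np.arange(0, n_steps + 1) * h
    f_vals = f(grid)                        # pdf evaluated once on the grid
    W = np.zeros(n_steps + 1)
    for k in range(1, n_steps + 1):
        # Convolution term: sum_{j=1}^{k} f(j*h) * W((k-j)*h) * h
        conv = h * np.dot(f_vals[1:k + 1], W[k - 1::-1])
        W[k] = F(grid[k]) + conv
    return W[-1]

# Washing machine life: lognormal with scale 6.03 and shape 1.63 (Example 11.8).
dist = lognorm(s=1.63, scale=np.exp(6.03))
W12 = renewal_function(dist.cdf, dist.pdf, t0=12.0)
print(f"W(12) ~ {W12:.4f}")   # close to F(12) = 0.0148 because failures are rare
```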
When the probability of failure within the warranty period is small, the expected number of repairs can be approximated by W(t_0) ≈ F(t_0).
Example 11.8 For the washing machines studied in Example 11.2, calculate the
expected number of repairs during a warranty period of 12 months for a volume
of 22,167 units sold.
SOLUTION Example 11.2 showed that the life of the washing machines can
be adequately fitted with a lognormal distribution with scale parameter 6.03 and
shape parameter 1.63, and the probability of failure at the end of the warranty period (12 months) is F̂(12) = 0.0148. From (11.34), the expected number of repairs
within the warranty period is W (12) = ln[1/(1 − 0.0148)] = 0.0149. Note that
the value of W (12) is approximately equal to F̂ (12) = 0.0148, which resulted
from the reliability analysis. The expected number of repairs for a volume of
22,167 units is 0.0149 × 22,167 = 331.
In practice, most repairs restore a product to a condition somewhere between the good-as-new and same-as-old cases. Kijima and Sumita (1986) and Kijima (1989) propose a generalized renewal process which treats these repair strategies
as special cases. In this subsection we describe briefly the generalized renewal
process.
Let Vi and Si denote, respectively, the virtual age and real age of a product
immediately after the ith repair. Here the real age is the elapsed time since a
product is put in operation and the virtual age is a fraction of the real age and
reflects the condition of a product after a repair. The relationship between the
virtual age and real age can be expressed as
Vi = qSi , (11.35)
where q is the restoration factor of the ith repair and measures the effectiveness of
the repair. If q = 0, the virtual age right after the ith repair is zero, meaning that
the product is restored to the new condition. Thus, q = 0 corresponds to a good-
as-new repair. If q = 1, the virtual age immediately after the ith repair is equal to
the real age, indicating that the product is restored to the same condition as right
before failure. Thus, q = 1 represents a same-as-old repair. If 0 < q < 1, the
virtual age is between zero and the real age, and thus the repair is a better-than-
old-but-worse-than-new repair. In addition, if q > 1, the virtual age is greater
than the real age. In this case, the product is damaged by the repair to a higher
degree than it was right before the respective failure. Such a repair is often called
a worse-than-old repair.
By using the generalized renewal process, the expected number of repairs
W (t0 ) within the warranty period t0 can be written as
W(t_0) = \int_0^{t_0}\left[g(\tau \mid 0) + \int_0^{\tau} w(x)\,g(\tau - x \mid x)\,dx\right]d\tau,   (11.36)

where

g(t \mid x) = \frac{f(t + qx)}{1 - F(qx)}, \quad t, x \ge 0; \qquad w(x) = \frac{dW(x)}{dx};
and f (·) and F (·) are the pdf and cdf of the time to first failure distribution.
Note that g(t|0) = f (t). Equation (11.36) contains distribution parameters and
q, which must be estimated in order to evaluate W (t0 ). Kaminskiy and Krivtsov
(1998) provide a nonlinear least squares technique for estimating the parameters
and a Monte Carlo simulation method for calculating (11.36). Yanez et al. (2002)
and Mettas and Zhao (2005) give the maximum likelihood estimates. Kaminskiy
and Krivtsov (2000) present an application of the generalized renewal process to
warranty repair prediction.
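A Monte Carlo evaluation of W(t0) under this model is straightforward: simulate failure histories by sampling the time to the next failure from the conditional distribution given the current virtual age, and average the number of failures falling inside the warranty period. The sketch below does this for a Weibull time-to-first-failure distribution with Vi = qSi; the Weibull parameters are hypothetical, and this is an illustration of the idea rather than the estimation procedures of the references cited.

```python
import numpy as np

def expected_repairs_grp(alpha, beta, q, t0, n_sim=20_000, seed=0):
    """Monte Carlo estimate of W(t0) for a generalized renewal process with
    Weibull(alpha, beta) time to first failure and restoration factor q (Vi = q*Si)."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(n_sim):
        s = 0.0   # real age
        v = 0.0   # virtual age after the last repair
        while True:
            u = rng.random()
            # Time to next failure from Pr(X > x) = S(v + x)/S(v), S(t) = exp[-(t/alpha)^beta],
            # obtained by inverse-transform sampling.
            x = alpha * ((v / alpha) ** beta - np.log(u)) ** (1.0 / beta) - v
            s += x
            if s > t0:
                break
            total += 1
            v = q * s          # Vi = q * Si, equation (11.35)
    return total / n_sim

# q = 0 reproduces a renewal (good-as-new) process; q = 1 a same-as-old process.
for q in (0.0, 0.5, 1.0):
    print(q, expected_repairs_grp(alpha=50.0, beta=1.5, q=q, t0=36.0))
```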
Example 11.9 For the washing machines studied in Examples 11.2 and 11.7,
calculate the expected warranty cost for a volume sold of 22,167 units. The
average cost per repair is $155.
Under a pro-rata replacement policy, the warranty cost per unit to the manufacturer decreases linearly with the age at failure and can be written as

C_w(t) = \begin{cases} c_p\,\dfrac{t_0 - t}{t_0}, & 0 \le t \le t_0, \\[4pt] 0, & t > t_0, \end{cases}

where t is the life of the product and t0 is the warranty period. C_w(t) is a function of life and thus is a random variable. Since C_w(t) = 0 when t > t_0, the expected cost is

E[C_w(t)] = \int_0^{\infty} C_w(t)\,dF(t) = c_p\left[F(t_0) - \frac{\mu(t_0)}{t_0}\right],   (11.39)
where F(t) is the cdf of the product life and µ(t0 ) is the partial expectation of
life over the warranty period, given by
\mu(t_0) = \int_0^{t_0} t\,dF(t).   (11.40)
That is, the manufacturer will pay an expected warranty cost of $31.89 for each
unit it sells. Note that the cost is an approximation because of the variation in
customer usage rate.
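Equations (11.39) and (11.40) are easy to evaluate numerically for any fitted life distribution. The following sketch does so for a hypothetical Weibull life; since the parameters of the example above are not reproduced here, the number it prints is not intended to match $31.89.

```python
import numpy as np
from scipy.integrate import quad

def prorata_expected_cost(cp, t0, cdf, pdf):
    """E[Cw] = cp * [F(t0) - mu(t0)/t0], with mu(t0) = int_0^t0 t f(t) dt  (11.39, 11.40)."""
    mu_t0, _ = quad(lambda t: t * pdf(t), 0.0, t0)
    return cp * (cdf(t0) - mu_t0 / t0)

# Hypothetical case: price $100, 24-month pro-rata warranty, Weibull(beta = 1.5, alpha = 120).
beta, alpha = 1.5, 120.0
cdf = lambda t: 1 - np.exp(-(t / alpha) ** beta)
pdf = lambda t: (beta / alpha) * (t / alpha) ** (beta - 1) * np.exp(-(t / alpha) ** beta)
print(f"Expected pro-rata cost per unit: ${prorata_expected_cost(100.0, 24.0, cdf, pdf):.2f}")
```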
FIGURE 11.12 Warranty cost of a unit under a combination policy (Cw(t) versus t; the cost equals cp up to t1 and decreases linearly to zero at t0)
Suppose that the warranty cost to the manufacturer is constant during the free-replacement period and then decreases linearly as a function of age once the pro-rata policy is invoked. Then the warranty cost per unit to the manufacturer can be written as

C_w(t) = \begin{cases} c_p, & 0 \le t \le t_1, \\[2pt] \dfrac{c_p(t_0 - t)}{t_0 - t_1}, & t_1 < t \le t_0, \\[2pt] 0, & t > t_0, \end{cases}   (11.41)
where t is the life of the product. Cw (t) is shown graphically in Figure 11.12.
Cw (t) is a function of life and thus is a random variable. The expectation of
the cost is given by
E[C_w(t)] = \int_0^{\infty} C_w(t)\,dF(t) = c_p\int_0^{t_1} dF(t) + \int_{t_1}^{t_0}\frac{c_p(t_0 - t)}{t_0 - t_1}\,dF(t)
          = \frac{c_p}{t_0 - t_1}\left[t_0 F(t_0) - t_1 F(t_1) - \mu(t_0) + \mu(t_1)\right],   (11.42)
where F(t) is the cdf of the product life and µ(t0 ) and µ(t1 ) are the partial
expectations of life over t0 and t1 as defined in (11.40).
Example 11.11 A type of passenger car battery is sold at $125 a unit with a
combination warranty policy, which offers free replacement in the first 18 months
after initial purchase, followed by a linear proration for an additional 65 months.
The life of the battery is modeled by a Weibull distribution with shape parameter
1.71 and characteristic life 235 months. Estimate the expected warranty cost per
unit to the manufacturer of the battery.
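As a sketch of the solution approach (not the book's worked solution), (11.42) can be evaluated directly for the battery parameters, with the partial expectations computed by numerical integration as in (11.40).

```python
import numpy as np
from scipy.integrate import quad

cp, t1, t0 = 125.0, 18.0, 18.0 + 65.0    # price, free-replacement limit, total limit (months)
beta, alpha = 1.71, 235.0                # Weibull life of the battery

F = lambda t: 1 - np.exp(-(t / alpha) ** beta)
f = lambda t: (beta / alpha) * (t / alpha) ** (beta - 1) * np.exp(-(t / alpha) ** beta)
mu = lambda t: quad(lambda x: x * f(x), 0.0, t)[0]     # partial expectation (11.40)

# Expected warranty cost per unit under the combination policy, equation (11.42).
E_cost = cp / (t0 - t1) * (t0 * F(t0) - t1 * F(t1) - mu(t0) + mu(t1))
print(f"Expected warranty cost per unit: ${E_cost:.2f}")
```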
The number of failures occurring in the first time period in service can be considered to follow a binomial distribution. Thus, a p-chart is appropriate for controlling the probability of failure.
FIGURE 11.13 p-chart (p̂i plotted against production period i, showing the center line, UCL, and LCL)
where UCL stands for upper control limit and LCL for lower control limit.
Figure 11.13 illustrates the concept of the control chart. It is worth noting that
the control limits are variable and depend on the volume of each production
period. The control limits become constant and form two straight lines when the
production volumes are equal.
The actual operation of the control chart consists of computing the probability
of failure p̂i from (11.43) and the corresponding control limits from (11.45) for
subsequent production periods, and plotting p̂i and the control limits on the chart.
As long as p̂i remains within the control limits and the sequence of the plotted
points does not display any systematic nonrandom behavior, we can conclude
that the infant mortality does not change significantly and the production process
is under control. If p̂i stays outside the control limits, or if the plotted points
develop a nonrandom trend, we can conclude that infant mortality has drifted
significantly and the process is out of control. In the latter case, investigation
should be initiated to determine the assignable causes.
The production volume and number of failures in the first month in service for each production month are shown in Table 11.7. Develop a p-chart for controlling the
probability of failure. In production month 11, 10,325 heaters were manufactured
and 58 units failed in the first month. Determine whether the production process
was under control in month 11.
SOLUTION From (11.44), the average probability of failure in the first month
in service over 10 months of production is
\bar{p} = \frac{36 + 32 + \cdots + 96}{9636 + 9903 + \cdots + 18{,}342} = 0.00355.
Thus, the centerline of the p-chart is 0.00355.
The control limits are calculated from (11.45). For example, the LCL and
UCL for production month 1 are
LCL = 0.00355 - 3\sqrt{0.00355(1 - 0.00355)/9636} = 0.00173,
UCL = 0.00355 + 3\sqrt{0.00355(1 - 0.00355)/9636} = 0.00537.
The control limits for the 10 production months are calculated in a similar way
and summarized in the “With Month 10 Data” columns of Table 11.7.
The probability of failure in the first month in service for each production
month is computed using (11.43). For example, for production month 1, we have
p̂1 = 36/9636 = 0.00374. The estimates of the probability of failure for the 10
production months are shown in Table 11.7.
The control limits, centerline, and p̂i (i = 1, 2, . . ., 10) are plotted in
Figure 11.14. It is seen that p̂10 exceeds the corresponding UCL, indicating that
the process was out of control in month 10. As such, we should exclude the
month 10 data and revise the control chart accordingly. The new centerline is p̄ = 0.00337. The control limits are recalculated and shown in the “Without Month 10
Data” columns of Table 11.7. The control chart is plotted in Figure 11.15. On this
chart, the control limits for month 10 are LCL = 0.00209 and UCL = 0.00466.
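The limit calculation lends itself to a short script. The sketch below reproduces the month 1 limits on the original chart and then checks production month 11 against the revised center line of 0.00337; the month-11 conclusion is our own illustrative calculation, not part of the solution above.

```python
import numpy as np

def p_chart_limits(p_bar, n):
    """Variable three-sigma limits of a p-chart: p_bar +/- 3*sqrt(p_bar*(1 - p_bar)/n)."""
    half_width = 3.0 * np.sqrt(p_bar * (1.0 - p_bar) / n)
    return max(p_bar - half_width, 0.0), p_bar + half_width

# Limits for production month 1 on the original chart (center line 0.00355, volume 9636).
print(p_chart_limits(0.00355, 9636))          # approximately (0.00173, 0.00537)

# Check production month 11 against the revised chart (center line 0.00337);
# the fraction failing in the first month is 58 / 10,325.
p_bar = 0.00337
lcl, ucl = p_chart_limits(p_bar, 10_325)
p11 = 58 / 10_325
print(f"p_11 = {p11:.5f}, limits = ({lcl:.5f}, {ucl:.5f})",
      "-> out of control" if not (lcl <= p11 <= ucl) else "-> in control")
```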
FIGURE 11.14 Control chart including month 10 data
FIGURE 11.15 Revised control chart for p̂i
PROBLEMS
11.1 Describe the purposes of warranty analysis. How can warranty analysis
help reduce the life cycle cost of a product?
11.2 What are the elements of a warranty policy? Explain the following war-
ranty policies:
(a) Free replacement.
(b) Pro-rata replacement.
(c) Combination free and pro-rata replacement.
(d) Renewing.
(e) Nonrenewing.
(f) Two-dimensional.
11.3 For a warranty repair, what types of data should be recorded in a warranty
database? Discuss the use of the data and describe the steps for warranty
data mining.
11.4 Explain the limitations of warranty data and how they affect the estimation
of product reliability and warranty cost.
11.5 Refer to Example 11.2. If the washing machine warranty data were ana-
lyzed two months earlier, what would be the reliability estimate and
number of upcoming failures? Compare the results with those in the
example.
TABLE 11.8  Sales Volumes and Numbers of First Failures
Months in   Number of First Failures in Service Month            Sales
Service, i   1   2   3   4   5   6   7   8   9  10  11  12      Volume
 1           1                                                      836
 2           1   2                                                 2063
 3           2   0   1                                             2328
 4           1   2   2   1                                         2677
 5           2   1   2   1   2                                     3367
 6           1   2   3   2   2   1                                 3541
 7           2   2   3   3   2   2   3                             3936
 8           1   2   1   2   1   3   2   2                         3693
 9           0   2   2   1   0   2   3   1   2                     2838
10           2   3   2   1   2   1   2   1   2   2                 2362
11           1   1   2   1   0   1   2   1   2   1   1             2056
12           1   1   0   2   1   1   0   1   0   1   1   1         1876
11.6 Refer to Example 11.4. Suppose that the warranty dropout due to the
accumulated mileage exceeding 36,000 miles is negligible in the first 11
months. Estimate the life distribution of the mechanical assembly, and
calculate the reliability at the end of the warranty period. Compare the
results with those in the example.
11.7 An electronic module installed in a luxury car is warranted for 48 months
and 48,000 miles, whichever comes first. The mileage accumulation rate of
the car can be modeled with a lognormal distribution with scale parameter
7.37 and shape parameter 1.13. The vehicles have a maximum time in
service (also called the maturity) of 12 months, during which the repeat
repairs are negligible. The sales volumes and failure data are shown in
Table 11.8. Suppose that the mileage to failure distribution each month is
the same as the usage distribution that month.
(a) Calculate the fractions of warranty dropout at the end of 12 and 48
months, respectively.
(b) Determine the month at which the fraction of warranty dropout is
10%.
(c) What would be the warranty mileage limit if the manufacturer wanted
50% of the vehicles to be out of warranty coverage at the end of 48
months?
(d) Calculate the hazard rate estimates ĥj for j = 1, 2, . . . , 12.
(e) Compute the cumulative hazard rate estimates Ĥj for j = 1, 2, . . . , 12.
(f) Estimate the marginal life distribution of the electronic module.
(g) Estimate the reliability and number of first failures by the end of 48
months.
(h) Write down the joint pdf of the time and mileage to failure.
TABLE 11.9  Production Volumes and Numbers of Failures in the First Month in Service
Production month       1     2     3     4     5     6     7     8     9    10    11    12
Volume              6963  7316  7216  7753  8342  8515  8047  8623  8806  8628  8236  7837
Number of failures    15    18    12    17    13    15    19    26    16    14    21    14
11.8 A manufacturer monitors the warranty data of the first month in service to detect unusual infant mortality. The production volume and number of failures in the first month in service for each monthly production period are shown in Table 11.9.
(a) Develop a p-chart.
(b) Make comments on the control limits.
(c) Can the variable control limits be approximated by two straight lin-
es? How?
(d) In production month 13, 7638 units were made and 13 of them failed
in the first month in service. Determine whether the process was under
control in that month.
APPENDIX ORTHOGONAL ARRAYS, LINEAR GRAPHS, AND INTERACTION TABLES*

*The material in this appendix is reproduced with permission from Dr. Genichi Taguchi with assistance from the American Supplier Institute, Inc. More orthogonal arrays, linear graphs, and interaction tables may be found in Taguchi et al. (1987, 2005).
REFERENCES
Adams, V., and Askenazi, A. (1998), Building Better Products with Finite Element Anal-
ysis, OnWord Press, Santa Fe, NM.
Aggarwal, K. K. (1993), Reliability Engineering, Kluwer Academic, Norwell, MA.
AGREE (1957), Reliability of Military Electronic Equipment, Office of the Assistant Secre-
tary of Defense Research and Engineering, Advisory Group of Reliability of Electronic
Equipment, Washington, DC.
Ahmad, M., and Sheikh, A. K. (1984), Bernstein reliability model: derivation and esti-
mation of parameters, Reliability Engineering, vol. 8, no. 3, pp. 131–148.
Akao, Y. (1990), Quality Function Deployment, Productivity Press, Cambridge, MA.
Allmen, C. R., and Lu, M. W. (1994), A reduced sampling approach for reliability veri-
fication, Quality and Reliability Engineering International, vol. 10, no. 1, pp. 71–77.
Al-Shareef, H., and Dimos, D. (1996), Accelerated life-time testing and resistance degra-
dation of thin-film decoupling capacitors, Proc. IEEE International Symposium on
Applications of Ferroelectrics, vol. 1, pp. 421–425.
Amagai, M. (1999), Chip scale package (CSP) solder joint reliability and modeling, Micro-
electronics Reliability, vol. 39, no. 4, pp. 463–477.
Ames, A. E., Mattucci, N., MacDonald, S., Szonyi, G., and Hawkins, D. M. (1997),
Quality loss functions for optimization across multiple response surfaces, Journal of
Quality Technology, vol. 29, no. 3, pp. 339–346.
ANSI/ASQ (2003a), Sampling Procedures and Tables for Inspection by Attributes,
ANSI/ASQ Z1.4–2003, American Society for Quality, Milwaukee, WI, www.asq.org.
— — —(2003b), Sampling Procedures and Tables for Inspection by Variables for Percent
Nonconforming, ANSI/ASQ Z1.9–2003, American Society for Quality, Milwaukee,
WI, www.asq.org.
Armacost, R. L., Componation, P. J., and Swart, W. W. (1994), An AHP framework for
prioritizing customer requirement in QFD: an industrialized housing application, IIE
Transactions, vol. 26, no. 4, pp. 72–78.
Bai, D. S., and Yun, H. J. (1996), Accelerated life tests for products of unequal size, IEEE
Transactions on Reliability, vol. 45, no. 4, pp. 611–618.
Bain, L. J., and Engelhardt, M. (1991), Statistical Analysis of Reliability and Life-Testing
Models: Theory and Methods, (2nd ed.), Marcel Dekker, New York.
Barlow, R. E., and Proschan, F. (1974), Importance of System Components and Fault Tree
Analysis, ORC-74-3, Operations Research Center, University of California,
Berkeley, CA.
Basaran, C., Tang, H., and Nie, S. (2004), Experimental damage mechanics of micro-
electronic solder joints under fatigue loading, Mechanics of Materials, vol. 36, no. 11,
pp. 1111–1121.
Baxter, L., Scheuer, E., McConaloguo, D., and Blischke, W. (1982), On the tabulation of
the renewal function, Technometrics, vol. 24, no. 2, pp. 151–156.
Bazaraa, M. S., Sherali, H. D., and Shetty, C. M. (1993), Nonlinear Programming: Theory
and Algorithms, (2nd ed.), Wiley, Hoboken, NJ.
Becker, B., and Ruth, P. P. (1998), Highly accelerated life testing for the 1210 dig-
ital ruggedized display, Proc. SPIE, International Society for Optical Engineering,
pp. 337–345.
Bertsekas, D. P. (1996), Constrained Optimization and Lagrange Multiplier Methods,
Academic Press, San Diego, CA.
Bhushan, B. (2002), Introduction to Tribology, Wiley, Hoboken, NJ.
Birnbaum, Z. W. (1969), On the importance of different components in a multicomponent
system, in Multivariate Analysis—II, P. R. Krishnaiah, Ed., Academic Press, New
York, pp. 581–592.
Black, J. R. (1969), Electromigration: a brief survey and some recent results, IEEE Trans-
actions on Electronic Devices, vol. ED-16, no. 4, pp. 338–347.
Blischke, W. R., and Murthy, D. N. P. (1994), Warranty Cost Analysis, Marcel Dekker,
New York.
— — — Eds. (1996), Product Warranty Handbook, Marcel Dekker, New York.
— — —(2000), Reliability: Modeling, Prediction, and Optimization, Wiley, Hoboken, NJ.
Boland, P. J., and El-Neweihi, E. (1995), Measures of component importance in reliability
theory, Computers and Operations Research, vol. 22, no. 4, pp. 455–463.
Bossert, J. L. (1991), Quality Function Deployment, ASQC Quality Press, Milwaukee, WI.
Bouissou, M. (1996), An ordering heuristic for building a decision diagram from a fault-
tree, Proc. Reliability and Maintainability Symposium, pp. 208–214.
Boulanger, M., and Escobar, L. A. (1994), Experimental design for a class of accelerated
degradation tests, Technometrics, vol. 36, no. 3, pp. 260–272.
Bowles, J. B. (2003), An assessment of RPN prioritization in failure modes effects and
criticality analysis, Proc. Reliability and Maintainability Symposium, pp. 380–386.
Bowles, J. B., and Pelaez, C. E. (1995), Fuzzy logic prioritization of failures in a system
failure mode, effects and criticality analysis, Reliability Engineering and System Safety,
vol. 50, no. 2, pp. 203–213.
Box, G. (1988), Signal-to-noise ratios, performance criteria, and transformation, Techno-
metrics, vol. 30, no. 1, pp. 1–17.
Brooks, A. S. (1974), The Weibull distribution: effect of length and conductor size of test
cables, Electra, vol. 33, pp. 49–61.
Broussely, M., Herreyre, S., Biensan, P., Kasztejna, P., Nechev, K., and Staniewicz, R.
J. (2001), Aging mechanism in Li ion cells and calendar life predictions, Journal of
Power Sources, vol. 97-98, pp. 13–21.
Bruce, G., and Launsby, R. G. (2003), Design for Six Sigma, McGraw-Hill, New York.
Carot, V., and Sanz, J. (2000), Criticality and sensitivity analysis of the components of a
system, Reliability Engineering and System Safety, vol. 68, no. 2, pp. 147–152.
Chan, H. A., and Englert, P. J., Eds. (2001), Accelerated Stress Testing Handbook: Guide
for Achieving Quality Products, IEEE Press, Piscataway, NJ.
Chao, A., and Hwang, L. C. (1987), A modified Monte Carlo technique for confidence
limits of system reliability using pass–fail data, IEEE Transactions on Reliability, vol.
R-36, no. 1, pp. 109–112.
Chao, L. P., and Ishii, K. (2003), Design process error-proofing: failure modes and effects
analysis of the design process, Proc. ASME Design Engineering Technical Conference,
vol. 3, pp. 127–136.
Chen, M. R. (2001), Robust design for VLSI process and device, Proc. 6th International
Workshop on Statistical Metrology, pp. 7–16.
Chi, D. H., and Kuo, W. (1989), Burn-in optimization under reliability and capacity
restrictions, IEEE Transactions on Reliability, vol. 38, no. 2, pp. 193–198.
Chiao, C. H., and Hamada, M. (2001), Analyzing experiments with degradation data
for improving reliability and for achieving robust reliability, Quality and Reliability
Engineering International, vol. 17, no. 5, pp. 333–344.
Coffin, L. F., Jr. (1954), A study of the effects of cyclic thermal stresses on a ductile
metal, Transactions of ASME, vol. 76, no. 6, pp. 931–950.
Coit, D. W. (1997), System-reliability confidence-intervals for complex-systems with esti-
mated component-reliability, IEEE Transactions on Reliability, vol. 46, no. 4,
pp. 487–493.
Condra, L. W. (2001), Reliability Improvement with Design of Experiments, (2nd ed.),
Marcel Dekker, New York.
Cory, A. R. (2000), Improved reliability prediction through reduced-stress temperature
cycling, Proc. 38th IEEE International Reliability Physics Symposium, pp. 231–236.
Corzo, O., and Gomez, E. R. (2004), Optimization of osmotic dehydration of cantaloupe
using desired function methodology, Journal of Food Engineering, vol. 64, no. 2, pp.
213–219.
Cox, D. R. (1962), Renewal Theory, Wiley, Hoboken, NJ.
— — —(1972), Regression models and life tables (with discussion), Journal of the Royal
Statistical Society, ser. B, vol. 34, pp. 187–220.
Creveling, C. M. (1997), Tolerance Design: A Handbook for Developing Optimal Speci-
fications, Addison-Wesley, Reading, MA.
Croes, K., De Ceuninck, W., De Schepper, L., and Tielemans, L. (1998), Bimodal failure
behaviour of metal film resistors, Quality and Reliability Engineering International,
vol. 14, no. 2, pp. 87–90.
Crowder, M. J., Kimber, A. C., Smith, R. L., and Sweeting, T. J. (1991), Statistical
Analysis of Reliability Data, Chapman & Hall, London.
Dabbas, R. M., Fowler, J. W., Rollier, D. A., and McCarville, D. (2003), Multiple response
optimization using mixture-designed experiments and desirability functions in semi-
conductor scheduling, International Journal of Production Research, vol. 41, no. 5, pp.
939–961.
Davis, T. P. (1999), A simple method for estimating the joint failure time and failure
mileage distribution from automobile warranty data, Ford Technical Journal, vol. 2,
no. 6, Report FTJ-1999-0048.
Deely, J. J., and Keats, J. B. (1994), Bayes stopping rules for reliability testing with the
exponential distribution, IEEE Transactions on Reliability, vol. 43, no. 2, pp. 288–293.
Del Casttillo, E., Montgomery, D. C., and McCarville, D. R. (1996), Modified desirability
functions for multiple response optimization, Journal of Quality Technology, vol. 28,
no. 3, pp. 337–345.
Derringer, G., and Suich, R. (1980), Simultaneous optimization of several response vari-
ables, Journal of Quality Technology, vol. 12, no. 4, pp. 214–219.
Dhillon, B. S. (1999), Design Reliability: Application and Fundamentals, CRC Press,
Boca Raton, FL.
Dieter, G. E. (2000), Engineering Design: A Materials and Processing Approach, McGraw-
Hill, New York.
Dimaria, D. J., and Stathis, J. H. (1999), Non-Arrhenius temperature dependence of reli-
ability in ultrathin silicon dioxide films, Applied Physics Letters, vol. 74, no. 12, pp.
1752–1754.
Dugan, J. B. (2003), Fault-tree analysis of computer-based systems, tutorial at the Relia-
bility and Maintainability Symposium.
Eliashberg, J., Singpurwalla, N. D., and Wilson, S. P. (1997), Calculating the reserve for
a time and usage indexed warranty, Management Science, vol. 43, no. 7, pp. 966–975.
Elsayed, E. A. (1996), Reliability Engineering, Addison Wesley Longman, Reading, MA.
Ersland, P., Jen, H. R., and Yang, X. (2004), Lifetime acceleration model for HAST tests
of a pHEMT process, Microelectronics Reliability, vol. 44, no. 7, pp. 1039–1045.
Evans, R. A. (2000), Editorial: Populations and hazard rates, IEEE Transactions on Reli-
ability, vol. 49, no. 3, p. 250 (first published in May 1971).
Farnum, N. R., and Booth, P. (1997), Uniqueness of maximum likelihood estimators of
the 2-parameter Weibull distribution, IEEE Transactions on Reliability, vol. 46, no. 4,
pp. 523–525.
Feilat, E. A., Grzybowski, S., and Knight, P. (2000), Accelerated aging of high volt-
age encapsulated transformers for electronics applications, Proc. IEEE International
Conference on Properties and Applications of Dielectric Materials, vol. 1, pp. 209–212.
Fleetwood, D. M., Meisenheimer, T. L., and Scofield, J. H. (1994), 1/f noise and radiation
effects in MOS devices, IEEE Transactions on Electron Devices, vol. 41, no. 11, pp.
1953–1964.
Franceschini, F., and Galetto, M. (2001), A new approach for evaluation of risk priorities
of failure modes in FMEA, International Journal of Production Research, vol. 39, no.
13, pp. 2991–3002.
Fussell, J. B. (1975), How to hand-calculate system reliability and safety characteristics,
IEEE Transactions on Reliability, vol. R-24, no. 3, pp. 169–174.
Garg, A., and Kalagnanam, J. (1998), Approximation for the renewal function, IEEE
Transactions on Reliability, vol. 47, no. 1, pp. 66–72.
Gertsbakh, I. B. (1982), Confidence limits for highly reliable coherent systems with expo-
nentially distributed component life, Journal of the American Statistical Association,
vol. 77, no. 379, pp. 673–678.
— — —(1989), Statistical Reliability Theory, Marcel Dekker, New York.
Ghaffarian, R. (2000), Accelerated thermal cycling and failure mechanisms for BGA and
CSP assemblies, Transactions of ASME: Journal of Electronic Packaging, vol. 122,
no. 4, pp. 335–340.
Gillen, K. T., Bernstein, R., and Derzon, D. K. (2005), Evidence of non-Arrhenius behav-
ior from laboratory aging and 24-year field aging of polychloroprene rubber materials,
Polymer Degradation and Stability, vol. 87, no. 1, pp. 57–67.
Gitlow, H. S., and Levine, D. M. (2005), Six Sigma for Green Belts and Champions:
Foundations, DMAIC, Tools, Cases, and Certification, Pearson/Prentice Hall, Upper
Saddle River, NJ.
Gnedenko, B., Pavlov, I., and Ushakov, I. (1999), in Statistical Reliability Engineering,
Chakravarty, S., Ed., Wiley, Hoboken, NJ.
Goddard, P. L. (1993), Validating the safety of real time control systems using FMEA,
Proc. Reliability and Maintainability Symposium, pp. 227–230.
— — —(2000), Software FMEA techniques, Proc. Reliability and Maintainability Sym-
posium, pp. 118–123.
Guida, M., and Pulcini, G. (2002), Automotive reliability inference based on past data
and technical knowledge, Reliability Engineering and System Safety, vol. 76, no. 2, pp.
129–137.
Hallberg, O., and Peck, D. S. (1991), Recent humidity acceleration, a base for testing
standards, Quality and Reliability Engineering International, vol. 7, no. 3, pp. 169–180.
Han, J., and Kamber, M. (2000), Data Mining: Concepts and Techniques, Morgan Kauf-
mann, San Francisco, CA.
Harris, T. A. (2001), Rolling Bearing Analysis, (4th ed.), Wiley, Hoboken, NJ.
Harter, H. L., and Moore, A. H. (1976), An evaluation of exponential and Weibull test
plans, IEEE Transactions on Reliability, vol. R-25, no. 2, pp. 100–104.
Hauck, D. J., and Keats, J. B. (1997), Robustness of the exponential sequential probability
ratio test (SPRT) when weibull distributed failures are transformed using a ‘known’
shape parameter, Microelectronics Reliability, vol. 37, no. 12, pp. 1835–1840.
Henderson, T., and Tutt, M. (1997), Screening for early and rapid degradation in GaAs/Al-
GaAs HBTs, Proc. 35th IEEE International Reliability Physics Symposium,
pp. 253–260.
Henley, E. J., and Kumamoto, H. (1992), Probabilistic Risk Assessment: Reliability Engi-
neering, Design, and Analysis, IEEE Press, Piscataway, NJ.
Hines, W. W., Montgomery, D. C., Goldsman, D. M., and Borror, C. M. (2002), Proba-
bility and Statistics in Engineering, (4th ed.), Wiley, Hoboken, NJ.
Hirose, H. (1999), Bias correction for the maximum likelihood estimates in the two-
parameter Weibull distribution, IEEE Transactions on Dielectrics and Electrical Insu-
lation, vol. 6, no. 1, pp. 66–68.
Hobbs, G. K. (2000), Accelerated Reliability Engineering: HALT and HASS, Wiley, Chich-
ester, West Sussex, England.
Hwang, F. K. (2001), A new index of component importance, Operations Research Letters,
vol. 28, no. 2, pp. 75–79.
IEC (1985), Analysis Techniques for System Reliability: Procedure for Failure Mode and
Effects Analysis (FMEA), IEC 60812, International Electromechanical Commission,
Geneva, www.iec.ch.
— — —(1998, 2000), Functional Safety of Electrical/Electronic/Programmable Electronic
Safety-Related Systems, IEC 61508, International Electromechanical Commission,
Geneva, www.iec.ch.
IEEE Reliability Society (2006), http://www.ieee.org/portal/site/relsoc.
Ireson, W. G., Coombs, C. F., and Moss, R. Y. (1996), Handbook of Reliability Engi-
neering and Management, McGraw-Hill, New York.
Jeang, A. (1995), Economic tolerance design for quality, Quality and Reliability Engi-
neering International, vol. 11, no. 2, pp. 113–121.
Jensen, F. (1995), Electronic Component Reliability, Wiley, Chichester, West Sussex,
England.
Jensen, F., and Petersen, N. E. (1982), Burn-in: An Engineering Approach to the Design
and Analysis of Burn-in Procedures, Wiley, Chichester, West Sussex, England.
Jiang, G., Purnell, K., Mobley, P., and Shulman, J. (2003), Accelerated life tests and
in-vivo test of 3Y-TZP ceramics, Proc. Materials and Processes for Medical Devices
Conference, ASM International, pp. 477–482.
Johnson, R. A. (1998), Applied Multivariate Statistical Analysis, Prentice Hall, Upper
Saddle River, NJ.
Joseph, V. R., and Wu, C. F. J. (2004), Failure amplification method: an information
maximization approach to categorical response optimization (with discussion), Tech-
nometrics, vol. 46, no. 1, pp. 1–12.
Jung, M., and Bai, D. S. (2006), Analysis of field data under two-dimensional warranty,
Reliability Engineering and System Safety, to appear.
Kalbfleisch, J. D., Lawless, J. F., and Robinson, J. A. (1991), Method for the analysis
and prediction of warranty claims, Technometrics, vol. 33, no. 3, pp. 273–285.
Kalkanis, G., and Rosso, E. (1989), Inverse power law for the lifetime of a Mylar–polyure-
thane laminated dc hv insulating structure, Nuclear Instruments and Methods in Physics
Research, Series A: Accelerators, Spectrometers, Detectors and Associated Equipment,
vol. 281, no. 3, pp. 489–496.
Kaminskiy, M. P., and Krivtsov, V. V. (1998), A Monte Carlo approach to repairable
system reliability analysis, Proc. Probabilistic Safety Assessment and Management,
International Association for PSAM, pp. 1063–1068.
— — —(2000), G-renewal process as a model for statistical warranty claim prediction,
Proc. Reliability and Maintainability Symposium, pp. 276–280.
Kaplan, E. L., and Meier, P. (1958), Nonparametric estimation from incomplete observa-
tions, Journal of the American Statistical Association, vol. 53, pp. 457–481.
Lee, C. L. (2000), Tolerance design for products with correlated characteristics, Mecha-
nism and Machine Theory, vol. 35, no. 12, pp. 1675–1687.
Levin, M. A., and Kalal T. T. (2003), Improving Product Reliability: Strategies and Imple-
mentation, Wiley, Hoboken, NJ.
Lewis, E. E. (1987), Introduction to Reliability Engineering, Wiley, Hoboken, NJ.
Li, Q., and Kececioglu, D. B. (2003), Optimal design of accelerated degradation tests,
International Journal of Materials and Product Technology, vol. 20, no. 1–3,
pp. 73–90.
Li, R. S. (2004), Failure Mechanisms of Ball Grid Array Packages Under Vibration and
Thermal Loading, SAE Technical Paper Series 2004-01-1686, Society of Automotive
Engineers, Warrendale, PA.
Lomnicki, Z. A. (1966), A note on the Weibull renewal process, Biometrika, vol. 53, no. 3–4, pp. 375–381.
Lu, J. C., Park, J., and Yang, Q. (1997), Statistical inference of a time-to-failure distribution
derived from linear degradation data, Technometrics, vol. 39, no. 4, pp. 391–400.
Lu, M. W. (1998), Automotive reliability prediction based on early field failure warranty
data, Quality and Reliability Engineering International, vol. 14, no. 2, pp. 103–108.
Lu, M. W., and Rudy, R. J. (2000), Reliability test target development, Proc. Reliability
and Maintainability Symposium, pp. 77–81.
Manian, R., Coppit, D. W., Sullivan, K. J., and Dugan, J. B. (1999), Bridging the gap
between systems and dynamic fault tree models, Proc. Reliability and Maintainability
Symposium, pp. 105–111.
Mann, N. R. (1974), Approximately optimum confidence bounds on series and parallel
system reliability for systems with binomial subsystem data, IEEE Transactions on
Reliability, vol. R-23, no. 5, pp. 295–304.
Manson, S. S. (1966), Thermal Stress and Low Cycle Fatigue, McGraw-Hill, New York.
Marseguerra, M., Zio, E., and Cipollone, M. (2003), Designing optimal degradation tests
via multi-objective generic algorithms, Reliability Engineering and System Safety, vol.
79, no. 1, pp. 87–94.
Martz, H. F., and Waller, R. A. (1982), Bayesian Reliability Analysis, Wiley, Hoboken, NJ.
Meeker, W. Q. (1984), A comparison of accelerated life test plans for Weibull and log-
normal distributions and Type I censoring, Technometrics, vol. 26, no. 2, pp. 157–171.
Meeker, W. Q., and Escobar, L. A. (1995), Planning accelerated life tests with two or
more experimental factors, Technometrics, vol. 37, no. 4, pp. 411–427.
— — —(1998), Statistical Methods for Reliability Data, Wiley, Hoboken, NJ.
Meeker, W. Q., and Hahn, G. J. (1985), How to Plan an Accelerated Life Test: Some
Practical Guidelines, volume of ASQC Basic References in Quality Control: Statistical
Techniques, American Society for Quality, Milwaukee, WI, www.asq.org.
Meeker, W. Q., and Nelson, W. B. (1975), Optimum accelerated life-tests for the Weibull
and extreme value distributions, IEEE Transactions on Reliability, vol. R-24, no. 5,
pp. 321–332.
Meng, F. C. (1996), Comparing the importance of system components by some structural
characteristics, IEEE Transactions on Reliability, vol. 45, no. 1, pp. 59–65.
— — —(2000), Relationships of Fussell–Vesely and Birnbaum importance to structural
importance in coherent systems, Reliability Engineering and System Safety, vol. 67,
no. 1, pp. 55–60.
Menon, R., Tong, L. H., Liu, Z., and Ibrahim, Y. (2002), Robust design of a spin-
dle motor: a case study, Reliability Engineering and System Safety, vol. 75, no. 3,
pp. 313–319.
Meshkat, L., Dugan, J. B., and Andrews J. D. (2000), Analysis of safety systems with on-
demand and dynamic failure modes, Proc. Reliability and Maintainability Symposium,
pp. 14–21.
Mettas, A. (2000), Reliability allocation and optimization for complex systems, Proc.
Reliability and Maintainability Symposium, pp. 216–221.
Mettas, A., and Zhao, W. (2005), Modeling and analysis of repairable systems with general
repair, Proc. Reliability and Maintainability Symposium, pp. 176–182.
Misra, K. B. (1992), Reliability Analysis and Prediction: A Methodology Oriented Treat-
ment, Elsevier, Amsterdam, The Netherlands.
Misra, R. B., and Vyas, B. M. (2003), Cost effective accelerated testing, Proc. Reliability
and Maintainability Symposium, pp. 106–110.
Mogilevsky, B. M., and Shirn, G. (1988), Accelerated life tests of ceramic capacitors,
IEEE Transactions on Components, Hybrids and Manufacturing Technology, vol. 11,
no. 4, pp. 351–357.
Mok, Y. L., and Xie, M. (1996), Planning and optimizing environmental stress screening,
Proc. Reliability and Maintainability Symposium, pp. 191–198.
Montanari, G. C., Pattini, G., and Simoni, L. (1988), Electrical endurance of EPR insulated
cable models, Conference Record of the IEEE International Symposium on Electrical
Insulation, pp. 196–199.
Montgomery, D. C. (2001a), Introduction to Statistical Quality Control, Wiley,
Hoboken, NJ.
— — —(2001b), Design and Analysis of Experiments, 5th ed., Wiley, Hoboken, NJ.
Moore, A. H., Harter, H. L., and Sneed, R. C. (1980), Comparison of Monte Carlo
techniques for obtaining system reliability confidence limits, IEEE Transactions on
Reliability, vol. R-29, no. 4, pp. 178–191.
Murphy, K. E., Carter, C. M., and Brown, S. O. (2002), The exponential distribution: the
good, the bad and the ugly: a practical guide to its implementation, Proc. Reliability
and Maintainability Symposium, pp. 550–555.
Murthy, D. N. P., and Blischke, W. R. (2005), Warranty Management and Product Man-
ufacture, Springer-Verlag, New York.
Myers, R. H., and Montgomery, D. C. (2002), Response Surface Methodology: Process
and Product Optimization Using Designed Experiments, 2nd ed., Wiley, Hoboken, NJ.
Nachlas, J. A. (1986), A general model for age acceleration during thermal cycling,
Quality and Reliability Engineering International, vol. 2, no. 1, pp. 3–6.
Naderman, J., and Rongen, R. T. H. (1999), Thermal resistance degradation of surface
mounted power devices during thermal cycling, Microelectronics Reliability, vol. 39,
no. 1, pp. 123–132.
Nair, V. N. (1992), Taguchi’s parameter design: a panel discussion, Technometrics, vol.
34, no. 2, pp. 127–161.
Nair, V. N., Taam, W., and Ye, K. (2002), Analysis of functional responses from robust
design studies, Journal of Quality Technology, vol. 34, no. 4, pp. 355–371.
Natvig, B. (1979), A suggestion of a new measure of importance of system components,
Stochastic Processes and Their Applications, vol. 9, pp. 319–330.
Nelson, W. B. (1972), Theory and application of hazard plotting for censored failure data,
Technometrics, vol. 14, no. 4, pp. 945–966. (Reprinted in Technometrics, vol. 42, no.
1, pp. 12–25.)
— — —(1982), Applied Life Data Analysis, Wiley, Hoboken, NJ.
— — —(1985), Weibull analysis of reliability data with few or no failures, Journal of
Quality Technology, vol. 17, no. 3, pp. 140–146.
— — —(1990), Accelerated Testing: Statistical Models, Test Plans, and Data Analysis,
Wiley, Hoboken, NJ.
— — —(2003), Recurrent Events Data Analysis for Product Repairs, Diseases Recur-
rences, and Other Applications, ASA and SIAM, Philadelphia, PA, www.siam.org.
— — —(2004), paperback edition of Nelson (1990) with updated descriptions of software,
Wiley, Hoboken, NJ.
— — —(2005), A bibliography of accelerated test plans, IEEE Transactions on Reliability,
vol. 54, no. 2, pp. 194–197, and no. 3, pp. 370–373. Request a searchable Word file
from [email protected].
Nelson, W. B., and Kielpinski, T. J. (1976), Theory for optimum censored accelerated
life tests for normal and lognormal life distributions, Technometrics, vol. 18, no. 1, pp.
105–114.
Nelson, W. B., and Meeker, W. Q. (1978), Theory for optimum accelerated censored life
tests for Weibull and extreme value distributions, Technometrics, vol. 20, no. 2, pp.
171–177.
Neufeldt, V., and Guralnik, D. B., Eds. (1997), Webster’s New World College Dictionary,
3rd ed., Macmillan, New York.
Nielsen, O. A. (1997), An Introduction to Integration and Measure Theory, Wiley, Hobo-
ken, NJ.
Norris, K. C., and Landzberg, A. H. (1969), Reliability of controlled collapse intercon-
nections, IBM Journal of Research and Development, vol. 13, pp. 266–271.
O’Connor, P. D. T. (2001), Test Engineering: A Concise Guide to Cost-Effective Design,
Development and Manufacture, Wiley, Chichester, West Sussex, England.
— — —(2002), Practical Reliability Engineering, 4th ed., Wiley, Chichester, West Sus-
sex, England.
Oraee, H. (2000), Quantative approach to estimate the life expectancy of motor insulation
systems, IEEE Transactions on Dielectrics and Electrical Insulation, vol. 7, no. 6, pp.
790–796.
Ozarin, N. W. (2004), Failure modes and effects analysis during design of computer
software, Proc. Reliability and Maintainability Symposium, pp. 201–206.
Park, J. I., and Yum, B. J. (1997), Optimal design of accelerated degradation tests for
estimating mean lifetime at the use condition, Engineering Optimization, vol. 28, no.
3, pp. 199–230.
— — —(1999), Comparisons of optimal accelerated test plans for estimating quantiles of
lifetime distribution at the use condition, Engineering Optimization, vol. 31, no. 1–3,
pp. 301–328.
Peck, D. S. (1986), Comprehensive model for humidity testing correlation, Proc. 24th
IEEE International Reliability Physics Symposium, pp. 44–50.
Phadke, M. S., and Smith, L. R. (2004), Improving engine control reliability through
software optimization, Proc. Reliability and Maintainability Symposium, pp. 634–640.
Steinberg, D. S. (2000), Vibration Analysis for Electronic Equipment, 3rd ed., Wiley,
Hoboken, NJ.
Strifas, N., Vaughan, C., and Ruzzene, M. (2002), Accelerated reliability: thermal and
mechanical fatigue solder joints methodologies, Proc. 40th IEEE International Relia-
bility Physics Symposium, pp. 144–147.
Strutt, J. E., and Hall, P. L., Eds. (2003), Global Vehicle Reliability: Prediction and
Optimization Techniques, Professional Engineering Publishing, London.
Suh, N. P. (2001), Axiomatic Design: Advances and Applications, Oxford University Press,
New York.
Sumikawa, M., Sato, T., Yoshioka, C., and Nukii, T. (2001), Reliability of soldered
joints in CSPs of various designs and mounting conditions, IEEE Transactions on
Components and Packaging Technologies, vol. 24, no. 2, pp. 293–299.
Taguchi, G. (1986), Introduction to Quality Engineering, Asian Productivity Organization,
Tokyo.
— — —(1987), System of Experimental Design, Unipub/Kraus International, New York.
— — —(2000), Robust Engineering, McGraw-Hill, New York.
Taguchi, G., Konishi, S., Wu, Y., and Taguchi, S. (1987), Orthogonal Arrays and Linear
Graphs, ASI Press, Dearborn, MI.
Taguchi, G., Chowdhury, S., and Wu, Y. (2005), Taguchi's Quality Engineering Handbook,
Wiley, Hoboken, NJ.
Tamai, T., Miyagawa, K., and Furukawa, M. (1997), Effect of switching rate on contact
failure from contact resistance of micro relay under environment containing silicone
vapor, Proc. 43rd IEEE Holm Conference on Electrical Contacts, pp. 333–339.
Tang, L. C., and Xu, K. (2005), A multiple objective framework for planning accelerated
life tests, IEEE Transactions on Reliability, vol. 54, no. 1, pp. 58–63.
Tang, L. C., and Yang, G. (2002), Planning multiple levels constant stress accelerated life
tests, Proc. Reliability and Maintainability Symposium, pp. 338–342.
Tang, L. C., Yang, G., and Xie, M. (2004), Planning of step-stress accelerated degradation
test, Proc. Reliability and Maintainability Symposium, pp. 287–292.
Tanner, D. M., Walraven, J. A., Mani S. S., and Swanson, S. E. (2002), Pin-joint design
effect on the reliability of a polysilicon microengine, Proc. 40th IEEE International
Reliability Physics Symposium, pp. 122–129.
Teng, S. Y., and Brillhart, M. (2002), Reliability assessment of a high CBGA for high
availability systems, Proc. IEEE Electronic Components and Technology Conference,
pp. 611–616.
Thoman, D. R., Bain, L. J., and Antle, C. E. (1969), Inferences on the parameters of the
Weibull distribution, Technometrics, vol. 11, no. 3, pp. 445–460.
Tian, X. (2002), Comprehensive review of estimating system-reliability confidence-limits
from component-test data, Proc. Reliability and Maintainability Symposium, pp.
56–60.
Tijms, H. (1994), Stochastic Models: An Algorithmic Approach, Wiley, Hoboken, NJ.
Tseng, S. T., Hamada, M., and Chiao, C. H. (1995), Using degradation data to improve
fluorescent lamp reliability, Journal of Quality Technology, vol. 27, no. 4, pp. 363–369.
Tsui, K. L. (1999), Robust design optimization for multiple characteristic problems, Inter-
national Journal of Production Research, vol. 37, no. 2, pp. 433–445.
Tu, S. Y., Jean, M. D., Wang, J. T., and Wu, C. S. (2006), A robust design in hard-
facing using a plasma transfer arc, International Journal of Advanced Manufacturing
Technology, vol. 27, no. 9–10, pp. 889–896.
U.S. DoD (1984), Procedures for Performing a Failure Mode, Effects and Criticality
Analysis, MIL-STD-1629A, U.S. Department of Defense, Washington, DC.
— — —(1993), Environmental Stress Screening (ESS) of Electronic Equipment, MIL-
HDBK-344A, U.S. Department of Defense, Washington, DC.
— — —(1995), Reliability Prediction of Electronic Equipment, MIL-HDBK-217F, U.S.
Department of Defense, Washington, DC.
— — —(1996), Handbook for Reliability Test Methods, Plans, and Environments for Engi-
neering: Development, Qualification, and Production, MIL-HDBK-781, U.S. Depart-
ment of Defense, Washington, DC.
— — —(1998), Electronic Reliability Design Handbook, MIL-HDBK-338B, U.S. Depart-
ment of Defense, Washington, DC.
— — —(2000), Environmental Engineering Considerations and Laboratory Tests, MIL-
STD-810F, U.S. Department of Defense, Washington, DC.
— — —(2002), Test Method Standard for Electronic and Electrical Component Parts,
MIL-STD-202G, U.S. Department of Defense, Washington, DC.
— — —(2004), Test Method Standard for Microcircuits, MIL-STD-883F, U.S. Depart-
ment of Defense, Washington, DC.
Ushakov, I. E., Ed. (1996), Handbook of Reliability Engineering, Wiley, Hoboken, NJ.
Usher J. M., Roy, U., and Parsaei, H. R. (1998), Integrated Product and Process Devel-
opment, Wiley, Hoboken, NJ.
Vesely, W. E. (1970), A time dependent methodology for fault tree evaluation, Nuclear
Engineering and Design, vol. 13, no. 2, pp. 337–360.
Vesely, W. E., Goldberg, F. F., Roberts, N. H., and Haasl, D. F. (1981), Fault Tree
Handbook, U.S. Nuclear Regulatory Commission, Washington, DC.
Vining, G. G. (1998), A compromise approach to multiresponse optimization, Journal of
Quality Technology, vol. 30, no. 4, pp. 309–313.
Vlahinos, A. (2002), Robust design of a catalytic converter with material and manufactur-
ing variations, SAE Series SAE-2002-01-2888, Presented at the Powertrain and Fluid
Systems Conference and Exhibition, www.sae.org.
Vollertsen, R. P., and Wu, E. Y. (2004), Voltage acceleration and t63.2 of 1.6–10 nm gate
oxides, Microelectronics Reliability, vol. 44, no. 6, pp. 909–916.
Wang, C. J. (1991), Sample size determination of bogey tests without failures, Quality
and Reliability Engineering International, vol. 7, no. 1, pp. 35–38.
Wang, F. K., and Keats, J. B. (2004), Operating characteristic curve for the exponential
Bayes-truncated test, Quality and Reliability Engineering International, vol. 20, no. 4,
pp. 337–342.
Wang, W., and Dragomir-Daescu, D. (2002), Reliability qualification of induction motors:
accelerated degradation testing approach, Proc. Reliability and Maintainability Sympo-
sium, pp. 325–331.
Wang, W., and Jiang, M. (2004), Generalized decomposition method for complex systems,
Proc. Reliability and Maintainability Symposium, pp. 12–17.
Wang, W., and Loman, J. (2002), Reliability/availability of k-out-of-n system with M cold
standby units, Proc. Reliability and Maintainability Symposium, pp. 450–455.
Wang, Y., Yam, R. C. M., Zuo, M. J., and Tse, P. (2001), A comprehensive reliability
allocation method for design of CNC lathes, Reliability Engineering and System Safety,
vol. 72, no. 3, pp. 247–252.
Wen, L. C., and Ross, R. G., Jr. (1995), Comparison of LCC solder joint life predictions
with experimental data, Journal of Electronic Packaging, vol. 117, no. 2, pp. 109–115.
Whitesitt, J. E. (1995), Boolean Algebra and Its Applications, Dover Publications, New
York.
Willits, C. J., Dietz, D. C., and Moore, A. H. (1997), Series-system reliability-estimation
using very small binomial samples, IEEE Transactions on Reliability, vol. 46, no. 2,
pp. 296–302.
Wu, C. C., and Tang, G. R. (1998), Tolerance design for products with asymmetric quality
losses, International Journal of Production Research, vol. 36, no. 9, pp. 2529–2541.
Wu, C. F. J., and Hamada, M. (2000), Experiments: Planning, Analysis, and Parameter
Design Optimization, Wiley, Hoboken, NJ.
Wu, C. L., and Su, C. T. (2002), Determination of the optimal burn-in time and cost using
an environmental stress approach: a case study in switch mode rectifier, Reliability
Engineering and System Safety, vol. 76, no. 1, pp. 53–61.
Wu, H., and Meeker, W. Q. (2002), Early detection of reliability problems using infor-
mation from warranty databases, Technometrics, vol. 44, no. 2, pp. 120–133.
Wu, S. J., and Chang, C. T. (2002), Optimal design of degradation tests in presence of
cost constraint, Reliability Engineering and System Safety, vol. 76, no. 2, pp. 109–115.
Xie, M. (1989), On the solution of renewal-type integral equation, Communications in
Statistics, vol. B18, no. 1, pp. 281–293.
Yan, L., and English, J. R. (1997), Economic cost modeling of environmental-stress-
screening and burn-in, IEEE Transactions on Reliability, vol. 46, no. 2, pp. 275–282.
Yanez, M., Joglar, F., and Modarres, M. (2002), Generalized renewal process for analy-
sis of repairable systems with limited failure experience, Reliability Engineering and
System Safety, vol. 77, no. 2, pp. 167–180.
Yang, G. (1994), Optimum constant-stress accelerated life-test plans, IEEE Transactions
on Reliability, vol. 43, no. 4, pp. 575–581.
— — —(2002), Environmental-stress-screening using degradation measurements, IEEE
Transactions on Reliability, vol. 51, no. 3, pp. 288–293.
— — —(2005), Accelerated life tests at higher usage rate, IEEE Transactions on Relia-
bility, vol. 54, no. 1, pp. 53–57.
Yang, G., and Jin, L. (1994), Best compromise test plans for Weibull distributions with
different censoring times, Quality and Reliability Engineering International, vol. 10,
no. 5, pp. 411–415.
Yang, G., and Yang, K. (2002), Accelerated degradation-tests with tightened critical values,
IEEE Transactions on Reliability, vol. 51, no. 4, pp. 463–468.
Yang, G., and Zaghati, Z. (2002), Two-dimensional reliability modeling from warranty
data, Proc. Reliability and Maintainability Symposium, pp. 272–278.
— — —(2003), Robust reliability design of diagnostic systems, Proc. Reliability and
Maintainability Symposium, pp. 35–39.
— — —(2004), Reliability and robustness assessment of diagnostic systems from war-
ranty data, Proc. Reliability and Maintainability Symposium, pp. 146–150.
— — —(2006), Accelerated life tests at higher usage rates: a case study, Proc. Reliability
and Maintainability Symposium, pp. 313–317.
— — —, and Kapadia, J. (2005), A sigmoid process for modeling warranty repairs, Proc.
Reliability and Maintainability Symposium, pp. 326–330.
Yang, K., and El-Haik, B. (2003), Design for Six Sigma: A Roadmap for Product Devel-
opment, McGraw-Hill, New York.
Yang, K., and Xue, J. (1996), Continuous state reliability analysis, Proc. Reliability and
Maintainability Symposium, pp. 251–257.
Yang, K., and Yang, G. (1998), Robust reliability design using environmental stress testing,
Quality and Reliability Engineering International, vol. 14, no. 6, pp. 409–416.
Yang, S., Kobza, J., and Nachlas, J. (2000), Bivariate failure modeling, Proc. Reliability
and Maintainability Symposium, pp. 281–287.
Yassine, A. M., Nariman, H. E., McBride, M., Uzer, M., and Olasupo, K. R. (2000), Time
dependent breakdown of ultrathin gate oxide, IEEE Transactions on Electron Devices,
vol. 47, no. 7, pp. 1416–1420.
Ye, N., Ed. (2003), The Handbook of Data Mining, Lawrence Erlbaum Associates, Mah-
wah, NJ.
Yeo, C., Mhaisalka, S., and Pang, H. (1996), Experimental study of solder joint reliability
in a 256 pin, 0.4 mm pitch PQFP, Journal of Electronic Manufacturing, vol. 6, no. 2,
pp. 67–78.
Young, D., and Christou, A. (1994), Failure mechanism models for electromigration, IEEE
Transactions on Reliability, vol. 43, no. 2, pp. 186–192.
Yu, H. F. (1999), Designing a degradation experiment, Naval Research Logistics, vol. 46,
no. 6, pp. 689–706.
— — —(2003), Designing an accelerated degradation experiment by optimizing the esti-
mation of the percentile, Quality and Reliability Engineering International, vol. 19,
no. 3, pp. 197–214.
Yu, H. F., and Chiao, C. H. (2002), An optimal designed degradation experiment for
reliability improvement, IEEE Transactions on Reliability, vol. 51, no. 4, pp. 427–433.
INDEX
Binary state, 10, 174
Bogey ratio, 386
Bogey testing, 383
   binomial, 383
   Weibull, 385
Boolean algebra, 220
Bottom-up process, 106, 195, 221
Boundary diagram, 132
Burn-in, 413
Calendar time, 11, 446
Capacitor, 19, 70, 244, 260
Censoring, 266
   random, 267
   Type I (time), 266
   Type II (failure), 267
Center line, 478
Characteristic life, 20
Coefficient of variation, 27, 159
Coffin-Manson relationship, 256
Cold standby, 80
Common cause, 218
Complete life data, 267
Complexity, 108
Component importance, 99
   Birnbaum's measure, 100
   criticality importance, 102
   Fussell-Vesely's measure, 104
Compromise test plans
   one accelerating variable
      lognormal, 310
      Weibull, 304
   two accelerating variables
      lognormal, 321
      usage rate, 321
      Weibull, 314
Concept design, 2. See also System design
Concurrent engineering, 7
Confidence intervals
   exact, 279, 288
   for system reliability, 91
   function of parameters, 278
      percentile, 279, 283, 289
      probability of failure, 279, 282, 288
   lognormal-approximation, 95
   model parameters, 277
      exponential, 279
      lognormal, 288
      normal, 288
      Weibull, 282
   normal-approximation, 91
   one-sided, 92, 96, 277, 409
   two-sided, 92, 93, 96, 277, 282, 288–289
Confidence level, 383, 385, 390, 408, 409
Confirmation test, 132, 161
Connection, 427
Constant-stress testing, 241
Constant temperature, 246
Consumer's risk, 384, 396
Contact resistance, 14, 153, 246–247, 250
Control chart, 478
Control factor, 134, 244
   dispersion and mean adjustment factor, 156
   dispersion factor, 156
   insignificant factor, 156
   mean adjustment factor, 156
Control limit
   lower, 478
   upper, 478
Corrective action, 196, 204, 213, 234, 326
Correlation coefficient, 338, 342
Corrosion, 249
Cost model, 433
Creep, 247
Critical failure mode, 239, 272
Critical value, 43, 383
Cross array, 146
Cumulative distribution function (cdf), 12
Cumulative hazard function, 14
Current density, 250, 260, 266
Customer axis, 35
Customer desirability, 37
Customer expectations, 34, 42, 114
   basic want, 34
   excitement want, 34
   performance want, 34
Customer feedback analysis, 55
Customer satisfaction, 34–35, 38, 114
   degree of, 43
   minimum allowable, 43
Customer usage, 134, 272, 382
Customer wants, 34–35. See also Customer expectations
Cut set, 89, 217
Cyclic stress loading, 242
Data mining, 448
Decision rules, 395
Defects
   latent, 14, 55, 412
   patent, 14, 419, 477
Degradation, 332