Black-box testing using flowgraphs:
an experimental assessment of effectiveness and automation potential
STEPHEN H. EDWARDS*
DEPARTMENT OF COMPUTER SCIENCE, VIRGINIA TECH, 660 MCBRYDE HALL, BLACKSBURG, VA 24061-0106,
U.S.A.
Prepared for: Prof Manal Ismail
Prepared by: Amany Shousha & Mohammed Redwan Al-Jannan
Agenda
Definitions
What is the Problem
Proposed Solution and assumptions
Component Specification in RESOLVE
FlowGraph
Automatically Generating Test Cases
Definitions- 1
Idiom
Definition
Object Oriented
Testing
In object-oriented systems, testing encompasses three levels,
namely, unit testing, subsystem testing (a cluster of related
classes), and system testing.
Adequacy criteria
set of test obligations
the statement coverage adequacy criterion is satisfied by test suite S
for program P if each executable statement in P is executed by at least
one test case in S, and the outcome of each test execution was pass
Formal Specification
A formal software specification is a specification expressed in a
language whose vocabulary, syntax and semantics
are formally defined. This need for a formal definition means that
the specification languages must be based on mathematical concepts
whose properties are well understood.
Definitions- 2
Idiom
Definition
Programming by
contract
Design by contract is an approach for designing software. It prescribes that
software designers should define formal, precise and verifiable interface
specifications for software components, which extend the ordinary
definition of abstract data types with preconditions, post conditions and
invariants. These specifications are referred to as "contracts"
The DbC approach assumes all client components that invoke an operation on a
server component will meet the preconditions specified as required for that
operation. Where this assumption is considered too risky (as in multi-channel
client-server or distributed computing), the server component may instead
check its preconditions defensively.
Interface violation
In component-based software, interfaces (or specific actions) are separated
from implementations and serve as contracts between users and
implementers of components. In practice, many failures in component-based
systems arise because of interface violations among components, where one
party breaks the contract. Specifically, the violations occur when:
A client component fails to satisfy a requirement of a component it is
reusing, or
A component implementation fails to fulfill its obligations to the client
What is the Problem
Modular software construction through the assembly of independently
developed components is a popular approach in software engineering
Component Interface is separated from its implementation and is used as
a contract between the clients and the implementer(s) of the component
In practice, failures in component-based systems often arise because of
semantic interface violations among components, where one party
breaks the contract
Violations may not be discovered until system integration, or even after
deployment
Component-based development increases the need for more thorough testing
and for automated techniques that support testing activities.
Proposed Solution
Testing to the contract is at the heart of specification-based testing
Describes Strategy for generating black-box test sets for individual
software components
The strategy involves generating a flowgraph from a component's
specification
Applying analogues of traditional graph coverage techniques
Proposed Solution Assumption
A component must have a well-defined interface that is
clearly distinguishable from its implementation together
with a formal description of its intended behavior.
The research described here uses formally specified
interfaces described in RESOLVE
Component Specification in RESOLVE
concept Queue_Template
    context
        global context
            facility Standard_Boolean_Facility
        parametric context
            type Item
    interface
        type Queue is modeled by string of math[Item]
            exemplar q
            initialization
                ensures q = empty_string
        operation Enqueue (
                alters   q : Queue
                consumes x : Item )
            ensures q = #q * <#x>
        operation Dequeue (
                alters   q : Queue
                produces x : Item )
            requires q /= empty_string
            ensures <x> * q = #q
        operation Is_Empty (
                preserves q : Queue ) : Boolean
            ensures Is_Empty iff q = empty_string
end Queue_Template
Template: generic component
Queue: the data type being specified
Initialize: automatically invoked on declaration
Finalize: automatically invoked when the object goes out of scope
Swap: data movement, used instead of assignment
requires: precondition
ensures: postcondition
#q: the initial (incoming) value of q
<#x>: the one-entry string containing the initial value of x
Operations: Initialize, Enqueue, Dequeue, Is_Empty
Flow Graph -1
Each vertex represents one operation provided by the component
Directed edge from vertex v1 to vertex v2 indicates the possibility that
control may flow from v1 to v2
Intuitively, an arc from v1 to v2 means that there is some legal
sequence of operations on the component
For software components that provide initialization or finalization
operations (also called constructors or destructors), these operations
serve as the start and end nodes of the flowgraph
All of the parameters for a specific operation, including the implicit
self or this parameter in an object-oriented language, must be
considered
A definition occurs at a node for a given parameter if the operation may
potentially alter its value
A use occurs if the incoming value may affect the behavior of the operation
For each parameter in each operation, definitions and uses may be
directly deduced from the postcondition of the operation
Alternatively, definitions and uses may be deduced by examining
parameter modes: modes that correspond to in or in/out data flow
identify uses, while modes that correspond to out or in/out data flow
identify definitions
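The mode-based deduction above can be sketched in a few lines. This is an illustrative Python approximation, not the prototype's implementation; the mapping from RESOLVE parameter modes to data flow is an assumption based on the mode semantics summarized earlier (preserves = in, alters = in/out, produces = out, consumes = read then reset to an initial value).

```python
# Sketch: deriving definitions and uses from RESOLVE parameter modes.
# The mode-to-dataflow mapping below is an assumption, hand-derived
# from the mode descriptions in the text.

MODE_DATAFLOW = {
    "preserves": {"use"},          # in: value read, never changed
    "alters":    {"use", "def"},   # in/out: value read and may be changed
    "produces":  {"def"},          # out: a fresh value is produced
    "consumes":  {"use", "def"},   # value read, then reset to an initial value
}

def defs_and_uses(operation):
    """Return (definitions, uses) for an operation given as
    {"name": ..., "params": [(param_name, mode), ...]}."""
    defs, uses = set(), set()
    for name, mode in operation["params"]:
        flows = MODE_DATAFLOW[mode]
        if "def" in flows:
            defs.add(name)
        if "use" in flows:
            uses.add(name)
    return defs, uses

# The Queue_Template operations from the RESOLVE specification above
enqueue = {"name": "Enqueue", "params": [("q", "alters"), ("x", "consumes")]}
dequeue = {"name": "Dequeue", "params": [("q", "alters"), ("x", "produces")]}
is_empty = {"name": "Is_Empty", "params": [("q", "preserves")]}

print(defs_and_uses(enqueue))   # defs {q, x}, uses {q, x}
print(defs_and_uses(dequeue))   # defs {q, x}, uses {q}
print(defs_and_uses(is_empty))  # no defs, uses {q}
```

Note that this uses only the operation signatures, which is what later makes graceful degradation to informal specifications possible.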
Flow Graph -2
Given such a flowgraph, potential testing strategies become
evident
describe natural analogues of white-box control- and data-flow testing
strategies adapted to black-box flowgraphs including
node coverage
branch coverage
definition coverage
use coverage
DU-path coverage (definition-use)
k-length path coverage
Unlike program-level testing, where a test case consists of input data for
the program, here a test case corresponds to a sequence of operation
invocations with associated parameter values.
Because branches in the graph represent different choices for
method calls in a sequence, it is easier to generate test cases
that cover any given branch.
Flow Graph -3 Infeasible path
In white-box control flowgraphs, an infeasible path may result from an
unreachable node or edge
For specifications, unreachable nodes (i.e. operations that can never
be called under any circumstances) are exceedingly rare in practice
but infeasible paths arising from incompatible sequences of
operations are common
Each edge (say, from v1 to v2) in the flowgraph indicates that there is
some legitimate object lifetime that includes v1 followed by v2; it
specifically does not imply that every lifetime containing v1 followed by
v2 is legitimate.
Queue Example: Initialize >> Enqueue(q, x) >> Dequeue(q, x)
>> Dequeue(q, x) >> Finalize
This path is infeasible because it violates the precondition of
Dequeue
In effect, every legitimate object lifetime is represented by some path in
the graph, but not all paths in the graph correspond to legitimate object
lifetimes. Such paths are infeasible.
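The infeasible Queue path above can be recognized mechanically by executing the sequence against a model that checks preconditions. A minimal sketch, assuming a toy Python model of the queue (class and function names are illustrative, not from the paper):

```python
# Sketch: detecting an infeasible operation sequence at run time.
# A tiny queue model with an explicit precondition check stands in
# for the RESOLVE specification.

class InfeasibleTestCase(Exception):
    """Raised when a test sequence violates an operation precondition."""

class SpecQueue:
    def __init__(self):            # Initialize: ensures q = empty_string
        self.items = []

    def enqueue(self, x):          # Enqueue: no precondition
        self.items.append(x)

    def dequeue(self):             # Dequeue: requires q /= empty_string
        if not self.items:
            raise InfeasibleTestCase("Dequeue called on an empty queue")
        return self.items.pop(0)

def run_sequence(ops):
    """Execute a candidate test sequence; report whether it is feasible."""
    q = SpecQueue()
    try:
        for op, *args in ops:
            getattr(q, op)(*args)
        return True
    except InfeasibleTestCase:
        return False

# The infeasible path from the slide: two Dequeues after one Enqueue
path = [("enqueue", 1), ("dequeue",), ("dequeue",)]
print(run_sequence(path))  # False: violates Dequeue's precondition
```

This is exactly the filtering role the violation detection wrappers of Section 4 play for the generated test sets.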
Component Diagram Sample
Library Management System
Register Page (visitor / vendor)
Login Page (user / librarian / vendor)
Search Page (user / librarian / vendor)
Request Vendor Page (librarian)
Request Book Issue Page (user / vendor)
Issue Status Page (librarian)
Make Payment Page (librarian / vendor)
Provide Books Page (librarian)
Logout Page (user / librarian / vendor)
Search Component
Precondition: Login
Request Book Component
Precondition: Search
Issue Component
Precondition: Request Book
Automatically Generating Test Cases
select an adequacy criterion
Generating flowgraphs
Enumerating paths
Choosing parameter values
Graceful degradation for informal
specifications
1- select an adequacy criterion
All nodes: every node in a flow graph, constructed from the specification of the class to be tested
must be exercised at least once
All branches
All definitions: some test case covers each definition of each variable
All uses: for every use of each variable, some test case covers a path from a
definition of that variable to that use.
All DU paths: This is the strongest data flow testing strategy. Every du path from every definition
of every variable to every use of that definition
All k-length paths
All paths: all paths leading from the initial to the final node
Three of the remaining criteria were selected for implementation
in the initial prototype: all nodes, all definitions and all uses.
2- Generating flowgraphs
In generating a flowgraph, identifying nodes, definitions and uses is
straightforward
The primary issue is how to decide correctly and efficiently which edges
should be included in the graph
The prototype uses a simplistic solution
feasible and infeasible edges
Experience with RESOLVE-specified components indicates that the vast majority of operations
have relatively simple preconditions
although postconditions are usually more complex
As a result, it is practical in many instances to match the precondition for one
operation structurally with one clause in the postcondition of another
This allows one to identify a subset of the edges that are always feasible (or always
infeasible) for every object lifetime
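The structural matching described above can be sketched as a simple lookup. In this illustrative Python fragment, the condition strings and the set of facts each postcondition establishes are hand-derived from the Queue_Template spec; the prototype's actual matcher works syntactically on the specifications rather than on hand-built tables.

```python
# Sketch: classifying flowgraph edges by matching one operation's
# precondition against facts guaranteed by another's postcondition.
# Tables are hand-derived from the Queue_Template specification.

PRECONDITIONS = {
    "Initialize": "true",
    "Enqueue":    "true",
    "Dequeue":    "q /= empty_string",
    "Is_Empty":   "true",
}

# Facts each operation's postcondition establishes about q afterwards.
POST_GUARANTEES = {
    "Initialize": set(),                  # q = empty_string
    "Enqueue":    {"q /= empty_string"},  # q = #q * <#x> is never empty
    "Dequeue":    set(),                  # q may or may not be empty after
    "Is_Empty":   set(),                  # q unchanged, nothing new established
}

def classify_edge(v1, v2):
    pre = PRECONDITIONS[v2]
    if pre == "true":
        return "always feasible"   # nothing to violate
    if pre in POST_GUARANTEES[v1]:
        return "always feasible"   # v1's postcondition establishes v2's precondition
    return "difficult"             # needs per-edge or run-time analysis

print(classify_edge("Enqueue", "Dequeue"))     # always feasible
print(classify_edge("Dequeue", "Dequeue"))     # difficult
print(classify_edge("Initialize", "Dequeue"))  # difficult (in fact infeasible)
```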
2- Generating flowgraphs
Remaining edges are difficult edges; there are three options:
Omit all difficult edges >> ensures no infeasible edges are included. The cost of
this conservatism is the exclusion of some feasible edges, and hence of
desirable test cases; such conservatism rarely leads to desirable results
Decide on an edge-by-edge basis >> cost + effort + human error
Include all difficult edges >> inclusion of some infeasible edges, and hence the
inclusion of undesirable test cases that force operations to be exercised when
their preconditions are false >> risky, but it is possible to screen out test cases
that exercise infeasible edges automatically
Experience with the prototype suggests that it is much easier to include more test cases than
necessary at generation time and automatically weed out infeasible cases later
3- Enumerating paths
All three of the criteria used in the prototype generate a set of easily identifiable
paths for coverage
One test frame can be generated for each such path P (from v1 to v2)
Initialization subpath for P: from the Initialize node to v1
Finalization subpath for P: from v2 to the Finalize node
Compose the three paths to form the sequence of operations in the test frame
The primary difficulty is again satisfiability: one must not choose an
infeasible object lifetime.
Solution 1: generate all test frames, then filter out the infeasible paths later
Drawback: this is less than ideal because of the large number of infeasible test frames produced
Example: Initialize >> Enqueue(q, x) >> Dequeue(q,x) >> Dequeue(q, x) >> Finalize is the
test frame generated when v1 = v2 = Dequeue
3- Enumerating paths
Solution 2:
Compute the initialization subpath for v1 (Initialize > v1,1 > ... > v1,m > v1)
Compute the initialization subpath for v2 (Initialize > v2,1 > ... > v2,n > v2)
Use (Initialize > v1,1 > ... > v1,m > v2,1 > ... > v2,n > v1) as the initialization
subpath for P, provided that the edges v1,m > v2,1 and v2,n > v1 exist.
Extensions are possible when P contains more than two nodes
Solution 3:
Modify the method of selecting initialization and finalization subpaths by weighting
difficult edges, so that paths with a minimum number of difficult edges are selected.
These heuristics are not, however, guaranteed to produce feasible paths
Finally, one should note that this enumeration strategy typically results in naive test
frames: those with the minimal number of operations and minimal number of distinct objects
necessary. The result is a larger number of fairly small, directed test cases that use as little
additional information as possible.
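Solution 2 above can be sketched directly. In this illustrative Python fragment the edge set and the per-operation initialization subpaths follow the queue flowgraph; the function name is an assumption, not from the paper. For the path P = (Dequeue, Dequeue), the composition repairs the infeasible frame shown earlier.

```python
# Sketch: Solution 2's composition of initialization subpaths for a
# two-node path P = (v1, v2). Edge set is hand-built for the queue.

EDGES = {
    ("Initialize", "Enqueue"), ("Initialize", "Is_Empty"),
    ("Enqueue", "Enqueue"), ("Enqueue", "Dequeue"), ("Enqueue", "Is_Empty"),
    ("Dequeue", "Dequeue"), ("Dequeue", "Enqueue"), ("Dequeue", "Is_Empty"),
    ("Is_Empty", "Enqueue"), ("Is_Empty", "Dequeue"),
}

def composed_init_subpath(path1, path2):
    """path1 = [Initialize, v1_1, ..., v1_m, v1]; path2 likewise for v2.
    Returns the merged initialization subpath for P, or None when the
    required connecting edges do not exist."""
    *prefix1, v1 = path1       # prefix1 ends at v1_m (or Initialize if m = 0)
    mid2 = path2[1:-1]         # v2_1 ... v2_n
    if not mid2:
        return path1           # v2 needs no setup beyond Initialize
    if (prefix1[-1], mid2[0]) not in EDGES:   # edge v1_m > v2_1
        return None
    if (mid2[-1], v1) not in EDGES:           # edge v2_n > v1
        return None
    return prefix1 + mid2 + [v1]

# For P = (Dequeue, Dequeue): two Enqueues are set up, so the full frame
# Initialize > Enqueue > Enqueue > Dequeue > Dequeue > Finalize is feasible.
init = composed_init_subpath(
    ["Initialize", "Enqueue", "Dequeue"],
    ["Initialize", "Enqueue", "Dequeue"])
print(init)  # ['Initialize', 'Enqueue', 'Enqueue', 'Dequeue']
```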
4- Choosing parameter values
One must instantiate the test frames with specific parameter values
This is a difficult problem for scalar parameters
Random
Boundary Value Analysis (BVA): added so that scalars with easily describable
operational domains could be more effectively supported
Even more difficult when testing generic components that may have potentially
complex data structures as parameters
In the end, the question of satisfiability for specific parameter values is
completely sidestepped: test cases are simply generated, and those that
are infeasible are filtered out later
The side-effect of this is more critical: legitimate test frames may be thrown out
in practice because of the naive selection of parameter values
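The random-plus-BVA selection described above can be sketched as follows. The particular boundary picks and the function names are illustrative assumptions; the paper does not prescribe this exact recipe.

```python
# Sketch: instantiating a test frame's scalar parameters with classic
# boundary-value picks plus random fill-in over a describable domain.

import random

def boundary_values(lo, hi):
    """Classic BVA picks: the bounds, their neighbours, and a midpoint."""
    return [lo, lo + 1, (lo + hi) // 2, hi - 1, hi]

def choose_values(lo, hi, extra_random=2, seed=42):
    """BVA values plus a few random draws from the same domain."""
    rng = random.Random(seed)   # seeded for reproducible test sets
    vals = boundary_values(lo, hi)
    vals += [rng.randint(lo, hi) for _ in range(extra_random)]
    return vals

print(choose_values(0, 100))
```

Note how this only helps for scalars with simple operational domains; for generic components with complex structured parameters, the generate-then-filter approach remains the fallback.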
5- Graceful degradation for
informal specifications
The prototype assumes that components have formal specifications
The author claims that semi-formal or informal specifications can also be
supported
One can generate a flowgraph from as little information as a set of operation
signatures that includes parameter modes
One can liberally include edges in the flowgraph and then generate test cases
in the normal fashion.
This runs the risk of generating infeasible test cases, but automatic detection
and filtering make this option practical
Suggestion
I think one can use UML diagrams (sequence, component) instead of a
formal specification; they are easy to use, and the test cases
can be generated easily
4. Automatic detection of interface violations
The effectiveness of the testing strategy described here hinges in great part on
automatically detecting interface contract violations for the component under test.
Previous work describes a strategy for building wrapper components to
perform this function.
An interface violation detection wrapper (or detection wrapper, for short)
is a decorator that provides exactly the same client interface as the component
it encases.
In essence, such a wrapper performs a run-time precondition check for each
operation it implements before delegating the call to the wrapped component
under test.
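The decorator structure described here can be sketched for the queue. The contract checks below are hand-translated from the RESOLVE specification; RawQueue and the class names are illustrative stand-ins for the (possibly faulty) implementation under test, not the paper's wrapper architecture.

```python
# Sketch: an interface violation detection wrapper for the queue,
# checking pre- and postconditions around every delegated call.

class ContractViolation(Exception):
    pass

class RawQueue:
    """Minimal implementation under test."""
    def __init__(self):
        self.items = []
    def enqueue(self, x):
        self.items.append(x)
    def dequeue(self):
        return self.items.pop(0)

class QueueDetectionWrapper:
    """Same client interface as the wrapped queue; checks the contract
    around every delegated call."""
    def __init__(self, wrapped):
        self._q = wrapped

    def enqueue(self, x):
        old = list(self._q.items)          # snapshot #q
        self._q.enqueue(x)
        if self._q.items != old + [x]:     # ensures q = #q * <#x>
            raise ContractViolation("Enqueue postcondition violated")

    def dequeue(self):
        if not self._q.items:              # requires q /= empty_string
            raise ContractViolation("Dequeue precondition violated")
        old = list(self._q.items)
        x = self._q.dequeue()
        if [x] + self._q.items != old:     # ensures <x> * q = #q
            raise ContractViolation("Dequeue postcondition violated")
        return x

q = QueueDetectionWrapper(RawQueue())
q.enqueue(7)
assert q.dequeue() == 7
try:
    q.dequeue()                # precondition violated: infeasible test case
except ContractViolation as e:
    print(e)
```

A precondition violation raised here flags an infeasible test case, while a postcondition violation flags a fault in the wrapped implementation, which is exactly the dual role described below.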
4. Automatic detection of interface violations cont.
After the underlying component completes its work, the detection wrapper
performs a run-time post-condition check on the results.
In addition, a detection wrapper may also check component-level invariant
properties before and after the delegated call.
Edwards et al. describe a novel architecture for the structure of detection
wrappers and discuss the potential for partially automated wrapper
construction.
Encasing a component under test in a detection wrapper has many significant
benefits.
A precondition violation identified by the wrapper indicates an invalid (infeasible)
test case.
This is the primary means for addressing the satisfiability issues raised in Section 3.
A post-condition violation identified by the detection wrapper indicates a failure that
has occurred within the component.
4. Automatic detection of interface violations cont.
Further, this indication will be raised in the operation or method where the failure
occurred, whether or not the failure would be detected by observing the top-level
output produced for the test case.
Finally, invariant checking ensures that internal faults that manifest themselves via
inconsistencies in an object's state will be detected at the point where they occur.
Without invariant checking, such faults would require observable differences in
output produced by subsequent operations in order to be detected.
Finally, if the component under test is built on top of other components, one should
also encase those lower-level components in violation detection wrappers (at least
wrappers that check preconditions).
4. Automatic detection of interface violations cont.
This is necessary to spot instances where the component under test violates its
client-level obligations in invoking the methods of its collaborators.
Overall, the use of detection wrappers significantly magnifies the fault-detection
ability of any testing strategy.
That is critically important for specification-based approaches, which in theory
cannot guarantee that all reachable statements within the component are executed
and thus cannot subsume many white-box adequacy criteria.
The use of violation detection wrappers can lead to an automated testing approach
that has a greater fault revealing capability than traditional black-box strategies.
5. An experimental assessment
As discussed in Section 3, a prototype test set generator for three of Zweben
et al.'s adequacy criteria has been implemented.
The design of the prototype included several tradeoffs, some of which are
quite simplistic, that might adversely affect the usefulness of the approach.
At the same time, the only empirical test of the fault-detecting ability of test
sets following this approach is the original analysis reported by Zweben et al.
[2].
As a result, an experiment was designed and carried out to examine the
fault-detecting ability of test sets generated using this approach
specifically, those created using the tradeoffs embodied in the prototype.
5.1. Method
To measure the fault-detecting effectiveness of the test sets under
consideration, one can:
1. select a set of components for analysis;
2. use fault injection techniques to generate buggy versions of the components;
3. generate a test set for each of the three criteria, for each of the components;
4. execute each buggy version of each subject component on each test set;
5. check the test outputs (only) to determine which faults were revealed without using
violation detection wrappers;
6. check the violation detection wrapper outputs to determine which faults were
revealed by precondition, post-condition and invariant checking
5.1. Method cont.
Four RESOLVE-specified components were selected for this study: a queue, a stack,
a one way list and a partial map.
All are container data structures with implementations ranging from simple to fairly
complex.
The queue and stack components both use singly linked chains of dynamically
allocated nodes for storage.
The primary differences are that the queue uses a sentinel node at the beginning of
its chain and maintains pointers to both ends of the chain, while the stack maintains
a single pointer and does not use a sentinel node.
The one-way list also uses a singly linked chain of nodes with a sentinel node at the
head of the chain.
Internally, it also maintains a pointer to the last node in the chain together with
a pointer to represent the current list position. The partial map is the most complex
data structure in the set
5.1. Method cont.
It is essentially a dictionary that stores domain/range pairs.
Internally, its implementation uses a fixed-size hash table with one-way lists
for buckets.
The ordering of elements in an individual bucket is unimportant; buckets are
searched linearly.
Although these components are all relatively small, the goal is to support the
testing of software components, including object-oriented classes.
The components selected here are more representative of classes that
maintain complex internal state than they are of complete programs, which is
appropriate.
5.1. Method cont.
In addition, the size of the subject components was important in supporting a
comprehensive fault injection strategy without incurring undue cost.
The fault injection strategy chosen was based on expression-selective mutation
testing
This version of mutation testing uses only five mutation operators: ABS, AOR, LCR,
ROR and UOI [10]; it dramatically reduces the number of mutants generated, but has
been experimentally shown to achieve almost full mutation coverage [9].
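To give the flavour of these operators, here is an illustrative Python fragment (not from the paper) showing two ROR (relational operator replacement) mutants of an emptiness check. The first mutant is equivalent, since len() is never negative, and would be weeded out by hand as described below.

```python
# Sketch of expression-selective mutation: each mutant applies one
# operator at one site. ROR = relational operator replacement.

def is_empty(items):
    return len(items) == 0     # original

def is_empty_ror_le(items):
    return len(items) <= 0     # ROR: '==' replaced by '<=' (equivalent mutant)

def is_empty_ror_ne(items):
    return len(items) != 0     # ROR: '==' replaced by '!=' (killable mutant)

# A test on an empty container kills the second mutant but not the first.
print(is_empty([]), is_empty_ror_le([]), is_empty_ror_ne([]))
```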
5.1. Method cont.
Each mutant was generated by applying one mutation operator at a unique location in
one of the methods supported by a component.
The number of generated mutants was small enough to admit hand-identification of
equivalent mutants.
All remaining mutants were guaranteed to differ from the original program on some
legitimate object lifetime in a manner observable through the parameter values
returned by at least one method supported by the component.
Table I summarizes the characteristics of the subject components and the number of faulty
versions generated.
5.2. Results
Because the goal was to assess test sets generated using the specific heuristics
described in this paper rather than the more general goal of assessing the adequacy
criteria themselves, the experiment was limited to test sets produced by the
prototype.
However, the prototype uses a deterministic process to generate test frames, so
producing multiple test sets for the same component and criterion would only result
in random variations in parameter values, rather than any substantial variations in
test case structure.
Since all of the subject components are data structures that are relatively insensitive
to the values of the items they contain, such superficial variations in generated test
sets were not explored. As a result, a total of 12 test sets were generated, one for
each of the chosen adequacy criteria for each of the subject components.
Inclusion of difficult edges in the corresponding flow-graphs was decided by hand
on a per-edge basis.
5.2. Results cont.
For each subject component, the three corresponding test sets were run against
every faulty version.
Table II summarizes the results. Observed failures indicates the number of
mutants killed by the corresponding test case based solely on observable output
(without considering violation detection wrapper checks).
Detected violations indicates the number of mutants killed solely by using the
invariant and post-condition checking provided by the subject's detection
wrapper. In theory, any failure identifiable from observable output will
also be detected by the component's post-condition checking wrapper if the
wrapper is implemented correctly; the data collected were consistent with this
expectation, so every observed failure was also a detected violation.
The rightmost column lists the number of infeasible test cases produced by the
prototype in each test set
5.3. Discussion
On the subject components, it is clear that all definitions reveals the fewest faults
of the three criteria studied, while all uses reveals the most, which is no surprise.
More notable is the magnification of fault-detecting power supplied by the use of
violation detection wrappers.
In all cases, the use of detection wrappers significantly increased the number of
mutants killed.
Further, the increase was more dramatic with weaker test sets.
In the extreme case of the all nodes test set for the one-way list, its fault-detection ability was doubled.
5.3. Discussion cont.
Also surprising is the fact that for three of the four subjects, the all uses test sets
achieved a 100 per cent fault detection rate with the use of detection wrappers.
This unusual result bears some interpretation.
It appears to be the result of two interacting factors.
First, the invariant checking performed by each detection wrapper is quite
extensive (partly because formal statements of representation invariants were
available for the subjects), providing the magnification effect discussed above.
Second, the simpler subject components involved significantly fewer complex
logic conditions and nested control constructs.
As a result, the all nodes and all uses test sets for the stack, queue and one-way
list components actually achieved 100 per cent white-box statement-level
coverage of all statements where mutation operators were applied
5.3. Discussion cont.
This fact was later confirmed via code instrumentation.
Presumably, this is atypical of most components, so the 100 per cent fault
detection results for all uses should not be unduly generalized.
Nevertheless, the fact that a relatively weak adequacy criterion could lead to such
effective fault revelation is a promising sign for this technique.
The only other empirical study of specification-based testing based on these
criteria has been reported by Zweben et al. [2].
Their work reports a small study on 25 versions of a two-way list component
written by undergraduate students. While small in scale,
5.3. Discussion cont.
their results give some indication of the fault-detecting ability of the various
specification-based control and data flow criteria on defects that occurred in real components.
In that study, all nodes revealed 6 out of 10 defects, all definitions revealed 6 out of
10, and all uses revealed 8 out of 10, all of which are comparable to the Observed
failures results here.
Although the results of this experiment are promising, there are also important threats
to the validity of any conclusions drawn from it.
The subjects were limited in size for practical reasons. Although well-designed classes
in typical OO designs are often similar in size, it is not clear how representative the
subjects are in size or logical complexity. Also, the question of how well mutation-based
fault injection models real-world faults is relevant, and has implications for any
interpretation of the results.
With respect to the adequacy criteria themselves, this experiment only aims to assess
test sets generated using the strategy described in this paper, rather than aspiring to a
more sweeping assessment of the fault detecting ability of the entire class of test sets
meeting a criterion
6. RELATED WORK
The test set generation approach described here has been incorporated into an
end-to-end test automation strategy that also includes generation of component test
drivers and partial-to-full automation of violation detection wrappers.
The most relevant related projects in automated generation of specification-based
test sets are DAISTS [12] and ASTOOT.
One key difference with the current work is that model-based specifications are used
while DAISTS and ASTOOT are based on algebraic specifications.
6. RELATED WORK cont.
Algebraic specifications often encourage the use of function-only operations and may
suppress any explicit view of the content stored in an object.
Model-based specifications instead focus on abstract modelling of that content,
direct support for state-modifying methods, and direct support for operations with
relational behaviour.
Another notable difference is the use of violation detection wrappers in this
approach, which provides increased fault-detecting ability.
6. RELATED WORK cont.
Other specification-based testing or test generation approaches focus on finite state
machine (FSM) models of classes or programs [4,13-15].
The work of Hoffman, Strooper and their colleagues is of particular relevance
because it also uses model-based specifications.
An FSM model typically contains a subset of the states and transitions supported by
the actual component under consideration, and may be developed by identifying
equivalence classes of states that behave similarly.
Test coverage is gauged against the states and transitions of the model.
The work described here, by contrast, does not involve collapsing the state space of
the component under test, or even directly modelling it.
As a result, it gracefully degrades when only semi- or informal specifications are
available. In addition, the use of violation detection wrappers sets the current work
apart from FSM approaches
6. RELATED WORK cont.
Other work on test data adequacy, including both specification-based and white-box
criteria, is surveyed in detail by Zhu et al.
The focus of the current work is to develop and assess practical test set generation
strategies based on existing criteria, rather than describing new criteria. Similarly,
Edwards et al.
provide a more detailed discussion of interface violation detection wrappers and the
work related to them, including alternative approaches to run-time post-condition
checking
6. RELATED WORK cont.
There is a large body of work on experimentally assessing the fault-detecting
ability of various testing strategies or adequacy criteria .
Frankl's work on statistically characterizing the effectiveness of adequacy criteria
is notable.
Whereas her work focuses on statistically assessing the effectiveness of criteria
by looking at large numbers of test sets, the work here instead aims at assessing
test sets created using a specific strategy.
7. CONCLUSIONS
This article describes a specification-based test set generation strategy based on
Zweben et al.'s specification-based test data adequacy criteria.
The strategy involves generating a flowgraph from a component's specification, and
then applying analogues of white-box strategies to the graph.
Although there are a number of very difficult issues related to satisfiability involved in
generating test data, a prototype test set generator was implemented using specific
design tradeoffs to overcome these obstacles.
An experimental assessment of fault-detecting ability based on expression-selective
mutation analysis provided very promising results.
By using precondition, post-condition, and invariant checking wrappers around the
component under test, fault detection ratios competitive with white-box techniques
were achieved.
The results of the experiment, together with experiences with the generator, indicate
that there is the potential for practical automation of this strategy.