
1

Introduction

As the size and complexity of software are increasing, software organizations are facing
the pressure of delivering high-quality software within a specific time, budget, and avail-
able resources. The software development life cycle consists of a series of phases, includ-
ing requirements analysis, design, implementation, testing, integration, and maintenance.
Software professionals want to know which tools to use at each phase in software devel-
opment and desire effective allocation of available resources. The software planning team
attempts to estimate the cost and duration of software development, the software testers
want to identify the fault-prone modules, and the software managers seek to know which
tools and techniques can be used to reduce the delivery time and best utilize the man-
power. In addition, the software managers also desire to improve the software processes
so that the quality of the software can be enhanced. Traditionally, software engineers
have made decisions based on their intuition or individual expertise, without any
scientific evidence of the benefits of a tool or a technique.
Empirical studies are verified by observation or experiment and can provide powerful
evidence for testing a given hypothesis (Aggarwal et al. 2009). Like other disciplines, soft-
ware engineering has to adopt empirical methods that will help to plan, evaluate, assess,
monitor, control, predict, manage, and improve the way in which software products are
produced. An empirical study of real systems can help software organizations assess
large software systems quickly, at low costs. The application of empirical techniques is
especially beneficial for large-scale systems, where software professionals need to focus
their attention and resources on various activities of the system under development.
For example, developing a model for predicting faulty modules allows software organiza-
tions to identify faulty portions of source code so that testing activities can be planned
more effectively. Empirical studies, such as surveys, systematic reviews, and experimental
studies, help software practitioners to scientifically assess and validate the tools and tech-
niques in software development.
In this chapter, an overview and the types of empirical studies are provided, the phases
of the experimental process are described, and the ethics involved in empirical research
of software engineering are summarized. Further, this chapter also discusses the key con-
cepts used in the book.

1.1 What Is Empirical Software Engineering?


The initial debate on whether software development is an engineering discipline is now over. It has been
realized that treating software development as an engineering discipline is essential. Engineering requires
that the product be developed in a scientific, well-formed, and systematic manner. Core
engineering principles should be applied to produce good-quality, maintainable software


within a specified time and budget. Fritz Bauer coined the term software engineering in 1968 at
the first conference on software engineering and defined it as (Naur and Randell 1969):
The establishment and use of sound engineering principles in order to obtain economically
developed software that is reliable and works efficiently on real machines.

Software engineering is defined by the IEEE Computer Society as (Abran et al. 2004):


The application of a systematic, disciplined, quantifiable approach to the development,
operation and maintenance of software, and the study of these approaches, that is, the
application of engineering to software.

The software engineering discipline facilitates the completion of the objective of delivering
good quality software to the customer following a systematic and scientific approach.
Empirical methods can be used in software engineering to provide scientific evidence on
the use of tools and techniques.
Harman et al. (2012a) defined “empirical” as:
“Empirical” is typically used to define any statement about the world that is related to
observation or experience.

Empirical software engineering (ESE) is an area of research that emphasizes the use of empir-
ical methods in the field of software engineering. It involves methods for evaluating, assess-
ing, predicting, monitoring, and controlling the existing artifacts of software development.
ESE applies quantitative methods to the software engineering phenomenon to understand
software development better. ESE has been gaining importance over the past few decades
because of the availability of vast data sets from open source repositories that contain
information about software requirements, bugs, and changes (Meyer et al. 2013).

1.2 Overview of Empirical Studies


An empirical study is an attempt to compare theories and observations using real-life
data for analysis. Empirical studies usually utilize data analysis methods and statistical
techniques for exploring relationships. They play an important role in software engineer-
ing research by helping to form well-formed theories and widely accepted results. The
empirical studies provide the following benefits:

• Allow researchers to explore relationships
• Allow researchers to prove theoretical concepts
• Allow researchers to evaluate the accuracy of models
• Allow researchers to choose among tools and techniques
• Allow researchers to establish quality benchmarks across software organizations
• Allow researchers to assess and improve techniques and methods

Empirical studies are important in the area of software engineering as they allow software
professionals to evaluate and assess new concepts, technologies, tools, and techniques
in a scientific and proven manner. They also help in improving, managing, and controlling
existing processes and techniques by using evidence obtained from empirical analysis.
The empirical information can help software management in decision making and improving
software processes.

FIGURE 1.1
Steps in empirical studies: formation of research questions, hypothesis formation, data collection,
data analysis, model development and validation, and concluding results.

The empirical studies involve the following steps (Figure 1.1):

• Formation of research questions
• Formation of a research hypothesis
• Gathering data
• Analyzing the data
• Developing and validating models
• Deriving conclusions from the obtained results

Empirical study allows to gather evidence that can be used to support the claims of
efficiency of a given technique or technology. Thus, empirical studies help in build-
ing a body of knowledge so that the processes and products are improved resulting in
high-quality software.
Empirical studies are of many types, including surveys, systematic reviews, experi-
ments, and case studies.

1.3 Types of Empirical Studies


The studies can be broadly classified as quantitative and qualitative. Quantitative research
is the most widely used scientific method in software engineering; it applies mathematical
or statistical methods to derive conclusions. Quantitative research is used to prove
or disprove a hypothesis (a concept that has to be tested for further investigation). The aim
of a quantitative research is to generate results that are generalizable and unbiased and
thus can be applied to a larger population in research. It uses statistical methods to vali-
date a hypothesis and to explore causal relationships.
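For illustration, a minimal sketch of how such a relationship might be explored quantitatively is given below; the coupling values, defect counts, and the use of SciPy are hypothetical assumptions, not prescriptions of this chapter.

```python
# Illustrative sketch: exploring the relationship between a software metric
# (coupling) and the number of defects per module. All values are hypothetical.
from scipy import stats

coupling = [2, 9, 4, 11, 3, 8, 1, 10, 5, 12]
defects = [0, 4, 1, 6, 1, 3, 0, 5, 2, 7]

# The Pearson correlation coefficient quantifies the strength of the
# linear relationship between the metric and the defect counts.
r, p_value = stats.pearsonr(coupling, defects)
print(f"Pearson correlation r = {r:.2f}, p-value = {p_value:.4f}")
```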


In qualitative research, the researchers study human behavior, preferences, and nature.
Qualitative research provides an in-depth analysis of the concept under investigation
and thus uses focused data for research. Understanding a new process or technique in
software engineering is an example of qualitative research. Qualitative research provides
textual descriptions or pictures related to human beliefs or behavior. It can be extended
to other studies with similar populations but generalizations of a particular phenomenon
may be difficult. Qualitative research involves methods such as observations, interviews,
and group discussions. This method is widely used in case studies.
Qualitative research can be used to analyze and interpret the meaning of results produced
by quantitative research. Quantitative research generates numerical data for analysis,
whereas qualitative research generates non-numerical data (Creswell 1994). The data of
qualitative research is quite rich as compared to quantitative data. Table 1.1 summarizes
the key differences between quantitative and qualitative research.
The empirical studies can be further categorized as experimental, case study, systematic
review, survey, and post-mortem analysis. These categories are explained in the next sec-
tion. Figure 1.2 presents the quantitative and qualitative types of empirical studies.

1.3.1 Experiment
An experimental study tests the established hypothesis by finding the effect of variables of
interest (independent variables) on the outcome variable (dependent variable) using statis-
tical analysis. If the experiment is carried out correctly, the hypothesis is either accepted or
rejected. For example, if one group uses technique A and the other group uses technique B,
which technique is more effective in detecting a larger number of defects? The researcher
may apply statistical tests to answer such questions. According to Kitchenham et al. (1995),
the experiments are small scale and must be controlled. The experiment must also con-
trol the confounding variables, which may affect the accuracy of the results produced by
the experiment. The experiments are carried out in a controlled environment and often
referred to as controlled experiments (Wohlin 2012).
The key factors involved in the experiments are independent variables, dependent vari-
ables, hypothesis, and statistical techniques. The basic steps followed in experimental

TABLE 1.1
Comparison of Quantitative and Qualitative Research

Aspect    | Quantitative Research                          | Qualitative Research
General   | Objective                                      | Subjective
Concept   | Tests theory                                   | Forms theory
Focus     | Testing a hypothesis                           | Examining the depth of a phenomenon
Data type | Numerical                                      | Textual or pictorial
Group     | Large and random                               | Small
Purpose   | Predict causal relationships                   | Describe and interpret concepts
Basis     | Based on hypothesis                            | Based on concept or theory
Method    | Confirmatory: established hypothesis is tested | Exploratory: new hypothesis is formed
Variables | Variables are defined by the researchers       | Variables may emerge unexpectedly
Settings  | Controlled                                     | Flexible
Results   | Generalizable                                  | Specialized

FIGURE 1.2
Types of empirical studies: experiment, survey research, systematic reviews, postmortem analysis, and case studies, grouped into quantitative and qualitative studies.

FIGURE 1.3
Steps in experimental research: experiment definition, experiment design, experiment conduct and analysis, experiment interpretation, and experiment reporting.

research are shown in Figure 1.3. The same steps are followed in any empirical study
process; however, the content varies according to the specific study being carried out. In
the first phase, the experiment is defined. The next phase involves determining the experiment
design. In the third phase, the experiment is executed as per the experiment design. Then,
the results are interpreted. Finally, the results are presented in the form of an experiment
report. To carry out an empirical study, a replicated study (repeating a study with similar
settings or methods but different data sets or subjects), or to perform a survey of existing
empirical studies, the research methodology followed in these studies needs to be formu-
lated and described.
A controlled experiment involves varying one or more variables while keeping everything
else constant, and is usually conducted in a small or laboratory setting
(Conradi and Wang 2003). Comparing two methods for defect detection is an example of a
controlled experiment in the software engineering context.
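As a hedged illustration of such a controlled comparison, the sketch below contrasts defect counts obtained by two hypothetical groups, one using technique A and the other technique B; the data values and the choice of the Mann-Whitney U test are illustrative assumptions only.

```python
# Illustrative sketch: do techniques A and B differ in the number of defects detected?
from scipy import stats

defects_technique_a = [12, 15, 9, 14, 11, 13, 10, 16]  # hypothetical observations
defects_technique_b = [8, 10, 7, 11, 9, 6, 10, 8]      # hypothetical observations

# Null hypothesis: both techniques detect the same number of defects.
# A non-parametric test avoids assuming normally distributed counts.
statistic, p_value = stats.mannwhitneyu(defects_technique_a,
                                        defects_technique_b,
                                        alternative="two-sided")
print(f"Mann-Whitney U = {statistic:.1f}, p-value = {p_value:.4f}")

if p_value < 0.05:  # significance level chosen by the researcher
    print("Reject the null hypothesis: the techniques differ.")
else:
    print("The null hypothesis cannot be rejected.")
```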

1.3.2 Case Study


Case study research represents real-world scenarios and involves studying a particular
phenomenon (Yin 2002). Case study research allows software industries to evaluate a tool,
method, or process (Kitchenham et al. 1995). The effect of a change in an organization


can be studied using case study research. Case studies increase the understanding of the
phenomenon under study. For example, a case study can be used to examine whether a
unified modeling language (UML) tool is effective for a given project or not. The initial and
new concepts are analyzed and explored by exploratory case studies, whereas the already
existing concepts are tested and improved by confirmatory case studies.
The phases included in the case study are presented in Figure 1.4. The case study
design phase involves identifying existing objectives, cases, research questions, and
data-collection strategies. The case may be a tool, technology, technique, process, product,
individual, or software. Qualitative data is usually collected in a case study. The sources
include interviews, group discussions, or observations. The data may be directly or indi-
rectly collected from participants. Finally, the case study is executed, the results obtained
are analyzed, and the findings are reported. The report type may vary according to the
target audience.
Case studies are appropriate where a phenomenon is to be studied for a longer period
of time so that the effects of the phenomenon can be observed. The disadvantages of case
studies include difficulty in generalization, as they represent a specific situation. Since they
are based on a particular case, the validity of the results is questionable.

1.3.3 Survey Research


Survey research identifies features or information from a large sample of a population. For
example, surveys can be used when a researcher wants to know whether the use of a par-
ticular process has improved the view of clients toward software usability features. This
information can be obtained by asking the selected software testers to fill questionnaires.
Surveys are usually conducted using questionnaires and interviews. The questionnaires
are constructed to collect research-related information.
Preparation of a questionnaire is an important activity and should take into consid-
eration the features of the research. The effective way to obtain a participant’s opinion
is to get a questionnaire or survey filled by the participant. The participant’s feedback
and reactions are recorded in the questionnaire (Singh 2011). The questionnaire/survey
can be used to detect trends and may provide valuable information and feedback on a
particular process, technique, or tool. The questionnaire/survey must include questions
concerning the participant’s likes and dislikes about a particular process, technique, or
tool. The interviewer should preferably handle the questionnaire.
Surveys are classified into three types (Babbie 1990)—descriptive, explorative, and
explanatory. Exploratory surveys focus on the discovery of new ideas and insights and are
usually conducted at the beginning of a research study to gather initial information. The
descriptive survey research is more detailed and describes a concept or topic. Explanatory
survey research tries to explain how things work in terms of cause and effect; that is, the
researcher wants to explain how things interact with each other. For example, while exploring
the relationship between various independent variables and an outcome variable, a researcher
may want to explain why an independent variable affects the outcome variable.

FIGURE 1.4
Case study phases: case study design, data collection, execution of case study, data analysis, and reporting.

1.3.4 Systematic Reviews


While conducting any study, literature review is an important part that examines
the existing position of literature in an area in which the research is being conducted.
The systematic reviews are methodically undertaken with a specific search strategy and
well-defined methodology to answer a set of questions. The aim of a systematic review is
to analyze, assess, and interpret existing results of research to answer research questions.
Kitchenham (2007) defines systematic review as:
A form of secondary study that uses a well-defined methodology to identify, analyze
and interpret all available evidence related to a specific research question in a way that
is unbiased and (to a degree) repeatable.

The purpose of a systematic review is to summarize the existing research and provide
future guidelines for research by identifying gaps in the existing literature. A systematic
review involves:

1. Defining research questions.
2. Forming and documenting a search strategy.
3. Determining inclusion and exclusion criteria.
4. Establishing quality assessment criteria.

The systematic reviews are performed in three phases: planning the review, conducting
the review, and reporting the results of the review. Figure 1.5 presents the summary of the
phases involved in systematic reviews.
In the planning stage, the review protocol is developed that includes the following
steps: research questions identification, development of review protocol, and evaluation
of review protocol. During the development of review protocol the basic processes in
the review are planned. The research questions are formed that address the issues to be
answered in the systematic literature review.

FIGURE 1.5
Phases of systematic review: planning (need for review, research questions, development of review protocol, evaluation of review protocol), conducting (search strategy execution, quality assessment, data extraction, data synthesis), and reporting (documenting the results).

The development of review protocol involves
planning a series of steps—search strategy design, study selection criteria, study quality
assessment, data extraction process, and data synthesis process. In the first step, the search
strategy is described that includes identification of search terms and selection of sources to
be searched to identify the primary studies. The second step determines the inclusion and
exclusion criteria for each primary study. In the next step, the quality assessment criterion
is identified by forming the quality assessment questionnaire to analyze and assess the
studies. The second to last step involves the design of data extraction forms to collect the
required information to answer the research questions, and, in the last step, data synthesis
process is defined. The above series of steps is executed during the conducting phase of
the review. In the final phase, the results are documented. Chapter 2 provides details of
systematic reviews.

1.3.5 Postmortem Analysis


Postmortem analysis is carried out after an activity or a project has been completed.
The main aim is to detect how the activities or processes can be improved in the future.
The postmortem analysis captures knowledge from the past, after the activity has been
completed. Postmortem analysis can be classified into two types: general postmortem
analysis and focused postmortem analysis. General postmortem analysis collects all avail-
able information from a completed activity, whereas focused postmortem analysis collects
information about a specific activity, such as effort estimation (Birk et al. 2002).
According to Birk et al., in postmortem analysis, large software systems are analyzed
to gain knowledge about the good and bad practices of the past. Techniques such as
interviews and group discussions can be used for collecting data in postmortem analysis.
In the analysis process, feedback sessions are conducted in which the participants are
asked whether the information they provided has been understood correctly (Birk et al. 2002).

1.4 Empirical Study Process


Before describing the steps involved in the empirical research process, it is important to dis-
tinguish between empirical and experimental approaches as they are often used interchange-
ably but are slightly different from each other. Harman et al. (2012a) make a distinction
between experimental and empirical approaches in software engineering. In experimental
software engineering, the dependent variable is closely observed in a controlled environment.
Empirical studies are used to define anything related to observation and experience and are
valuable as these studies consider real-world data. In experimental studies, data is artificial
or synthetic but is more controlled. For example, using 5000 machine-generated instances is
an experimental study, and using 20 real-world programs in the study is an empirical study
(Meyer et al. 2013). Hence, any experimental approach, under controlled environments, allows
the researcher to remove the research bias and confounding effects (Harman et al. 2012a).
Both empirical and experimental approaches can be combined in the studies.
Without a sound and proven research process, it is difficult to carry out efficient and
effective research. Thus, a research methodology must be complete and repeatable, which,
when followed, in a replicated or empirical study, will enable comparisons to be made
across various studies. Figure 1.6 depicts the five phases in the empirical study process.
These phases are discussed in the subsequent subsections.
FIGURE 1.6
Empirical study phases: study definition (scope, purpose, motivation, context); experiment design (research questions, hypothesis formulation, defining variables, data collection, selection of data analysis methods, validity threats); research conduct and analysis (descriptive statistics, attribute reduction, statistical analysis, model prediction and validation, hypothesis testing); results interpretation (theoretical and practical significance of results, limitations of the work); and reporting (presenting the results).

1.4.1 Study Definition


The first step involves the definition of the goals and objectives of the empirical study.
The aim of the study is explained in this step. Basili et al. (1986) suggest dividing the
defining phase into the following parts:

• Scope: What are the dimensions of the study?
• Motivation: Why is it being studied?
• Object: What entity is being studied?
• Purpose: What is the aim of the study?
• Perspective: From whose view is the study being conducted (e.g., project manager, customer)?
• Domain: What is the area of the study?

The scope of the empirical study defines the extent of the investigation. It involves listing
down the specific goals and objectives of the experiment. The purpose of the study may be
to find the effect of a set of variables on the outcome variable or to prove that technique A
is superior to technique B. It also involves identifying the underlying hypothesis that is
formulated at later stages. The motivation of the experiment describes the reason for con-
ducting the study. For example, the motivation of the empirical study is to analyze and
assess the capability of a technique or method. The object of the study is the entity being
examined in the study. The entity in the study may be the process, product, or technique.
Perspective defines the view from which the study is conducted. For example, if the study
is conducted from the tester’s point of view then the tester will be interested in planning
and allocating resources to test faulty portions of the source code. Two important domains
in the study are programmers and programs (Basili et al. 1986).

1.4.2 Experiment Design


This is the most important and significant phase in the empirical study process. The design of
the experiment covers stating the research questions, formation of the hypothesis, selection
of variables, data-collection strategies, and selection of data analysis methods. The context
of the study is defined in this phase. Thus, the sources (university/academic, industrial, or
open source) from which the data will be collected are identified. The data-collection pro-
cess must be well defined and the characteristics of the data must be stated. For example,
nature, programming language, size, and so on must be provided. The outcome variables
are to be carefully selected such that the objectives of the research are justified. The aim of
the design phase should be to select methods and techniques that promote replicability and
reduce experiment bias (Pfleeger 1995). Hence, the techniques used must be clearly defined
and the settings should be stated so that the results can be replicated and adopted by the
industry. The following are the steps carried out during the design phase:

1. Research questions: The first step is to formulate the research problem. This step states
the problem in the form of questions and identifies the main concepts and relations
to be explored. For example, the following questions may be addressed in empirical
studies to find the relationship between software metrics and quality attributes:
a. What will be the effect of software metrics on quality attributes (such as fault
proneness/testing effort/maintenance effort) of a class?
b. Are machine-learning methods adaptable to object-oriented systems for pre-
dicting quality attributes?
c. What will be the effect of software metrics on fault proneness when severity of
faults is taken into account?
2. Independent and dependent variables: To analyze relationships, the next step is to
define the dependent and the independent variables. The outcome variable pre-
dicted by the independent variables is called the dependent variable. For instance,
the dependent variables of the models chosen for analysis may be fault proneness,
testing effort, and maintenance effort. A variable used to predict or estimate a
dependent variable is called the independent (explanatory) variable.
3. Hypothesis formulation: The researcher should carefully state the hypothesis to
be tested in the study. The hypothesis is tested on the sample data. On the basis
of the result from the sample, a decision concerning the validity of the hypothesis
(acception or rejection) is made.
Consider an example where a hypothesis is to be formed for comparing a num-
ber of methods for predicting fault-prone classes.
For each method, M, the hypothesis in a given study is the following (the
relevant null hypothesis is given in parentheses), where the capital H indicates
“hypothesis.” For example:
H–M: M outperforms the compared methods for predicting fault-prone software classes
(null hypothesis: M does not outperform the compared methods for predicting fault-prone
software classes). An illustrative sketch of testing such a hypothesis is given after this list.

4. Empirical data collection: The researcher decides the sources from which the
data is to be collected. It is found from literature that the data collected is either
from university/academic systems, commercial systems, or open source software.
The researcher should state the environment in which the study is performed,

programming language in which the systems are developed, size of the systems
to be analyzed (lines of code [LOC] and number of classes), and the duration for
which the system is developed.
5. Empirical methods: The data analysis techniques are selected based on the type
of the dependent variables used. An appropriate data analysis technique should
be selected by identifying its strengths and weaknesses. For example, a number of
techniques have been available for developing models to predict and analyze soft-
ware quality attributes. These techniques could be statistical like linear regression
and logistic regression or machine-learning techniques like decision trees, support
vector machines, and so on. Apart from these techniques, there is a newer set of
techniques, such as particle swarm optimization and gene expression programming,
that are called search-based techniques. The details of these techniques
can be found in Chapter 7.
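As referenced in the hypothesis formulation step above, the following sketch illustrates how hypothesis H-M might be tested once a performance value of method M and of a compared method is available for several data sets; the AUC values and the choice of the Wilcoxon signed-rank test are hypothetical assumptions, not requirements of this chapter.

```python
# Illustrative sketch of testing H-M: does method M outperform a compared
# method for predicting fault-prone classes? AUC values per data set are hypothetical.
from scipy import stats

auc_method_m = [0.81, 0.76, 0.88, 0.79, 0.84, 0.73, 0.90, 0.77]
auc_compared = [0.74, 0.72, 0.83, 0.78, 0.80, 0.70, 0.85, 0.75]

# A paired, non-parametric test across the same data sets.
statistic, p_value = stats.wilcoxon(auc_method_m, auc_compared)
print(f"Wilcoxon statistic = {statistic}, p-value = {p_value:.4f}")
# The null hypothesis (no difference) is rejected when the p-value is
# smaller than the chosen significance level (e.g., 0.05).
```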

In the empirical study, the data is analyzed corresponding to the details given in the
experimental design. Thus, the experimental design phase must be carefully planned and
executed so that the analysis phase is clear and unambiguous. If the design phase does not
match the analysis part then it is most likely that the results produced are incorrect.

1.4.3 Research Conduct and Analysis


Finally, the empirical study is conducted following the steps described in the experiment
design. The experiment analysis phase involves understanding the data by collecting
descriptive statistics. The unrelated attributes are removed, and the best attributes (vari-
ables) are selected out of a set of attributes (e.g., software metrics) using attribute reduction
techniques. After removing irrelevant attributes, hypothesis testing is performed using
statistical tests and, on the basis of the result obtained, a decision regarding the accep-
tance or rebuttal of the hypothesis is made. The statistical tests are described in Chapter 6.
Finally, for analyzing the causal relationships between the independent variables and the
dependent variable, the model is developed and validated. The steps involved in experi-
ment conduct and analysis are briefly described below.

1. Descriptive statistics: The data is validated for correctness before carrying out the
analysis. The first step in the analysis is descriptive statistics. The research data
must be suitably reduced so that the research data can be read easily and can be
used for further analysis. Descriptive statistics concern development of certain
indices or measures to summarize the data. The important statistics measures used
for comparing different case studies include mean, median, and standard devia-
tion. The data analysis methods are selected based on the type of the dependent
variable being used. Statistical tests can be applied to accept or refute a hypothesis.
Significance tests are performed for comparing the predicted performance of a
method with other sets of methods. Moreover, effective data assessment should
also yield outliers (Aggarwal et al. 2009).
2. Attribute reduction: Feature subset selection is an important step that identifies
and removes as much of the irrelevant and redundant information as possible.
Reducing the dimensionality of the data reduces the size of the hypothesis space and allows
the methods to operate faster and more effectively (Hall 2000).
3. Statistical analysis: The data collected can be analyzed using statistical analysis by
following the steps below.

a. Model prediction: The multivariate analysis is used for the model prediction.
Multivariate analysis is used to find the combined effect of each indepen-
dent variable on the dependent variable. Based on the results of performance
measures, the performance of models predicted is evaluated and the results
are interpreted. Chapter 7 describes these performance measures.
b. Model validation: In systems, where models are independently constructed from
the training data (such as in data mining), the process of constructing the model is
called training. The subsamples of data that are used to validate the initial analy-
sis (by acting as “blind” data) are called validation data or test data. The valida-
tion data is used for validating the model predicted in the previous step.
c. Hypothesis testing: It determines whether the null hypothesis can be rejected at
a specified significance level. The significance level is determined by the researcher
and is usually 0.01 or 0.05 (refer Section 4.7 for details). An illustrative sketch of these
analysis steps is given after this list.
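The sketch below illustrates, on assumed data, the analysis steps listed above: descriptive statistics, attribute reduction, and model prediction with validation. The metric names, data values, and the use of pandas and scikit-learn are hypothetical choices made for illustration.

```python
# Illustrative sketch of the analysis steps: descriptive statistics,
# attribute reduction, and cross-validated model prediction. Data is hypothetical.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.DataFrame({
    "coupling": [2, 9, 4, 11, 3, 8, 1, 10, 5, 12],
    "cohesion": [7, 3, 6, 2, 8, 4, 9, 3, 6, 2],
    "loc":      [120, 540, 200, 760, 150, 430, 90, 610, 260, 820],
    "faulty":   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Descriptive statistics: mean, standard deviation, quartiles, and so on.
print(data.describe())

# 2. Attribute reduction: keep the two metrics most related to fault proneness.
X, y = data.drop(columns="faulty"), data["faulty"]
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = X.columns[selector.get_support()]
print("Selected attributes:", list(selected))

# 3. Model prediction and validation: cross-validated logistic regression.
scores = cross_val_score(LogisticRegression(), X[selected], y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```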

1.4.4 Results Interpretation


In this step, the results computed in the empirical study’s analysis phase are assessed
and discussed. The reason behind the acceptance or rejection of the hypothesis is exam-
ined. This process provides insight to the researchers about the actual reasons behind the
decision made about the hypothesis. The conclusions are derived from the results obtained in
the study. The significance and practical relevance of the results are defined in this phase.
The limitations of the study are also reported in the form of threats to validity.

1.4.5 Reporting
Finally, after the empirical study has been conducted and interpreted, the study is reported
in the desired format. The results of the study can be disseminated in the form of a confer-
ence article, a journal paper, or a technical report.
The results are to be reported from the reader’s perspective. Thus, the background,
motivation, analysis, design, results, and the discussion of the results must be clearly
documented. The audience may want to replicate or repeat the results of a study in a simi-
lar context. The experiment settings, data-collection methods, and design processes must
be reported in significant level of detail. For example, the descriptive statistics, statistical
tools, and parameter settings of techniques must be provided. In addition, graphical repre-
sentation should be used to represent the results. The results may be graphically presented
using pie charts, line graphs, box plots, and scatter plots.

1.4.6 Characteristics of a Good Empirical Study


The characteristics of a good empirical study are as follows:

1. Clear: The research goals, hypothesis, and data-collection procedure must be clearly stated.
2. Descriptive: The research should provide details of the experiment so that the
study can be repeated and replicated in similar settings.
3. Precise: Precision helps to provide confidence in the data. It represents the degree of
measure correctness and data exactness. High precision is necessary to specify the
attributes in detail.

4. Valid: The experiment conclusions should be valid for a wide range of populations.
5. Unbiased: The researcher performing the study should not influence the results to sat-
isfy the hypothesis. The research may produce some bias because of experiment error.
The bias may be produced when the researcher selects the participants such that they
generate the desired results. The measurement bias may occur during data collection.
6. Control: The experiment design should be able to control the independent variables
so that the confounding effects (interaction effects) of variables can be reduced.
7. Replicable: Replication involves repeating the experiment with different data
under same experimental conditions. If the replication is successful then this indi-
cates generalizability and validity of the results.
8. Repeatable: The experimenter should be able to reproduce the results of the study
under similar settings.

1.5 Ethics of Empirical Research


Researchers, academicians, and sponsors should be aware of research ethics while conducting
and funding empirical research in software engineering. The upholding of ethical stan-
dards helps to develop trust between the researcher and the participant, and thus smooth-
ens the research process. An unethical study can harm the reputation of the research
conducted in the software engineering area.
Some ethical issues are regulated by the standards and laws provided by the govern-
ment. In some countries like the United States, the sponsoring agency requires that the
research involving participants must be reviewed by a third-party ethics committee to
verify that the research complies with the ethical principles and standards (Singer and
Vinson 2001). Because empirical research is based on trust between the participant and the
researcher, ethical information must be explicitly provided to the participants to avoid
any future conflicts. The participants must be informed about the risks and ethical issues
involved in the research at the beginning of the study. The examples of problems related
to ethics that are experienced in industry are given by Becker-Kornstaedt (2001) and
summarized in Table 1.2.

TABLE 1.2
Examples of Unethical Research

1. An employee misleading the manager to protect himself or herself, with the knowledge of the researcher
2. Nonconformance to a mandatory process
3. Revealing the identities of the participant or organization
4. A manager unexpectedly joining a group interview or discussion with the participant
5. An experiment revealing the identity of the participants of a nonperforming department in an organization
6. Experiment outcomes being used in employee ratings
7. Participants providing information off the record, that is, after the interview or discussion is over

The ethical threats presented in Table 1.2 can be reduced by (1) presenting data and
results such that no information about the participant and the organization is revealed,
(2) presenting different reports to stakeholders, (3) providing findings to the participants
and giving them the right to withdraw any time during the research, and (4) providing
publication to companies for review before being published. Singer and Vinson (2001)
noted that general engineering and science ethics may not directly apply to empirical research
in software engineering. They provided the following four ethical principles:

1. Informed consent: This principle is concerned with subjects participating in the
experiment. The subjects or participants must be provided all the relevant infor-
mation related to the experiment or study. The participants must willingly agree
to participate in the research process. The consent form acts as a contract between
an individual participant and the researcher.
2. Scientific value: This principle states that the research results must be correct and
valid. This issue is critical if the researchers are not familiar with the technology or
methodology they are using as it will produce results of no scientific value.
3. Confidentiality: It refers to anonymity of data, participants, and organizations.
4. Beneficence: The research must provide maximum benefits to the participants and
protect the interests of the participants. The benefits of the organization must also
be protected by not revealing the weak processes and procedures being followed
in the departments of the organization.

1.5.1 Informed Consent


Informed consent consists of five elements—disclosure, comprehension, voluntariness,
consent, and right to withdraw. Disclosure means to provide all relevant details about
the research to the participants. This information includes risks and benefits incurred by
the participants. Comprehension refers to presenting information in such a manner that
can be understood by the participant. Voluntariness specifies that the consent obtained
must not be under any pressure or influence and actual consent must be taken. Finally, the
subjects must have the right to withdraw from research process at any time. The consent
form has the following format (Vinson and Singer 2008):

1. Research title: The title of the project must be included in the consent form.
2. Contact details: The contact details (including ethics contact) will provide the
participant information about whom to contact to clarify any questions or issues
or complaints.
3. Consent and comprehension: The participant actually gives consent in this section,
stating that they have understood the requirements of the research.
4. Withdrawal: This section states that the participants can withdraw from the
research without any penalty.
5. Confidentiality: It states the confidentiality related to the research study.
6. Risks and benefits: This section states the risks and benefits of the research to the
participants.
7. Clarification: The participants can ask for any further clarification at any time
during the research.
8. Signature: Finally, the participant signs the consent form with the date.

1.5.2 Scientific Value


This ethical issue is concerned with two aspects—relevance of research topic and valid-
ity of research results. The research must balance between risks and benefits. In fact, the
advantages of the research should outweigh the risks. The results of the research must also
be valid. If they are not valid then the results are incorrect and the study has no value to
the research community. The reason for invalid results is usually misuse of methodology,
application, or tool. Hence, the researchers should not conduct the research for which they
are not capable or competent.

1.5.3 Confidentiality
The information shared by the participants should be kept confidential. The researcher
should hide the identity of the organization and participant. Vinson and Singer (2008) iden-
tified three features of confidentiality—data privacy, participant anonymity, and data ano-
nymity. The data collected must be protected by password and only the people involved
in the research should have access to it. The data should not reveal the information about
the participant. The researchers should not collect personal information of participant. For
example, participant identity must be used instead of the participant name. The partici-
pant information hiding is achieved by hiding information from colleagues, professors,
and general public. Hiding information from the manager is particularly essential as it
may affect the career of the participants. The information must be also hidden from the
organization’s competitors.

1.5.4 Beneficence
The participants must be benefited by the research. Hence, methods that protect the inter-
est of the participants and do not harm them must be adopted. The research must not pose
a threat to the participants’ jobs, for example, by creating an employee-ranking framework.
The revealing of an organization’s sensitive information may also bring loss to the company
in terms of reputation and clients. For example, if the names of companies are revealed in
the publication, the comparison between the processes followed in the companies or poten-
tial flaws in the processes followed may affect obtaining contracts from the clients. If the
research involves analyzing the process of the organization, the outcome of the research or
facts revealed from the research can harm the participants to a significant level.

1.5.5 Ethics and Open Source Software


In the absence of empirical data, data and source code from open source software are
being widely used for analysis in research. This poses concerns of ethics, as the open
source software is not primarily developed for research purposes. El-Emam (2001) raised
two important ethical issues while using open source software namely “informed consent
and minimization of harm and confidentiality.” Conducting studies that rate the developers
or compare two open source software systems may harm the developers’ reputation or the
company’s reputation (El-Emam 2001).

1.5.6 Concluding Remarks


The researcher must maintain the ethics in the research by careful planning and, if
required, consulting ethical bodies that have expertise for guiding them on ethical issues
in software engineering empirical research. The main aim of the research involving
participants must be to protect the interests of the participants so that they are protected
from any harm. Becker-Kornstaedt (2001) suggests that the participant interests can be
protected by using techniques such as manipulating data, providing different reports to
different stakeholders, and providing the right to withdraw to the participants.
Finally, feedback of the research results must be provided to the participants. The opin-
ion of the participants about the validity of the results must also be asked. This will help
in increasing the trust between the researcher and the participant.

1.6 Importance of Empirical Research


Why should empirical studies in software engineering be carried out? The main reason for
carrying out an empirical study is to reduce the gap between theory and practice by using
statistical tests to test the formed hypothesis. This will help in analyzing, assessing, and
improving the processes and procedures of software development. It may also provide
guidelines to management for decision making. Thus, without evaluating and assessing
new methods, tools, and techniques, their use will be random and effectiveness will be
uncertain. The empirical study is useful to researchers, academicians, and the software
industry from different perspectives.

1.6.1 Software Industry


The results of ESE must be adopted by the industry. ESE can be used to answer the ques-
tions related to practices in industry and can improve the processes and procedures of
software development. To match the requirements of the industry, the researcher must ask
the following questions while conducting research:

• How does the research aim map to industrial problems?
• How can the software practitioners use the research results?
• What are the important problems in the industry?

The predictive models constructed in ESE can be applied to future, similar industrial
applications. The empirical research enables software practitioners to use the results of the
experiment and ascertain that a set of good processes and procedures are followed dur-
ing software development. Thus, the empirical study can guide toward determining the
quality of the resultant software products and processes. For example, a new technique or
technology can be evaluated and assessed. The empirical study can help the software pro-
fessionals in effectively planning and allocating resources in the initial phases of software
development life cycle.

1.6.2 Academicians
While studying or conducting research, academicians are always curious to answer ques-
tions that are foremost in their minds. As the academicians dig deeper into their subject
or research, the questions tend to become more complex. Empirical research empowers
them with a great tool to find an answer by asking or interviewing different stakeholders,
by conducting a survey, or by conducting a scientific experiment. Academicians gener-


ally make predictions that can be stated in the form of hypotheses. These hypotheses
need to be subjected to robust scientific verification or approval. With empirical research,
these hypotheses can be tested, and their results can be stated as either being accepted or
rejected. Thereafter, based on the result, the academicians can make some generalization
or make a conclusion about a particular theory. In other words, a new theory can be gen-
erated and some old ones may be disproved. Additionally, an academician sometimes
encounters practical questions, and empirical research can be highly beneficial
in solving them. For example, an academician working in a university may want to find
out the most efficient learning approach that yields the best performance among a group
of students. The results of the research can be included in the course curricula.
From the academic point of view, high-quality teaching is important for future soft-
ware engineers. Empirical research can provide management with important infor-
mation about the use of tools and techniques. The students will further carry forward
the knowledge to the software industry and thus improve the industrial practices. The
empirical result can support one technique over the other and hence will be very useful
in comparing the techniques.

1.6.3 Researchers
From the researchers’ point of view, the results can be used to provide insight about exist-
ing trends and guidelines regarding future research. The empirical study can be repeated
or replicated by the researcher in order to establish generalizability of the results to new
subjects or data sets.

1.7 Basic Elements of Empirical Research


The basic elements in empirical research are purpose, participants, process, and product.
Figure 1.7 presents the four basic elements of empirical research. The purpose defines
the reason for the research, the relevance of the topic, specific aims in the form of research
questions, and objectives of the research.

FIGURE 1.7
Elements of empirical research: purpose, participants, process, and product.

Process lays down the way in which the research will be conducted. It defines the
sequence of steps taken to conduct a research. It provides details about the techniques,
methodologies, and procedures to be used in the research. The data-collection steps,
variables involved, techniques applied, and limitations of the study are defined in this
step. The process should be followed systematically to produce successful research.
Participants are the subjects involved in the research. The participants may be inter-
viewed or closely observed to obtain the research results. While dealing with participants,
ethical issues in ESE must be considered so that the participants are not harmed in any
way.
Product is the outcome produced by the research. The final outcome provides the
answer to research questions in the empirical research. The new technique developed or
methodology produced can also be considered as a product of the research. The journal
paper, conference article, technical report, thesis, and book chapters are products of the
research.

1.8 Some Terminologies


Some terminologies that are frequently used in the empirical research in software engi-
neering are discussed in this section.

1.8.1 Software Quality and Software Evolution


Software quality determines how well the software is designed (quality of design), and
how well the software conforms to that design (quality of conformance).
In a software project, most of the cost is consumed in making changes rather than devel-
oping the software. Software evolution (maintenance) involves making changes in the
software. Changes are required because of the following reasons:

1. Defects reported by the customer
2. New functionality requested by the customer
3. Improvement in the quality of the software
4. To adapt to new platforms

The typical evolution process is depicted in Figure 1.8. The figure shows that a change
is requested by a stakeholder (anyone who is involved in the project) in the project. The
second step requires analyzing the cost of implementing the change and the impact of
the change on the related modules or components. It is the responsibility of an expert
group known as the change control board (CCB) to determine whether the change must be
implemented or not. On the basis of the outcome of the analysis, the CCB approves or dis-
approves a change. If the change is approved, then the developers implement the change.
Finally, the change and the portions affected by the change are tested and a new version of
the software is released. The process of continuously changing the software may decrease
the quality of the software.
The main concerns during the evolution phase are maintaining the flexibility and quality
of the software.

FIGURE 1.8
Software evolution cycle: request change → analyze change → approve/deny → implement change → test change.

Predicting defects, changes, efforts, and costs in the evolution phase
is an important area of software engineering research. An effective prediction can lead to


decreasing the cost of maintenance by a large extent. This will also lead to high-quality
software and hence increasing the modifiability aspect of the software. Change prediction
concerns itself with predicting the portions of the software that are prone to changes and
will thus add up to the maintenance costs of the software. Figure 1.9 shows the various
research avenues in the area of software evolution.
After the detection of the change and nonchange portions in a software, the software
developers can take various remedial actions that will reduce the probability of occur-
rence of changes in the later phases of software development and, consequently, the cost
will also reduce exponentially. The remedial steps may involve redesigning or refactoring
of modules so that fewer changes are encountered in the maintenance phase. For example,
a high value of the coupling metric may be the reason for the change proneness of a given module,
implying that the module in question is highly interdependent on other modules.
Thus, the module should be redesigned to improve the quality and reduce its probability
to be change prone. Similar design corrective actions or other measures can be easily taken
once the software professional detects the change-prone portions in a software.

FIGURE 1.9
Prediction during the evolution phase:
• Defect prediction: What are the defect-prone portions in the maintenance phase?
• Change prediction: What are the change-prone portions in the software? How many change requests are expected?
• Maintenance cost prediction: What is the cost of maintaining the software over a period of time?
• Maintenance effort prediction: How much effort will be required to implement a change?

1.8.2 Software Quality Attributes


Software quality can be measured in terms of attributes. The attribute domains that need
to be defined for a given software system are as follows:

1. Functionality
2. Usability
3. Testability
4. Reliability
5. Maintainability
6. Adaptability

The attribute domains can be further divided into attributes that are related to software
quality and are given in Figure 1.10. The details of software quality attributes are given in
Table 1.3.

1.8.3 Measures, Measurements, and Metrics


The terms measures, measurements, and metrics are often used interchangeably. However,
we should understand the difference among these terms. Pressman (2005) explained this
clearly as:
A measure provides a quantitative indication of the extent, amount, dimension, capacity
or size of some attributes of a product or process. Measurement is the act of determin-
ing a measure. The metric is a quantitative measure of the degree to which a product or
process possesses a given attribute.

FIGURE 1.10
Software quality attributes: functionality (completeness, correctness, security, traceability, efficiency); usability (learnability, operability, user-friendliness, installability, satisfaction); testability (verifiability, validatable); reliability (robustness, recoverability); maintainability (agility, modifiability, readability, flexibility); and adaptability (portability, interoperability).

TABLE 1.3
Software Quality Attributes
Functionality: The degree to which the purpose of the software is satisfied
1 Completeness The degree to which the software is complete
2 Correctness The degree to which the software is correct
3 Security The degree to which the software is able to prevent unauthorized access to the
program data
4 Traceability The degree to which requirement is traceable to software design and source code
5 Efficiency The degree to which the software requires resources to perform a software
function

Usability: The degree to which the software is easy to use


1 Learnability The degree to which the software is easy to learn
2 Operability The degree to which the software is easy to operate
3 User-friendliness The degree to which the interfaces of the software are easy to use and understand
4 Installability The degree to which the software is easy to install
5 Satisfaction The degree to which the users feel satisfied with the software

Testability: The ease with which the software can be tested to demonstrate the faults
1 Verifiability The degree to which the software deliverable meets the specified standards,
procedures, and process
2 Validatable The ease with which the software can be executed to demonstrate whether the
established testing criteria are met

Maintainability: The ease with which the faults can be located and fixed, quality of the software can be
improved, or software can be modified in the maintenance phase
1 Agility The degree to which the software is quick to change or modify
2 Modifiability The degree to which the software is easy to implement, modify, and test in the
maintenance phase
3 Readability The degree to which the software documents and programs are easy to understand
so that the faults can be easily located and fixed in the maintenance phase
4 Flexibility The ease with which changes can be made in the software in the maintenance
phase

Adaptability: The degree to which the software is adaptable to different technologies and platforms
1 Portability The ease with which the software can be transferred from one platform to another
platform
2 Interoperability The degree to which the system is compatible with other systems

Reliability: The degree to which the software performs failure-free functions


1 Robustness The degree to which the software performs reasonably under unexpected
circumstances
2 Recoverability The speed with which the software recovers after the occurrence of a failure
Source: Y. Singh and R. Malhotra, Object-Oriented Software Engineering, PHI Learning, New Delhi, India, 2012.

For example, a measure is the number of failures experienced during testing. Measurement
is the way of recording such failures. A software metric may be the average number of
failures experienced per hour during testing.
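
As a hedged illustration of this distinction, the short Python sketch below (with invented failure counts) computes such a metric from raw measurements:

# Hypothetical failure counts recorded during five hours of testing (the measures)
failures_per_hour = [3, 1, 0, 2, 4]

total_failures = sum(failures_per_hour)                   # a measure (amount)
failure_rate = total_failures / len(failures_per_hour)    # a metric (failures per hour)

print(f"Total failures observed: {total_failures}")
print(f"Average failures per hour: {failure_rate:.2f}")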
Fenton and Pfleeger (1996) has defined measurement as:
It is the process by which numbers or symbols are assigned to attributes of entities in
the real world in such a way as to describe them according to clearly defined rules.

Software metrics can be defined as (Goodman 1993): “The continuous application of
measurement based techniques to the software development process and its products to
supply meaningful and timely management information, together with the use of those
techniques to improve that process and its products.”

1.8.4 Descriptive, Correlational, and Cause–Effect Research


Descriptive research provides a description of concepts. Correlational research determines the
relationship between two variables. Cause–effect research is similar to experimental research in
that the effect of one variable on another is determined.

1.8.5 Classification and Prediction


Classification predicts categorical outcome variables (ordinal or nominal). The training
data is used for model development, and the model can be used for predicting unknown
categories of outcome variables. For example, consider a model to classify modules as
faulty or not faulty on the basis of coupling and size of the modules. Figure 1.11 represents
this example in the form of a decision tree. The tree shows that if the coupling of modules
is <8 and the LOC is low then the module is not faulty.
In classification, the classification techniques take training data (comprising the
independent and the dependent variables) as input and generate rules or mathematical
formulas that are applied to validation data to verify the predicted model. The generated
rules or mathematical formulas are then applied to future data to predict the categories of the
outcome variables. Figure 1.12 depicts the classification process. Prediction is similar to
classification except that the outcome variable is continuous.
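
A minimal sketch of this classification process is shown below. It uses the scikit-learn library; the coupling and LOC values, labels, and parameter settings are invented for illustration only and are not taken from Figure 1.11 or any real data set.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [coupling, lines of code] for each module
X_train = [[3, 120], [5, 900], [10, 400], [12, 1500], [4, 250], [9, 1100]]
# Dependent variable: 1 = faulty, 0 = not faulty
y_train = [0, 0, 1, 1, 0, 1]

# Learn classification rules (a decision tree) from the training data
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Validation/new data: predict the unknown category of the outcome variable
X_new = [[7, 300], [11, 1300]]
print(model.predict(X_new))

The same fitted model would then be applied to future data, as depicted in Figure 1.12.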

1.8.6 Quantitative and Qualitative Data


Quantitative data is numeric, whereas qualitative data is textual or pictorial. Quantitative
data can either be discrete or continuous. Examples of quantitative data are LOC, num-
ber of faults, number of work hours, and so on. The information obtained by qualitative

Coupling? → if >8: Faulty; if <8: Lines of code → if Low: Not faulty; if High: Faulty

FIGURE 1.11
Example of classification process.

Training data → classification technique → generates rules → rules predict the outcome variable for validation data and new data

FIGURE 1.12
Steps in classification process.

analysis can be categorized by identifying patterns from the textual information. This can
be achieved by reading and analyzing texts and deriving logical categories. This will help
organize data in the form of categories. For example, answers to the following questions
are presented in the form of categories.

• What makes a good quality system?


User-friendliness, response time, reliability, security, recovery from failure
• How was the overall experience with the software?
Excellent, very good, good, average, poor, very poor

Text mining is another way to process qualitative data into useful form that can be used
for further analysis.
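
As a small, hedged sketch of turning such textual answers into counted categories (the answers below are invented), a simple frequency count may be used:

from collections import Counter

# Hypothetical answers to "What makes a good quality system?"
answers = [
    "user-friendliness", "reliability", "response time",
    "reliability", "security", "user-friendliness", "reliability",
]

# Count how often each derived category occurs in the qualitative data
category_counts = Counter(answers)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")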

1.8.7 Independent, Dependent, and Confounding Variables


Variables are measures that can be manipulated or varied in research. There are two types
of variables involved in cause–effect analysis, namely, the independent and the dependent
variables. They are also known as attributes or features in software engineering research.
Figure 1.13 shows that the experimental process analyzes the relationship between the
independent and the dependent variables. Independent variables (or predictor variables)

Causes (independent variables) → experiment process → effect (dependent variable)

FIGURE 1.13
Independent and dependent variables.


are input variables that are manipulated or controlled by the researcher to measure the
response of the dependent variable.
The dependent variable (or response variable) is the output produced by analyzing the
effect of the independent variables. The dependent variables are presumed to be influenced
by the independent variables. The independent variables are the causes and the depen-
dent variable is the effect. Usually, there is only one dependent variable in the research.
Figure 1.13 depicts that the independent variables are used to predict the outcome variable
following a systematic experimental process.
Examples of independent variables are lines of source code, number of methods, and
number of attributes. Dependent variables are usually measures of software quality attri-
butes. Examples of dependent variable are effort, cost, faults, and productivity. Consider
the following research question:
Do software metrics have an effect on the change proneness of a module?
Here, software metrics are the independent variables and change proneness is the
dependent variable.
Apart from the independent variables, unknown variables or confounding variables
(extraneous variables) may affect the outcome (dependent) variable. Randomization can
nullify the effect of confounding variables. In randomization, many replications of the
experiment are executed and the results are averaged over multiple runs, which may
cancel the effect of extraneous variables in the long run.
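
The idea that averaging over many replications can cancel the effect of an extraneous variable can be illustrated with a small simulation; the relationship and the noise term below are purely hypothetical:

import random

random.seed(42)

def run_experiment(coupling):
    # Dependent variable influenced by the independent variable (coupling)
    # plus a confounding effect modeled here as random noise
    confounding_effect = random.gauss(0, 5)
    return 2.0 * coupling + confounding_effect

# Replicate the experiment many times and average the results so that the
# random confounding effect tends to cancel out in the long run
replications = [run_experiment(coupling=10) for _ in range(1000)]
average_response = sum(replications) / len(replications)
print(f"Average response over replications: {average_response:.2f}")  # close to 20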

1.8.8 Proprietary, Open Source, and University Software


Data-based empirical studies that are capable of being verified by observation or experi-
ment are needed to provide relevant answers. In software engineering empirical research,
obtaining empirical data is difficult and is a major concern for researchers. The data
collected may be from university/academic software, open source software, or proprietary
software.
Undergraduate or graduate students at the university usually develop the university
software. To use this type of data, the researchers must ensure that the software is devel-
oped by following industrial practices and should document the process of software
development and empirical data collection in detail. For example, Aggarwal et al. (2009)
document the procedure of data collection as: “All students had at least six months experi-
ence with Java language, and thus they had the basic knowledge necessary for this study.
All the developed systems were taken of a similar level of complexity and all the develop-
ers were made sufficiently familiar with the application they were working on.” The study
provides a list of the coding standards that were followed by students while developing
the software and also provides details about the testing environment as given below by
Aggarwal et al. (2009):
The testing team was constituted under the guidance of senior faculty consisting of a
separate group of students who had the prior knowledge of system testing. They were
assigned the task of testing systems according to test plans and black-box testing tech-
niques. Each fault was reported back to the development team, since the development
environment was representative of real industry environment used in these days. Thus,
our results are likely to be generalizable to other development environments.

Open source software is usually freely available software, developed by many developers
from different places in a collaborative manner; examples include Google Chrome, the Android
operating system, and the Linux operating system.

Proprietary software is licensed software owned by a company; for example, Microsoft
Office, Adobe Acrobat, and IBM SPSS are proprietary software. In practice, obtaining data
from proprietary software for research validation is difficult as the software companies are
usually not willing to share information about their software systems.
The software developed by student programmers is generally small and developed
by a limited number of developers. If the decision is made to collect and use this type
of data in research, then guidelines similar to those given above must be followed to promote
unbiased and replicable results. These days, open source software repositories are being
mined to obtain research data for historical analysis.

1.8.9 Within-Company and Cross-Company Analysis


In within-company analysis, the empirical study collects the data from the old versions/
releases of the same software, develops prediction models, and applies these models to the
future versions of the same project. However, in practice, the old data may not be available.
In such cases, the data obtained from similar earlier projects developed by different
companies are used for prediction in new projects. The process of validating the predicted
model using data collected from projects other than those from which the model was derived
is known as cross-company analysis. For example, He et al. (2012) conducted a study to
assess the effectiveness of cross-project prediction for predicting defects. They built models
using data collected from different projects and applied those models to new projects.
Figure 1.14 shows that the model (M1) is developed using training data collected from
software A, release R1. The next release of software used model M1 to predict the outcome
variable. This process is known as within-company prediction, whereas in cross-company
prediction, data collected from another software B uses model M1 to predict the outcome
variable.

(a) Within-company prediction: training data from software A, release R1 → learning techniques → prediction model M1; test data from software A, release R2 → prediction model M1 → prediction results.
(b) Cross-company prediction: test data from software B, release R1 → prediction model M1 → prediction results.

FIGURE 1.14
(a) Within-company versus (b) cross-company prediction.
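
A rough sketch of the scheme summarized in Figure 1.14 is given below. It uses logistic regression from scikit-learn, and the metric values, labels, and release names are entirely synthetic placeholders:

from sklearn.linear_model import LogisticRegression

# Hypothetical metric data ([coupling, LOC]) and fault labels per release
software_A_R1_X, software_A_R1_y = [[2, 100], [9, 800], [4, 300], [12, 1200]], [0, 1, 0, 1]
software_A_R2_X, software_A_R2_y = [[3, 150], [10, 900]], [0, 1]   # next release of software A
software_B_R1_X, software_B_R1_y = [[5, 400], [11, 1000]], [0, 1]  # software from another company

# Model M1 trained on software A, release R1
m1 = LogisticRegression().fit(software_A_R1_X, software_A_R1_y)

# Within-company prediction: apply M1 to a later release of the same software
print("Within-company accuracy:", m1.score(software_A_R2_X, software_A_R2_y))

# Cross-company prediction: apply M1 to data from a different software/company
print("Cross-company accuracy:", m1.score(software_B_R1_X, software_B_R1_y))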

1.8.10 Parametric and Nonparametric Tests


In hypothesis testing, statistical tests are applied to determine the validity of the hypoth-
esis. These tests can be categorized as either parametric or nonparametric. Parametric tests
are used for data samples having normal distribution (bell-shaped curve), whereas non-
parametric tests are used when the distribution of data samples is highly skewed. If the
assumptions of parametric tests are met, they are more powerful as they use more infor-
mation while computation. The difference between parametric and nonparametric tests is
presented in Table 1.4.
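
As an illustrative sketch (the two samples below are fabricated), the SciPy library provides both kinds of tests:

from scipy import stats

# Hypothetical defect densities of modules developed with two different techniques
group_a = [4.1, 3.8, 5.0, 4.6, 4.2, 3.9]
group_b = [5.2, 5.8, 4.9, 6.1, 5.5, 5.7]

# Parametric test: assumes (approximately) normally distributed samples
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric alternative: makes no normality assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(f"t-test p-value: {t_p:.4f}")
print(f"Mann-Whitney U p-value: {u_p:.4f}")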

1.8.11 Goal/Question/Metric Method


The Goal/Question/Metric (GQM) method was developed by Basili and Weiss (1984)
and is a result of their experience, research, and practical knowledge. The GQM method
consists of the following three basis elements:

1. Goal
2. Question
3. Metric

In GQM method, measurement is goal-oriented. Thus, first the goals need to be defined
that can be measured during the software development. The GQM method defines goals
that are transformed into questions and metrics. These questions are answered later to
determine whether the goals have been satisfied or not. Hence, GQM method follows
top-down approach for dividing goals into questions and mapping questions to metrics,
and follows bottom-up approach by interpreting the measurement to verify whether the
goals have been satisfied. Figure 1.15 presents the hierarchical view of GQM framework.
The figure shows that the same metric can be used to answer multiple questions.
For example, if the developer wants to improve the defect-correction rate during the
maintenance phase. The goal, question, and associated metrics are given as:

• Goal: Improve the defect-correction rate in the system.


• Question: How many defects have been corrected in the maintenance phase?
• Metric: Number of defects corrected/Number of defects reported.
• Question: Is the defect-correction rate satisfactory?
• Metric: Number of defects corrected/Number of defects reported.

The goals are defined as purposes, objects, and viewpoints (Basili et al. 1994). In the above
example, purpose is “to improve,” object is “defects,” and viewpoint is “project manager.”
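
As a trivial, hedged illustration (the counts below are invented), the metric defined above is computed as a simple ratio:

# Hypothetical counts collected during the maintenance phase
defects_reported = 120
defects_corrected = 96

# Metric: Number of defects corrected/Number of defects reported
defect_correction_rate = defects_corrected / defects_reported
print(f"Defect-correction rate: {defect_correction_rate:.2f}")  # 0.80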

TABLE 1.4
Difference between Parametric and Nonparametric Tests

                               Parametric Tests      Nonparametric Tests
Assumed distribution           Normal                Any
Data type                      Ratio or interval     Any
Measures of central tendency   Mean                  Median
Example                        t-test, ANOVA         Kruskal–Wallis–Wilcoxon test

Goal → Question 1, Question 2, Question 3, Question 4 → Metric 1, Metric 2, Metric 3, Metric 4, Metric 5, Metric 6

FIGURE 1.15
Framework of GQM.

Planning (project plan) → Definition (goal, question, metric) → Data collection (measurement, collecting data) → Interpretation (answering questions, goal evaluated)

FIGURE 1.16
Phases of GQM.

Figure 1.16 presents the phases of the GQM method. The GQM method has the following
four phases:

• Planning: In the first phase, the project plan is produced by recognizing the basic
requirements.
• Definition: In this phase goals, questions, and relevant metrics are defined.
• Data collection: In this phase actual measurement data is collected.
• Interpretation: In the final phase, the answers to the questions are provided and
the goal’s attainment is verified.

1.8.12 Software Archive or Repositories


The progress of the software is managed using software repositories that include source
code, documentation, archived communications, and defect-tracking systems. The infor-
mation contained in these repositories can be used by the researchers and practitioners for
maintaining software systems, improving software quality, and empirical validation of
data and techniques.
Researchers can mine these repositories to understand the software development, soft-
ware evolution, and make predictions. The predictions can consist of defects and changes
and can be used for planning of future releases. For example, defects can be predicted
using historical data, and this information can be used to produce less defective future
releases.
The data is kept in various types of software repositories such as CVS, Git, SVN,
ClearCase, Perforce, Mercurial, Veracity, and Fossil. These repositories are used for man-
agement of software content and changes, including documents, programs, user proce-
dure manuals, and other related information. The details of mining software repositories
are presented in Chapter 5.
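
As a hedged example of how such a repository might be mined, the following Python sketch counts how often each file was changed in a Git repository by parsing the output of git log; the repository path is a placeholder:

import subprocess
from collections import Counter

# Placeholder path to a local clone of a Git repository
repo_path = "/path/to/repository"

# List the files touched by every commit; --name-only prints the changed file paths
log_output = subprocess.run(
    ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True
).stdout

# Count changes per file; frequently changed files are candidates for
# change-proneness analysis when planning future releases
change_counts = Counter(line for line in log_output.splitlines() if line.strip())
for path, count in change_counts.most_common(10):
    print(count, path)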

1.9 Concluding Remarks


It is very important for a researcher, academician, practitioner, and a student to understand
the procedures and concepts of ESE before beginning the research study. However, there is
a lack of understanding of empirical concepts and techniques, and a level of uncertainty
about the use of empirical procedures and practices in software engineering. The goal
of the subsequent chapters is to present empirical concepts, procedures, and practices that
can be used by the research community in conducting effective and well-formed research
in software engineering field.

Exercises
1.1 What is empirical software engineering? What is the purpose of empirical soft-
ware engineering?
1.2 What is the importance of empirical studies in software engineering?
1.3 Describe the characteristics of empirical studies.
1.4 What are the five types of empirical studies?
1.5 What is the importance of replicated and repeated studies in empirical software
engineering?
1.6 Explain the difference between an experiment and a case study.
1.7 Differentiate between quantitative and qualitative research.
1.8 What are the steps involved in an experiment? What are characteristics of a good
experiment?

1.9 What are ethics involved in a research? Give examples of unethical research.
1.10 Discuss the following terms:
a. Hypothesis testing
b. Ethics
c. Empirical research
d. Software quality
1.11 What are systematic reviews? Explain the steps in systematic review.
1.12 What are the key issues involved in empirical research?
1.13 Compare and contrast classification and prediction process.
1.14 What is GQM method? Explain the phases of GQM method.
1.15 List the importance of empirical research from the perspective of software indus-
tries, academicians, and researchers.
1.16 Differentiate between the following:
a. Parametric and nonparametric tests
b. Independent, dependent and confounding variables
c. Quantitative and qualitative data
d. Within-company and cross-company analysis
e. Proprietary and open source software

Further Readings
Kitchenham et al. provide effective guidelines for empirical research in software
engineering:

B. A. Kitchenham, S. L. Pfleeger, L. M. Pickard, P. W. Jones, D. C. Hoaglin, K. E. Emam,


and J. Rosenberg, “Preliminary guidelines for empirical research in software
engineering,” IEEE Transactions on Software Engineering, vol. 28, pp. 721–734, 2002.

Juristo and Moreno explain a good number of concepts of empirical software engineering:

N. Juristo, and A. N. Moreno, “Lecture notes on empirical software engineering,”


Series on Software Engineering and Knowledge Engineering, World Scientific, vol. 12,
2003.

The basic concept of qualitative research is presented in:

N. Mays, and C. Pope, “Qualitative research: Rigour and qualitative research,” British
Medical Journal, vol. 311, no. 6997, pp. 109–112, 1995.
A. Strauss, and J. Corbin, Basics of Qualitative Research: Techniques and Procedures for
Developing Grounded Theory, Sage Publications, Thousand Oaks, CA, 1998.

A collection of research from top empirical software engineering researchers focusing on


the practical knowledge necessary for conducting, reporting, and using empirical methods
in software engineering can be found in:

J. Singer, and D. I. K. Sjøberg, Guide to Advanced Empirical Software Engineering, Edited


by F. Shull, Springer, Berlin, Germany, vol. 93, 2008.

Details about ethical issues in empirical software engineering are presented in:

J. Singer, and N. Vinson, “Ethical issues in empirical studies of software engineer-


ing,” IEEE Transactions on Software Engineering, vol. 28, pp. 1171–1180, NRC 44912,
2002.

An overview of empirical observations and laws is provided in:

A. Endres, and D. Rombach, A Handbook of Software and Systems Engineering: Empirical


Observations, Laws, and Theories, Addison-Wesley, New York, 2003.

The authors present detailed practical guidelines on the preparation, conduct, design, and
reporting of case studies in software engineering in:

P. Runeson, M. Host, A. Rainer, and B. Regnell, Case Study Research in Software


Engineering: Guidelines and Examples, John Wiley & Sons, New York, 2012.

The following research paper provides detailed explanations about software quality
attributes:

I. Gorton (ed.), “Software quality attributes,” In: Essential Software Architecture,


Springer, Berlin, Germany, pp. 23–38, 2011.

An in-depth knowledge of prediction is mentioned in:

A. J. Albrecht, and J. E. Gaffney, “Software function, source lines of code, and devel-
opment effort prediction: A software science validation,” IEEE Transactions on
Software Engineering, vol. 6, pp. 639–648, 1983.

The following research papers provide a brief knowledge of quantitative and qualitative
data in software engineering:

A. Rainer, and T. Hall, “A quantitative and qualitative analysis of factors affecting


software processes,” Journal of Systems and Software, vol. 66, pp. 7–21, 2003.
C. B. Seaman, “Qualitative methods in empirical studies of software engineering,”
IEEE Transactions on Software Engineering, vol. 25, pp. 557–572, 1999.

A useful concept of how to analyze qualitative data is presented in:

A. Bryman, and B. Burgess, Analyzing Qualitative Data, Routledge, New York, 2002.

Basili explains the major role of controlled experiments in the software engineering field in:

V. Basili, The Role of Controlled Experiments in Software Engineering Research, Empirical


Software Engineering Issues, LNCS 4336, Springer-Verlag, Berlin, Germany,
pp. 33–37, 2007.

The following paper presents reporting guidelines for controlled experiments:

A. Jedlitschka, and D. Pfahl, “Reporting guidelines for controlled experiments in


software engineering,” In Proceedings of the International Symposium on Empirical
Software Engineering Symposium, IEEE, Noosa Heads, Australia, pp. 95–104, 2005.

A detailed explanation of within-company and cross-company concepts with sample case
studies may be obtained from:

B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost


estimation studies: A systematic review,” IEEE Transactions on Software Engineering,
vol. 33, pp. 316–329, 2007.

The concepts of proprietary, open source, and university software are well explained in the
following research paper:

A. MacCormack, J. Rusnak, and C. Y. Baldwin, “Exploring the structure of com-


plex software designs: An empirical study of open source and proprietary code,”
Management Science, vol. 52, pp. 1015–1030, 2006.

The concept of parametric and nonparametric tests may be obtained from:

D. G. Altman, and J. M. Bland, “Parametric v non-parametric methods for data analy-


sis,” British Medical Journal, 338, 2009.

The book by Solingen and Berghout is a classic and very useful reference, and it gives a
detailed discussion of the GQM method:

R. V. Solingen, and E. Berghout, The Goal/Question/Metric Method: A Practical Guide for


Quality Improvement of Software Development, McGraw-Hill, London, vol. 40, 1999.

A classic report written by Prieto-Díaz explains the concept of software repositories:

R. Prieto-Díaz, “Status report: Software reusability,” IEEE Software, vol. 10, pp. 61–66,
1993.
2
Systematic Literature Reviews

Review of existing literature is an essential step before beginning any new research.
Systematic reviews (SRs) synthesize the existing research work in such a manner that can be
analyzed, assessed, and interpreted to draw meaningful conclusions. The aim of conducting
an SR is to gather and interpret empirical evidence from the available research with respect
to formed research questions. The benefit of conducting an SR is to summarize the existing
trends in the available research, identify gaps in the current research, and provide future
guidelines for conducting new research. The SRs also provide empirical evidence in sup-
port or opposition of a given hypothesis. Hence, the author of the SR must make all the
efforts to provide evidence that support or does not support a given research hypothesis.
In this chapter, guidelines for conducting SRs are given for software engineering research-
ers and practitioners. The steps to be followed while conducting an SR including planning,
conducting and reporting phases are described. The existing high-quality reviews in the
areas of software engineering are also presented in this chapter.

2.1 Basic Concepts


SRs are better planned, more rigorous, and thoroughly analyzed as compared to surveys
or literature reviews. In this section, we provide an overview of SRs and compare them
with traditional surveys.

2.1.1 Survey versus SRs


Literature survey is the process of summarizing, organizing, and documenting the exist-
ing research to understand the research carried out in the field. On the other hand, an SR
is the process of systematically and critically analyzing the information extracted from the
existing research to answer the established research questions. The literature survey only
provides the summary of the results of existing literature, whereas an SR opens avenues
for new research as it provides future directions for researchers based on thorough analy-
sis of existing literature. Kitchenham (2007) defined SR as:
A systematic literature review (often referred to as a systematic review) is a means of
identifying, evaluating and interpreting all available research relevant to a particular
research question, or topic area, or phenomenon of interest.

Glossary of evidence-based medicine (EBM) terms defines SR as (http://ktclearinghouse.ca/cebm/glossary/):
A summary of the medical literature that uses explicit methods to perform a comprehen-
sive literature search and critical appraisal of individual studies and that uses appropriate
statistical techniques to combine these valid studies.


TABLE 2.1
Comparison of Systematic Reviews and Literature Survey

1. Systematic review: The goal is to identify best practices, strengths and weaknesses of specific techniques, procedures, tools, or methods by combining information from various studies.
   Literature survey: The goal is to classify or categorize existing literature.
2. Systematic review: Focused on research questions that assess the techniques under investigation.
   Literature survey: Provides an introduction of each paper in literature based on the identified area.
3. Systematic review: Provides a detailed review of existing literature.
   Literature survey: Provides a brief overview of existing literature.
4. Systematic review: Extracts technical and useful metadata from the contents.
   Literature survey: Extracts general research trends from the studies.
5. Systematic review: Search process is more stringent such that it involves searching references or contacting researchers in the field.
   Literature survey: Search process is less stringent.
6. Systematic review: Strong assessment of quality is necessary.
   Literature survey: Strong assessment of quality is not necessary.
7. Systematic review: Results are based on high-quality evidence with the aim to answer research questions.
   Literature survey: Results only provide a summary of existing literature.
8. Systematic review: Often uses statistics to analyze the results.
   Literature survey: Does not use statistics to analyze the results.

SRs summarize high-quality research on a specific area. They provide the best available
evidence on a particular technique or technology and produce conclusions that can be
used by the software practitioners and researchers to select the best available techniques
or methodologies. The studies included in the review are known as primary studies and
the SRs are known as secondary studies. Table 2.1 presents the summary of difference
between SR and literature survey.

2.1.2 Characteristics of SRs


The following are the main characteristics of an SR:

1. It selects high-quality research papers and studies that are relevant, important,
and essential, which are summarized in the form of one review paper.
2. It performs a systematic search by forming a search strategy to identify primary
studies from the digital libraries. The search strategy is documented so that the
readers can analyze the completeness of the process and repeat the same.
3. It forms a valid review protocol and research questions that address the issues to
be answered in the SR.
4. It clearly summarizes the characteristics of each selected study, including aims,
techniques, and methods used in the studies.
5. It consists of a justified quality assessment criteria for inclusion and exclusion of
the studies in the SR so that the effectiveness of each study can be determined.
6. It uses a number of presentation tools for reporting the findings and results of the
selected studies to be included in the SR.
7. It identifies gaps in the current findings and highlights future directions.

2.1.3 Importance of SRs


An SR is conducted using scientific methods and minimizes the bias in the studies. The
SRs are important as:

1. They gather important empirical evidence on the technique or method being


focused in the SR. On the basis of the empirical evidence, the strengths and weak-
nesses of the technique may be summarized.
2. They identify the gaps in the current research.
3. They report the commonalities and the differences in the primary studies.
4. They provide future guidelines and framework to researchers and practitioners to
perform new research.

2.1.4 Stages of SRs


SR consists of a series of steps that are carried out throughout the review process and pro-
vides a summary of important issues raised in the study. The stages in the SR enable the
researchers to conduct the review in an organized manner. The activities included in the
SR are as follows:

1. Planning the review


2. Conducting the review
3. Reporting the review results

The procedure followed in performing the SR is given by Kitchenham et al. (2007). The
process is depicted in Figure 2.1. In the first step, the need for the SR is examined and in the
second step the research questions are formed that address the issues to be answered in
the review. Thereafter, the review protocol is developed that includes the following steps:
search strategy design, study selection criteria, study quality assessment criteria, data
extraction process, and data synthesis process.
The formation of review protocol consists of a series of stages. In the first step, the
search strategy is formed, including identification of search terms and selection of
sources to be searched to identify the primary studies. The next step involves deter-
mination of relevant studies by setting the inclusion and exclusion criteria for select-
ing review studies. Thereafter, quality assessment criteria are identified by forming the
quality assessment questionnaire to analyze and assess the studies. The second to last
stage involves the design of data extraction forms to collect the required information
to answer the research questions, and in the final stage, methods for data synthesis are
devised. Development of review protocol is an important step in an SR as it reduces the
possibility and risk of research bias in the SR. Finally, in the planning stage, the review
protocol is evaluated.
The steps planned in the first stage are actually performed in the conducting stage that
includes actual collection of relevant studies by applying first the search strategy and then
the inclusion and exclusion criteria. Each selected study is ranked according to the qual-
ity assessment criteria, and the data extraction and data synthesis steps are followed from
only the selected high-quality primary studies. In the final phase, the results of the SR are
reported. This step further involves examining, presenting, and verifying the results.

Planning the review:
1. Identify the need for systematic review
2. Identify research questions
3. Develop review protocol
4. Evaluate review protocol

Conducting the review:
5. Search strategy execution
6. Selection of primary studies
7. Study quality assessment
8. Data extraction
9. Data synthesis

Reporting the review:
10. Reporting the review results

FIGURE 2.1
Systematic review process.

The above stages defined in the SR are iterative and not sequential. For example, the criteria
for inclusion and exclusion of primary studies must be developed prior to collecting the
studies. The criteria may be refined in the later stages.

2.2 Case Study


Software fault prediction (SFP) involves prediction of classes/modules in a software as
faulty or nonfaulty based on the object oriented (OO) metrics for corresponding classes
or modules. The identification of faulty or nonfaulty classes/modules enables researchers
and practitioners to identify faulty portions in the early phases of software development.
These faulty portions need extra attention during software development and the prac-
titioners may focus testing resources on them. There are many techniques such as the
statistical and the machine learning (ML) that can be used for classifying a class as faulty

or nonfaulty. We conducted an SR of 64 primary studies published from January 1991 to
October 2013 for SFP using the ML techniques (Malhotra 2015). The aim of the study is to gather
empirical evidence from the literature to facilitate the use of the ML techniques for SFP.
The study analyzes and assesses the gathered evidence regarding the use and perfor-
mance of the ML techniques.
This case study is taken as an example review to explain all the steps in the SR in the
subsequent sections and will be referred as systematic review of machine learning tech-
niques (SRML). The detailed results of the case study can be found in Malhotra (2015).

2.3 Planning the Review


Before one begins with the review, it is essential to recognize the need for the
review. After identifying the need for the SR, the researcher should form the research questions.
Subsequently, the researchers must develop, document, and analyze the review protocol.

2.3.1 Identify the Need for SR


The identification of need for an SR is the most essential and crucial step while performing
an SR. For example, Singh et al. (2014) identified the need of a structured review that can
provide similarities and differences between results of existing studies on fault proneness.
In their study, a summary of the results of the studies that predict fault proneness was
provided. Radjenović et al. (2013) observed that many software metrics have been proposed
in the literature and many of these metrics have been used for fault prediction. However,
finding an appropriate suite of metrics was found to be essential because of the differences
in the performance of the metrics. They concluded that there should be more studies that use
industrial data sets so that metrics that can be used in the industrial settings can be identified.
To justify the importance of the SR, this step involves the review of all the existing SRs
conducted in the same software engineering domain, thus recognizing the existing works
and identifying the areas that need to be addressed in the new SR.
The following questions need to be determined before conducting the SR:

1. How many primary studies are available in the software engineering context?
2. What are the strength and weaknesses of the existing SR (if any) in the software
engineering context?
3. What is the practical relevance of the proposed SR?
4. How will the proposed SR guide practitioners and researchers?
5. How can the quality of the proposed SR be evaluated?

Checklist is the most common mechanism used for reviewing the quality of the existing SR
in the same area. It may also identify the flaws in the existing SR. A checklist may consist
of a list of questions to determine the effectiveness of the existing SR. Table 2.2 shows an
example of the checklist to assess the quality of an SR. The checklist consists of questions
pertaining to the procedures and processes followed during an SR. The existing studies
may be rated on a scale of 1–12 so that the quality of each study can be determined.

TABLE 2.2
Checklist for Evaluating Existing SR
S. No. Questions

1 Is the aim of the review stated?


2 Is the search strategy appropriate?
3 Are the research questions justified?
4 Is the inclusion/exclusion criteria appropriate?
5 Is the quality assessment criteria applied?
6 Are independent reviewers used for quality evaluation of primary
studies?
7 Is the data collected from the primary sources in an appropriate
manner?
8 Is the data synthesis process effectively carried out?
9 Are the characteristics of the primary studies described?
10 Is any empirical evidence collected from the primary studies to
reach a conclusion?
11 Does the review identify gaps in the existing literature?
12 Is the interpretation of the results stated and the guidelines for
future research identified?

We may establish a threshold value to identify the quality level of the study. If the rating of
the existing SR falls below the established threshold value, the quality of the study may be
considered unacceptable and a new SR on the same topic may be conducted.
Thus, if an SR in the same domain with similar aims is located but it was conducted a
long time ago, then a new SR adding current studies may be justified. However, if the exist-
ing SR is still relevant and is of high quality, then a new SR may not be required.

2.3.2 Formation of Research Questions


The process of formation of the research questions involves identification of relevant issues
that need to be answered by the SR. According to Kitchenham (2007), it is the most impor-
tant activity in any SR. The structure of an SR depends on the content of the research
questions formed, and key decisions are based on the questions such as: Which studies
to focus? Where to search them? How to assess the quality of these studies? Hence, the
research questions must be well formed and constructed after a thorough analysis. The
data for answering the identified research questions is collected from the primary studies.
While constructing the research questions, the target audience, the tools and techniques to
be evaluated, outcomes of the study, and the environment in which the study is conducted
(academic or industry) must be determined. Hence, the following things must be kept in
mind while forming the research questions:

• Which areas have already been explored in the existing reviews (if any)?
• Which areas are relevant and need to be explored/answered during the
proposed SR?
• Are the questions important to the researchers and software practitioners?
• Will the questions assess any similarities in the trends or identify any deviation
from the existing trends?

TABLE 2.3
Research Questions for SRML Case Study (Malhotra 2015)

RQ1: Which ML techniques have been used for SFP?
  Motivation: Identify the ML techniques commonly being used in SFP.
RQ2: What kind of empirical validation for predicting faults is found using the ML techniques found in RQ1?
  Motivation: Assess the empirical evidence obtained.
RQ2.1: Which techniques are used for subselecting metrics for SFP?
  Motivation: Identify techniques reported to be appropriate for selecting relevant metrics.
RQ2.2: Which metrics are found useful for SFP?
  Motivation: Identify metrics reported to be appropriate for SFP.
RQ2.3: Which metrics are found not useful for SFP?
  Motivation: Identify metrics reported to be inappropriate for SFP.
RQ2.4: Which data sets are used for SFP?
  Motivation: Identify data sets reported to be appropriate for SFP.
RQ2.5: Which performance measures are used for SFP?
  Motivation: Identify the measures that can be used for assessing the performance of the ML techniques for SFP.
RQ3: What is the overall performance of the ML techniques for SFP?
  Motivation: Investigate the performance of the ML techniques for SFP.
RQ4: Is the performance of the ML techniques better than that of statistical techniques?
  Motivation: Compare the performance of the ML techniques over statistical techniques for SFP.
RQ5: Are there any ML techniques that significantly outperform other ML techniques?
  Motivation: Assess the performance of the ML techniques over other ML techniques for SFP.
RQ6: What are the strengths and weaknesses of the ML techniques?
  Motivation: Determine the conditions that favor the use of ML techniques.

The following questions address various issues related to SR on the use of the ML
techniques for SFP:

• Which ML techniques have been used for SFP?


• Which metrics have been used for SFP?
• What type of data sets have been used for SFP?
• What is the accuracy of the ML techniques for SFP?
• Is the performance of the ML techniques better than the traditional statistical
techniques for SFP?

Table 2.3 presents the research questions along with the motivation for SRML. While
forming the research questions, the interest of the researchers must be kept in mind.
For example, for Masters and PhD student thesis, it is necessary to identify the research
relevant to the proposed work so that the current body of knowledge can be formed and
the proposed work can be established.

2.3.3 Develop Review Protocol


The development of review protocol is an important step in an SR as it reduces the
possibility and risk of research bias in the SR. The development of review protocol involves
defining the basic research process and procedures that will be followed during the SR.

1. Development of search strategy (search terms, digital libraries)
2. Formation of inclusion and exclusion criteria
3. Construction of quality assessment checklists
4. Development of data extraction forms
5. Identification of study synthesis techniques

FIGURE 2.2
Steps involved in a review protocol.

In this step, the planning of the search strategy, study selection criteria, quality assessment
criteria, data extraction, and data synthesis is carried out.
The purpose of the review must state the options researchers have when deciding which
technique or method to adopt in practice. The review protocol is established by frequently
holding meetings and group discussions within a group comprising, preferably, senior
members with experience in the area. Hence, this step is iterative, and the protocol is defined
and refined over various iterations. Figure 2.2 shows the steps involved in the development
of the review protocol.
The first step involves formation of search terms, selection of digital libraries that must
be searched, and refinement of search terms. This step allows identification of primary
studies that will address the research questions. The initial search terms may be identified
by the following steps to form the best suited search string:

• Breaking down the research questions into individual units.


• Using search terms in the titles, keywords, and abstracts of relevant studies.
• Identifying alternative terms and synonyms for the main search terms.

Thereafter, the sophisticated search terms are formed by incorporating alternative terms
and synonyms using Boolean expression “OR” and combining main search terms using
“AND.” The following general search terms were used for identification of primary studies
in SRML case study:
Software AND (fault OR defect OR error) AND (proneness OR prone OR prediction OR
probability) AND (regression OR ML OR soft computing OR data mining OR classifica-
tion OR Bayesian network OR neural network [NN] OR decision tree OR support vector
machine OR genetic algorithms OR random forest [RF]).
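
A small sketch of how such a search string could be assembled from lists of synonyms (the lists below simply mirror the terms above) is shown here:

# Groups of synonyms are joined with OR; the groups are then combined with AND
term_groups = [
    ["software"],
    ["fault", "defect", "error"],
    ["proneness", "prone", "prediction", "probability"],
    ["regression", "machine learning", "soft computing", "data mining",
     "classification", "Bayesian network", "neural network", "decision tree",
     "support vector machine", "genetic algorithms", "random forest"],
]

search_string = " AND ".join(
    "(" + " OR ".join(group) + ")" if len(group) > 1 else group[0]
    for group in term_groups
)
print(search_string)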

After identifying the search terms, the relevant and important digital portals are to be
selected. The portals publishing the journal articles are the right place to search for the rel-
evant studies. The bibliographic databases are also common place of search as they provide
title, abstract, and publication source of the study. The selection of digital libraries/portals
is very essential, as the number of studies found is dependent on it. Generally, several
libraries must be searched to find all the relevant studies that cover the research questions.
The selection must not be restricted by the availability of digital portals at the home uni-
versities. For example, the following seven electronic digital libraries may be searched for
the identification of primary studies:

1. IEEE Xplore
2. ScienceDirect
3. ACM Digital Library
4. Wiley Online Library
5. Google Scholar
6. SpringerLink
7. Web of Science

The reference section of the relevant studies must also be examined/scanned to identify the
other relevant studies. The external experts in the areas may also be contacted in this regard.
The next step is to establish the inclusion and exclusion criteria for the SR. The inclusion
and exclusion criteria allow the researchers to decide whether to include or exclude the
study in the SR. The inclusion and exclusion criteria are based on the research questions.
For example, the studies that use data collected from university software developed by
student programmers or experiments conducted by students may be excluded from the
SR. Similarly, the studies that do not perform any empirical analysis on the techniques
and technologies that are being examined in the SR may be excluded. Hence, the inclusion
criteria may be specific to the type of tool, technique, or technology being explored in the
SR. The data on which the study was conducted or the type of empirical data being used
(academia or industry/small, medium, or large sized) may also affect the inclusion criteria.
The following inclusion and exclusion criteria were formed in SRML review:

Inclusion criteria:
• Empirical studies using the ML techniques for SFP.
• Empirical studies combining the ML and non-ML techniques.
• Empirical studies comparing the ML and statistical techniques.
Exclusion criteria:
• Studies without empirical analysis or results of use of the ML techniques for SFP.
• Studies based on fault count as dependent variable.
• Studies using the ML techniques in context other than SFP.
• Similar studies, that is, studies by the same author in conference as well-
extended version in journal. However, if the results were different in both the
studies, they were retained.
• Studies that only use statistical techniques for SFP.
• Review studies.

The above inclusion and exclusion criteria were applied to each relevant study by two
researchers independently, and they reached a common decision after detailed
discussion. In case of any doubt, the full text of a study was reviewed and a final decision
regarding the inclusion/exclusion of the study was made. Hence, more than one reviewer
should check the relevance of a study based on the inclusion and exclusion criteria before
a final decision for inclusion or exclusion of a study is made.
The third step in development of a review protocol is to form the quality questionnaire
for assessing the relevance and strength of the primary studies. The quality assessment is
necessary to investigate and analyze the quality and determine the strength of the stud-
ies to be included in final synthesis. It is necessary to limit the bias in the SR and provide
guidelines for interpretation of the results.
The assessment criteria must be based on the relevance of a particular study to the
research questions and the quality of the processes and methods used in the study.
In addition, quality assessment questions must focus on experimental design, appli-
cability of results, and interpretation of results. Some studies may meet the inclusion
criteria but may not be relevant with respect to the research design, the way in which
data is collected, or may not justify the use of various techniques analyzed. For example,
a study on fault proneness may not perform comparative analysis of ML and non-ML
techniques.
The quality questionnaire must be constructed by weighing the studies with numerical val-
ues. Table 2.4 presents the quality assessment questions for any SR. The studies are rated
according to each question and given a score of 1 (yes) if it is satisfactory, 0.5 (partly) if it is
moderately satisfactory, and a score of 0 (no) if it is unsatisfactory. The final score is obtained
after adding the values assigned to each question. A study could have a maximum score of
10 and a minimum score of 0, if ranked on the basis of quality assessment questions formed
in Table 2.4. The studies with low-quality scores may be excluded from the SR or final list of
primary studies.
In addition to the questions given in Table 2.4, the following four additional questions
were formed in SRML review (see Table 2.5). Hence, a researcher may create specific qual-
ity assessment questions with respect to the SR.
The quality score along with the level assigned to the study in the example case study
SRML taken in this chapter is given in Table 2.6. The reviewers must decide a threshold
value for excluding a study from the SR. For example, studies with quality score >9 were
considered for further data extraction and synthesis in SRML review.
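
A minimal sketch of this scoring step, with invented study names and answers, might compute the quality score of each study and apply a threshold as follows:

# Hypothetical answers to the quality questions: 1 = yes, 0.5 = partly, 0 = no
study_answers = {
    "Study A": [1, 1, 0.5, 1, 0, 1, 1, 0.5, 1, 1, 1, 1, 0.5, 1],
    "Study B": [1, 0.5, 0, 0.5, 0, 1, 0.5, 0, 0.5, 1, 0, 0.5, 0, 0],
}

threshold = 9  # e.g., keep studies with a quality score above 9, as in the SRML review

for study, answers in study_answers.items():
    score = sum(answers)
    decision = "include" if score > threshold else "exclude"
    print(f"{study}: score = {score}, {decision}")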

TABLE 2.4
Quality Assessment Questions
Q# Quality Questions Yes Partly No

Q1 Are the aims of the research clearly stated?


Q2 Are the independent variables clearly defined?
Q3 Is the data set size appropriate?
Q4 Is the data-collection procedure clearly defined?
Q5 Is attributes subselection technique used?
Q6 Are the techniques clearly defined?
Q7 Are the results and findings clearly stated?
Q8 Are the limitations of the study specified?
Q9 Is the research methodology repeatable?
Q10 Does the study contribute/add to the literature?

TABLE 2.5
Additional Quality Assessment Questions for SRML Review
Q# Quality Questions Yes Partly No

Q11 Are the ML techniques justified?


Q12 Are the performance measures used to assess the
SFP models clearly defined?
Q13 Is there any comparative analysis conducted
among statistical and ML techniques?
Q14 Is there any comparative analysis conducted
among different ML techniques?

TABLE 2.6
Quality Scores for Quality Assessment Questions Given in Table 2.4

Quality Score      Level
9 ≤ score ≤ 10     Very high
6 ≤ score ≤ 8      High
4 ≤ score ≤ 5      Medium
0 ≤ score ≤ 3      Low

The next step is to construct data extraction forms that will help to summarize the infor-
mation extracted from the primary studies in view of the research questions. The details of
which specific research questions are answered by a specific primary study are also present
in the data extraction form. Hence, one of the aims of data extraction is to find which
primary study addresses which research question. In many cases, the
data extraction forms will extract the numeric data from the primary studies that will
help to analyze the results obtained from these primary studies. The first part of the data
extraction card summarizes the author name, title of the primary study, and publishing
details, and the second part of the data extraction form contains answers to the research
questions extracted from a given primary study. For example, the data set details, indepen-
dent variables (metrics), and the ML techniques are summarized for the SRML case study
(see Figure 2.3).
A team of researchers must collect the information from the primary studies. However,
because of the time and resource constraints at least two researchers must evaluate the
primary studies to obtain useful information to be included in the data extraction card.
The results from these two researchers must then be matched and if there is any disagree-
ment between them, then other researchers may be consulted to resolve these disagree-
ments. The researchers must clearly understand the research questions and the review
protocol before collecting the information from the primary studies. In case of Masters
and PhD students, their supervisors may collect information from the primary studies and
then match their results with those obtained by the students.
The last step involves identification of data synthesis tools and techniques to summarize
and interpret the information obtained from the primary studies. The basic objective while
synthesizing data is to accumulate and combine facts and figures obtained from the selected
primary studies to formulate a response to the research questions. Tables and charts may be
used to highlight the similarities and differences between the primary studies. The following


Section I
Reviewer name
Author name
Title of publication
Year of publication
Journal/conference name
Type of study
Section II
Data set used
Independent variables
Feature subselection methods
ML techniques used
Performance measures used
Values of accuracy measures
Strengths of ML techniques
Weaknesses of ML techniques

FIGURE 2.3
Data extraction form.

steps need to be followed before deciding the tools and methods to be used for depicting the
results of the research questions:

• Decide which studies to include for answering a particular research question.


• Summarize the information obtained by the primary studies.
• Interpret the information depicted by the answer to the research question.

The effects of the results (performance measures) obtained from the primary studies may
be analyzed using statistical measures such as mean, median, and standard deviation (SD).
In addition, the outliers present in the results may be identified and removed using
various methods such as box plots. We must also use various tools such as bar charts,
scatter plots, forest plots, funnel plots, and line charts to visually present the results of
the primary studies in the SR. The aggregation of the results from various studies will
allow researchers to provide strong and well-acceptable conclusions and may give strong
support in proving a point. The data obtained from these studies may be quantitative
(expressed in the form of numerical measures) or qualitative (expressed in the form of
descriptive information/texts). For example, the values of performance measures are
quantitative in nature, and the strengths and weaknesses of the ML techniques are quali-
tative in nature.
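
As an illustrative sketch (the accuracy values below are invented), quantitative results gathered from primary studies can be summarized and screened for outliers as follows:

import statistics
import matplotlib.pyplot as plt

# Hypothetical accuracy values (%) reported by primary studies for one technique
accuracies = [72.5, 80.1, 78.3, 85.0, 69.4, 91.2, 55.0, 83.7]

print("Mean:", round(statistics.mean(accuracies), 2))
print("Median:", round(statistics.median(accuracies), 2))
print("Standard deviation:", round(statistics.stdev(accuracies), 2))

# Box plot to visually identify extreme values (potential outliers)
plt.boxplot(accuracies)
plt.ylabel("Reported accuracy (%)")
plt.title("Accuracy values across primary studies")
plt.savefig("accuracy_boxplot.png")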
A detailed description of the methods and techniques that are identified to represent
answers to the established research questions in the SRML case study for SFP using the
ML techniques is stated as follows:

• To summarize the number of ML techniques used in primary studies, the SRML case
study will use a visualization technique, that is, a line graph to depict the number of
studies pertaining to the ML techniques in each year, and will present a classification
taxonomy of various ML techniques with their major categories and subcategories.
The case study will also present a bar chart that shows the total number of studies
conducted for each main category of the ML technique and pie charts that depict the
distribution of selected studies into subcategories for each ML category.
• The case study will use counting method to find the feature subselection tech-
niques, useful and not useful metrics, and commonly used data sets for SFP. These
subparts will be further aided by graphs and pie charts that showcase the distribu-
tion of selected primary studies for metrics usage and data set usage. Performance
measures will be summarized with the help of a table and a graph.
• The comparison of the result of the primary studies is shown with the help of a table
that compares six performance measures for each ML technique. The box plots will be
constructed to identify extreme values corresponding to each performance measure.
• A bar chart will be created to depict and analyze the comparison between the
performance of the statistical and ML techniques.
• The strengths and weaknesses of different ML techniques for SFP will be sum-
marized in tabular format.

Finally, the review protocol document may consist of the following sections:

1. Background of the review


2. Purpose of the review
3. Contents
a. Search strategy
b. Inclusion and exclusion criteria
c. Study quality assessment criteria
d. Data extraction
e. Data synthesis
4. Review evaluation criteria
5. References
6. Appendix

2.3.4 Evaluate Review Protocol


For the evaluation of review protocol a team of independent reviewers must be formed.
The team must frequently hold meetings and group discussions to evaluate the complete-
ness and consistency of the review protocol. The evaluation of review protocol involves the
confirmation of the following:

1. Development of appropriate search strings that are derived from research questions
2. Adequacy of inclusion and exclusion criteria
3. Completeness of quality assessment questionnaire
4. Design of data extraction forms that address various research questions
5. Appropriateness of data analysis procedures

Master's and PhD students must present the review protocol to their supervisors for comments and analysis.

2.4 Methods for Presenting Results


Data synthesis provides a summary of the knowledge gained from the existing studies with respect to a specific research question. The appropriate technique for qualitative and quantitative synthesis depends on the type of research question being answered. Narrative synthesis and visual representation can be used to summarize the research results of the SR.

2.4.1 Tools and Techniques


The following tools can be used for summarizing and presenting the resultant
information:

1. Tabulation: It is the most common approach for representing qualitative and


quantitative data. The description of an approach can be summarized in tabular
form. The details of study assessment, study design, outcomes of the measure, and
the results of the study can be presented in tables. Each table must be referred to and interpreted in the results section.
2. Textual descriptions: They are used to highlight the main findings of the studies.
The most important findings/outcomes and comparison results must be emphasized in the review, and the less important issues should not be overemphasized in the text.
3. Visual diagrams: There are various diagrams that can be used to present and
summarize the findings of the study. Meta-analysis is a statistical method to
analyze the results of the independent studies so that generalized conclusions can
be produced. The outcomes obtained from a given study can be either binary or
continuous.
a. For binary variables the following effects are of interest:
i. Relative risk (RR, or risk ratio): Risk measures the strength of the relationship between the presence of an attribute and the occurrence of an outcome. RR is the ratio of the risks of a positive outcome in the two groups included in a study.
Table 2.7 shows a 2 × 2 contingency table, where a11, a12, a21, and a22 represent the number of samples in each group with respect to each outcome.
Table 2.7 can be used to calculate RR, as shown below:

Risk 1 = a11/(a11 + a12), Risk 2 = a21/(a21 + a22)

TABLE 2.7
Contingency Table for Binary Variable
Outcome Present Outcome Absent

Group 1 a11 a12


Group 2 a21 a22

RR = Risk 1/Risk 2
ii. Odds ratio (OR): It measures the strength of association between the presence of an attribute and an outcome. It is the ratio of the odds of the outcome in the two groups; a value greater than one indicates that the odds of the outcome are higher in the first group. The OR is defined as:

Odds 1 = a11/a12, Odds 2 = a21/a22

OR = Odds 1/Odds 2

iii. Risk difference: It is also known as the measure of absolute effect. It is calculated as the difference between the observed risks (the proportion of samples with the outcome of interest in each group) of the two groups. The risk difference is given as:

Risk difference = Risk 1 − Risk 2


iv. Area under the ROC curve (AUC): It is obtained from the receiver operating characteristics (ROC) analysis (for details refer to Chapter 6) and is used to evaluate a model's accuracy by plotting sensitivity against 1 − specificity at various cutoff points.
Consider the example given in Table 2.8, which shows the contingency table for classes that are coupled or not coupled in a software system with respect to the faulty or nonfaulty binary outcomes.
The values of RR, OR, and risk difference are given below:

Risk 1 = 31/(31 + 4) = 0.885, Risk 2 = 4/(4 + 99) = 0.038

RR = 0.885/0.038 = 23.289

Odds 1 = 31/4 = 7.75, Odds 2 = 4/99 = 0.04

OR = 7.75/0.04 = 193.75

Risk difference = 0.885 − 0.038 = 0.847

TABLE 2.8
Example Contingency Table for Binary
Variable
Faulty Not Faulty Total

Coupled 31 4 35
Not coupled 4 99 103
Total 35 103 138

b. For continuous variables (variables that do not have any specified range), the
following commonly used effects are of interest:
i. Mean difference: This measure is used when a study reports the same type of outcome and measures it on the same scale. It is also known as "difference of means." It represents the difference between the mean values of the two groups (Kitchenham 2007). Let Xg1 and Xg2 be the means of the two groups (say g1 and g2); the mean difference is defined as:

Mean difference = Xg1 − Xg2

ii. Standardized mean difference: It is used when a study reports the same type of outcome measure but measures it in different ways. For example, the size of a program may be measured by function points or lines of code. The standardized mean difference is defined as the ratio of the difference between the means of the two groups to the pooled SD. Let SDpooled be the SD pooled across groups, SDg1 be the SD of one group, SDg2 be the SD of the other group, and ng1 and ng2 be the sizes of the two groups. The formula for standardized mean difference is given below:

Standardized mean difference = (Xg1 − Xg2)/SDpooled

where

SDpooled = sqrt([(ng1 − 1)SDg1² + (ng2 − 1)SDg2²]/(ng1 + ng2 − 2))

For example, let Xg1 = 110, Xg2 = 100, SDg1 = 5, SDg2 = 4, and ng1 = ng2 = 20 for a sample population. Then,

SDpooled = sqrt([(20 − 1) × 5² + (20 − 1) × 4²]/(20 + 20 − 2)) = 4.527

Standardized mean difference = (110 − 100)/4.527 = 2.209
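A minimal computational sketch of this calculation (written in Python for illustration; the function name is ours, not from any study) is:

import math

def standardized_mean_difference(mean1, mean2, sd1, sd2, n1, n2):
    # Pooled standard deviation across the two groups
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / sd_pooled

# Example from the text: means 110 and 100, SDs 5 and 4, 20 observations per group
print(standardized_mean_difference(110, 100, 5, 4, 20, 20))  # approximately 2.209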

Example 2.1
Consider the following data (refer to Table 2.9) consisting of an attribute data class that can have binary values true or false, where true represents that the class is data intensive (the number of declared variables is high) and false represents that the class is not data intensive (the number of declared variables is low). The outcome variable is change, which contains "yes" and "no," where "yes" represents the presence of change and "no" represents the absence of change.
Calculate RR, OR, and risk difference.
Solution
The 2 × 2 contingency table is given in Table 2.10.

Risk 1 = 6/(6 + 2) = 0.75, Risk 2 = 1/(1 + 6) = 0.142

TABLE 2.9
Sample Data
Data Class Change

False No
False No
True Yes
False Yes
True Yes
False No
False No
True Yes
True No
False No
False No
True Yes
True No
True Yes
True Yes

TABLE 2.10
Contingency Table for Example Data Given in Table 2.9
Data Class Change Present Change Not Present Total

True 6 2 8
False 1 6 7
Total 7 8 15

RR = 0.75/0.142 = 5.282

Odds 1 = 6/2 = 3, Odds 2 = 1/6 = 0.17

OR = 3/0.17 = 17.647

Risk difference = 0.75 − 0.142 = 0.608
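These binary-outcome effects can also be scripted. The following sketch (Python, purely illustrative; the function name and return format are our own) computes RR, OR, and risk difference from the four cells of a 2 × 2 contingency table. Applied to Table 2.10 it gives values close to those above; the small differences arise because the worked example rounds the intermediate risks and odds.

def binary_effects(a11, a12, a21, a22):
    # a11, a12: outcome present/absent counts for group 1
    # a21, a22: outcome present/absent counts for group 2
    risk1 = a11 / (a11 + a12)
    risk2 = a21 / (a21 + a22)
    odds1 = a11 / a12
    odds2 = a21 / a22
    return {"RR": risk1 / risk2, "OR": odds1 / odds2, "risk difference": risk1 - risk2}

# Counts from Table 2.10 (data class true/false versus change present/absent)
print(binary_effects(6, 2, 1, 6))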

2.4.2 Forest Plots


The forest plot provides a visual assessment of the estimated effect of each study together with the overall result. These effects may be OR, RR, or AUC in each of the independent studies in the SR or meta-analysis. The confidence interval (CI) of each effect, along with the overall combined effect, is computed from the available data at the 95% level. The effects model can be either fixed or random. The fixed effects model assumes that there is a common effect in the

TABLE 2.11
Results of Five Studies
Study AUC Standard Error 95% CI

Study 1 0.721 0.025 0.672–0.770


Study 2 0.851 0.021 0.810–0.892
Study 3 0.690 0.008 0.674–0.706
Study 4 0.774 0.017 0.741–0.807
Study 5 0.742 0.031 0.681–0.803
Total (fixed effects) 0.722 0.006 0.709–0.734
Total (random effects) 0.755 0.025 0.705–0.805

studies, whereas the random effects model allows the effects to vary across studies. When heterogeneity is found in the effects, the random effects model is preferred.
Table 2.11 presents the AUC computed from the ROC analysis, the standard error, and the upper and lower bounds of the CI. Figure 2.4 depicts the forest plot for the five studies using AUC and standard error. Each line represents one study in the SR. The boxes (black-filled squares) depict the weight assigned to each study. The weight is based on the inverse of the standard error: the smaller the standard error, the more weight is assigned to the study. Hence, in general, weights can be based on the standard error and sample size. The CI of each study is represented by the length of its line. The diamond represents the summary of the combined effect of all the studies, and its edges represent the CI of the overall effect. The results show the presence of heterogeneity; hence, the random effects model is used to analyze the overall accuracy in terms of AUC, with study-level values ranging from 0.69 to 0.85.
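The inverse-variance weighting described above is easy to script once the per-study effects and standard errors have been extracted. The sketch below (Python, illustrative) pools the AUC values of Table 2.11; with these inputs it reproduces, to rounding, the fixed-effects total of 0.722 with standard error 0.006. A random-effects pooling (for example, DerSimonian and Laird's method) would additionally add an estimate of the between-study variance to each study's variance and is not shown here.

import math

def fixed_effect_summary(effects, std_errors):
    # Inverse-variance weights: the smaller the standard error, the larger the weight
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# AUC values and standard errors of the five studies in Table 2.11
aucs = [0.721, 0.851, 0.690, 0.774, 0.742]
ses = [0.025, 0.021, 0.008, 0.017, 0.031]
print(fixed_effect_summary(aucs, ses))  # approximately (0.722, 0.006)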

[Forest plot showing the AUC and CI of Studies 1–5 and the fixed- and random-effects totals; horizontal axis: area under the ROC curve (0.6–0.9).]

FIGURE 2.4
Forest plots.

[Funnel plot of the five studies; horizontal axis: area under the curve (−2.0 to 2.0), vertical axis: standard error (0.00–0.20).]

FIGURE 2.5
Funnel plot.

2.4.3 Publication Bias


Publication bias means that studies with positive results are more likely to be found than studies with negative or inconclusive results. Dickersin et al. found that statistically significant results are about three times more likely to be published than inconclusive results. A major reason for the rejection of a research paper is its inability to produce significant results. The funnel plot depicts the effect on the horizontal axis and a study size measure (generally the standard error) on the vertical axis. The funnel plot can be used to analyze publication bias and is shown in Figure 2.5.
Figure 2.5 presents the plot of effect size against the standard error. If publication bias is not present, the funnel plot will resemble a symmetrical, inverted funnel in which the studies are distributed symmetrically around the combined effect size. In Figure 2.5, the funnel plot is shown for five studies in which AUC represents the effect size. As shown in the funnel plot, all the studies (represented by circles) cluster at the top of the plot, which indicates the presence of publication bias. In such a case, the studies lying in the outlying, asymmetrical part of the funnel plot are analyzed further.
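A funnel plot is straightforward to draw once the effect sizes and standard errors of the primary studies are available. The sketch below (Python with matplotlib, illustrative; the plotted values are hypothetical) places the effect on the horizontal axis and the standard error on the vertical axis, with the vertical axis inverted so that the most precise studies appear at the top:

import matplotlib.pyplot as plt

# Hypothetical effect sizes (e.g., AUC) and standard errors of five studies
effects = [0.72, 0.85, 0.69, 0.77, 0.74]
std_errors = [0.025, 0.021, 0.008, 0.017, 0.031]

plt.scatter(effects, std_errors)
plt.gca().invert_yaxis()              # most precise studies at the top
plt.xlabel("Effect size")
plt.ylabel("Standard error")
plt.title("Funnel plot")
plt.show()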

2.5 Conducting the Review


The review protocol is actually put into practice in this phase; it includes conducting the search, selecting the primary studies (see Figure 2.6), filling the data extraction forms, and synthesizing the data.

2.5.1 Search Strategy Execution


This step involves a comprehensive search for relevant primary studies that meet the search criteria formed from the research questions in the review protocol. The search is performed in the digital libraries identified in the review protocol. The search string may be refined according to the initial results of the search. The studies that are gathered from

[Figure 2.6 outlines the search process: a basic search of the digital portals (IEEE Xplore, ScienceDirect, ACM Digital Library, Wiley Online, Google Scholar, SpringerLink, and Web of Science) yields the initial studies; candidate studies are then selected by applying the inclusion/exclusion criteria to titles, abstracts, or full texts; finally, the primary studies are obtained by applying the quality assessment criteria.]

FIGURE 2.6
Search process.

the reference sections of the relevant papers must also be included. Duplicate copies of the same publication must be removed (a small sketch of such duplicate removal is given after Table 2.12), and the collected publications must be stored in a reference management system, such as Mendeley or JabRef. A list of the journals and conferences in which the primary studies have been published must be created. Table 2.12 shows some popular journals and conferences in software engineering.

TABLE 2.12
Popular Journals and Conferences on Software Engineering
Publication Name Type

IEEE Transactions on Software Engineering Journal


Journal of Systems and Software Journal
Empirical Software Engineering Journal
Information and Software Technology Journal
IEEE International Symposium on Software Reliability Conference
International Conference on Predictor Models in Software Engineering (PROMISE) Conference
International Conference on Software Engineering Conference
Software Quality Journal Journal
Automated Software Engineering Journal
SW Maintenance & Evolution—Research & Practice Journal
Expert Systems with Applications Journal
Software Verification, Validation & Testing Journal
IEEE Software Journal
Software Practice & Experience Journal
IET Software Journal
ACM Transactions on Software Engineering and Methodology Journal
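As noted above, the same publication is often retrieved from more than one digital library, so duplicates must be removed before study selection. A minimal sketch of such duplicate removal (Python, illustrative; reference managers such as Mendeley or JabRef handle this far more robustly) keys each record on a normalized title and keeps the first occurrence:

import re

def remove_duplicates(records):
    # records: list of dictionaries, each with at least a "title" key
    seen = {}
    for record in records:
        key = re.sub(r"[^a-z0-9]", "", record["title"].lower())
        if key not in seen:
            seen[key] = record
    return list(seen.values())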

2.5.2 Selection of Primary Studies


The primary studies are selected based on the established inclusion and exclusion criteria.
The selection criteria may be revised during the process of selection, as all the aspects
are not apparent in the planning phase. It is advised that two or more researchers explore the studies to determine their relevance.
The process must begin with the removal of obviously irrelevant studies. The titles, abstracts, or full texts of the collected studies need to be analyzed to identify the primary studies. In some cases, only the title or abstract may be enough to determine the relevance of the study; however, in other cases, the full text needs to be obtained to determine the relevance. Brereton et al. (2007) observed in their study, “The standard of IT and software engineering abstracts is too poor to rely on when selecting primary studies. You should also review the conclusions.”

2.5.3 Study Quality Assessment


The selected studies are assigned quality scores based on the quality questions framed in Section 2.3.3. On the basis of the final scores, a decision is made on whether or not to retain a study in the final list of relevant studies.
A record of the studies that were considered as candidates for selection but were removed after applying the inclusion/exclusion criteria must be maintained, along with the reasons for rejection.

2.5.4 Data Extraction


After the selection of primary studies, the information from the primary studies is col-
lected in the data extraction forms. The data extraction form was designed during the
planning phase and is based on the research questions. The data extraction forms consist
of numerical values, weaknesses and strengths of techniques used in studies, CIs, and so
on. Brereton et al. (2007) suggested that the following guidelines may be followed during
data extraction:

• When a large number of primary studies is present, two independent reviewers may be used, one as a data collector and the other as a data checker.
• The review protocol and data extraction forms must be clearly understood by the
reviewers.

Table 2.13 shows an example of a data extraction form filled for the SRML case study using the research results given by Dejaeger et al. (2013). A similar form can be made for all the primary studies.
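The fields of such a form can also be captured in a small data structure so that the extracted information is easy to aggregate later. The sketch below (Python, illustrative; the field names simply mirror Table 2.13 and are not a prescribed schema):

from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    reviewer: str
    authors: str
    title: str
    year: int
    venue: str
    study_type: str
    data_sets: list = field(default_factory=list)
    independent_variables: str = ""
    feature_selection: str = ""
    ml_techniques: list = field(default_factory=list)
    performance_measures: list = field(default_factory=list)
    auc_values: dict = field(default_factory=dict)   # data set -> reported AUC values
    strengths: list = field(default_factory=list)
    weaknesses: list = field(default_factory=list)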

2.5.5 Data Synthesis


The tables and charts are used to summarize the results of the SR. The qualitative results
are summarized in tabular form and quantitative results are presented in the form of
tables and plots.
In this section, we summarize some of the results obtained by examining the results
of SRML case study. Each research question given in Table 2.3 should be answered in

TABLE 2.13
Example Data Extraction Form
Section I
Reviewer name Ruchika Malhotra
Author name Karel Dejaeger, Thomas Verbraken, and Bart Baesens
Title of publication Toward Comprehensible Software Fault Prediction Models Using
Bayesian Network Classifiers
Year of publication 2013
Journal/conference name IEEE Transactions on Software Engineering
Type of the study Research paper

Section II
Data set used NASA data sets (JM1, MC1, KC1, PC1, PC2, PC3, PC4, PC5), Eclipse
Independent variables Static code measures (Halstead and McCabe)
Feature subselection method Markov Blanket
ML techniques used Naïve Bayes, Random Forest
Performance measures used AUC, H-measure
Values of accuracy measures (AUC) Data RF NB
JM1 0.74 0.74 0.69 0.69
KC1 0.82 0.8 0.8 0.81
MC1 0.92 0.92 0.81 0.79
PC1 0.84 0.81 0.77 0.85
PC2 0.73 0.66 0.81 0.79
PC3 0.82 0.78 0.77 0.78
PC4 0.93 0.89 0.79 0.8
PC5 0.97 0.97 0.95 0.95
Ecl 2.0a 0.82 0.82 0.8 0.79
Ecl 2.1a 0.75 0.73 0.74 0.74
Ecl 3.0a 0.77 0.77 0.76 0.86
Strengths (Naïve Bayes) It is easy to interpret and construct
Computationally efficient
Weaknesses (Naïve Bayes) Performance of model is dependent on attribute selection
technique used
Unable to discard irrelevant attributes

TABLE 2.14
Distribution of Studies Across ML Techniques
Based on Classification
Method # of Studies Percent

Decision tree 31 47.7


NN 17 26.16
Support vector machine 18 27.7
Bayesian learning 31 47.7
Ensemble learning 12 18.47
Evolutionary algorithm 8 12.31
Rule-based learning 5 7.7
Misc. 16 24.62

[Pie chart: procedural metrics 47%, OO metrics 31%, miscellaneous 15%, hybrid 7%.]

FIGURE 2.7
Primary study distribution according to the metrics used.

the results section by using visual diagrams and tables. For example, Table 2.14 presents the number of studies covering various ML techniques. There are various ML techniques available in the literature, such as decision tree, NNs, support vector machine, and Bayesian learning. The table shows that 31 studies analyzed decision tree techniques, 17 studies analyzed NN techniques, 18 studies examined support vector machines, and so on. Similarly, the software metrics are divided into various categories in the SRML case study: OO, procedural, hybrid, and miscellaneous. Figure 2.7 depicts the percentage of studies examining each category of metrics; for example, 31% of the studies examine OO metrics. The pie chart shows that procedural metrics are the most commonly used, appearing in 47% of the primary studies.
The results are provided for the ML techniques that were assessed in at least 5 of the 64 selected primary studies, using the performance measures that occur most frequently in those studies. The results showed that accuracy, F-measure, precision, recall, and AUC are the most frequently used performance measures in the selected primary studies. Tables 2.15 and 2.16 present the minimum, maximum, mean, median, and SD values for the selected performance measures; the results are shown for the RF and NN techniques (a small computational sketch of such a summary is given after Table 2.16).

TABLE 2.15
Results of RF Technique
RF Accuracy Precision Recall AUC Specificity

Minimum 55.00 59.00 62.00 0.66 64.3


Maximum 93.40 78.90 100.00 1.00 80.7
Mean 75.63 70.63 81.35 0.83 72.5
Median 75.94 71.515 80.25 0.82 72.5
SD 15.66 7.21 12.39 0.09 11.6

TABLE 2.16
Results of NN Technique
MLP Accuracy Precision Recall ROC Specificity

Minimum 64.02 2.20 36.00 0.54 61.60


Maximum 93.44 76.55 98.00 0.95 79.06
Mean 82.23 52.36 69.11 0.78 70.29
Median 83.46 65.29 71.70 0.77 71.11
SD 9.44 27.57 12.84 0.09 5.27
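Summary statistics such as those in Tables 2.15 and 2.16 can be computed with the standard library once the per-study values have been extracted. A minimal sketch (Python, illustrative; the input values below are hypothetical, not the actual per-study results):

import statistics

def summarize(values):
    # Minimum, maximum, mean, median, and standard deviation of study-level results
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "SD": statistics.stdev(values),
    }

# Hypothetical accuracy values of one ML technique across primary studies
print(summarize([55.0, 68.2, 75.9, 81.3, 93.4]))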

2.6 Reporting the Review


The last step in the SR is to prepare a report consisting of the results of the review and to distribute it to the target audience. The results of the SR may be reported in the following forms:

• Journal or conferences
• Technical report
• PhD thesis

Detailed reporting of the results of the SR is important and critical so that academicians can judge the quality of the study. The detailed report consists of the review protocol, inclusion/exclusion criteria, list of primary studies, list of rejected studies, quality scores assigned to the studies, and raw data pertaining to the primary studies, for example, the number of research questions addressed by each primary study. An SR report is generally longer than a normal original study, and journals may not permit publication of a long SR. Hence, the details may be kept in an appendix and stored in electronic form. The details may also be published online in the form of a technical report.
Table 2.17 presents the format and contents of the SR. The table provides the contents
along with its detailed description. The strengths and limitations of the SR must also be
discussed along with the explanation of its effect on the findings.

TABLE 2.17
Format of an SR Report
Section Subsections Description Comments
Title – The title should be short and informative.
Authors – –
Details
Abstract Background What is the relevance and It allows the researchers to gain insight about the
importance of the SR? importance, addressed areas, and main findings
Method What are the tools and techniques of the study.
used to perform the SR?
Results What are the major findings
obtained by the SR?
Conclusions What are the main implications
of the results and guidelines for
the future research?
Introduction What is the motivation and It will provide the justification for the need of the
need of the SR? SR. It also presents the overview of an existing SR.
Method Research What are the areas to be The review methods must be based on the
Questions addressed during the SR? review protocol.
This is the most important part of the SR.
Search What are the relevant studies It identifies the initial list of relevant studies using
Strategy found during the SR? the keywords and searching the digital portals.
Study What is the inclusion/ It describes the criteria for including and
Selection exclusion criterion for excluding the studies in the SR.
selecting the studies?
Quality What are the quality The rejected studies along with the reason of the
Assessment assessment questions that rejection need to be maintained.
Criteria need to be evaluated?
How will the scores be
assigned to the studies?
Which studies have been
rejected?
Data What should be the format of The data extraction forms are used to summarize
Extraction the data extraction forms? the information from the primary studies.
Data Which tools are used to present The tools and techniques used to summarize the
Synthesis the results of the analysis? results of the research are presented in this section.
Results Description What are the primary sources It summarizes the description of the primary
of Primary of the selected primary studies.
Studies studies?
Answers to What are the findings of the It presents the detailed findings of the SR by
Research areas to be explored? addressing the research questions.
Questions Qualitative findings of the research are
summarized in tabular form and quantitative
findings are depicted through tables and plots.
Discussions What are the applications and It provides the similarities and differences in the
meaning of the findings? results of the primary studies so that the results
can be generalized.
It discusses the risks and effects of the
summarized studies.
The main strengths and weaknesses of the
techniques used in the primary studies are
summarized in this section.
Threats to What are the threats to the The main limitations of the SR are presented in
Validity validity of the results? this section.
Conclusions Summary of What are the implications It summarizes the main findings and its
Current of the findings for the implications for the practitioners.
Trends researchers and
practitioners?
Future What are the guidelines for
Directions future research?
References – It provides references to the primary studies,
rejected studies, and referred studies.
Appendix – The appendix can present the quality scores
assigned to each primary study and the number
of research questions addressed by each study.

2.7 SRs in Software Engineering


There are many SRs conducted in software engineering. Table 2.18 summarizes a few of them with author details, year, review topics, the number of studies reviewed (study size), whether quality assessment of the studies was performed (QA used), data synthesis methods, and conclusions.

TABLE 2.18
Systematic Reviews in Software Engineering
Authors Year Research Topics Study Size QA Used Data Synthesis Methods Conclusions
Kitchenham 2007 Cost estimation 10 Yes Tables • Strict quality control on
et al. models, data collection is not
cross-company sufficient to ensure that a
data, within- cross-company model
company data performs as well as a
within-company model.
• Studies where within-
company predictions were
better than cross-company
predictions employed
smaller within-company
data sets, smaller number
of projects in the cross-
company models, and
smaller databases.
Jørgensen and 2007 Cost estimation 304 No Tables • Increase the breadth of the
Shepperd search for relevant studies.
• Search manually for
relevant papers within a
carefully selected set of
journals.
• Conduct more studies on
the estimation methods
commonly used by the
software industry.
• Increase the awareness of
how properties of the data
sets impact the results
when evaluating
estimation methods.
Stol et al. 2009 Open source 63 No Pie charts, • Most research is done on
software bar OSS communities.
(OSS)–related charts, • Most studies investigate
empirical tables projects in the “system”
research and “internet” categories.
• Among research methods
used, case study, survey,
and quantitative analysis
are the most popular.
Riaz et al. 2009 Software 15 Yes Tables • Maintainability prediction
maintainability models are based on
prediction algorithmic techniques.
• Most commonly used
predictors are based on
size, complexity, and
coupling.
• Prediction techniques,
accuracy measures, and
cross-validation methods
are not much used for
validating prediction
models.
• Most commonly used
maintainability metric
employed an ordinal scale
and is based on expert
judgment.
Hauge et al. 2010 OSS, 112 No Bar charts, • Practitioners should use
organizations tables the opportunities offered
by OSS.
• Researchers should
conduct more empirical
research on the topics
important to
organizations.
Afzal et al. 2009 Search-based 35 Yes Tables, • Meta-heuristic search
software testing, figures techniques (including
meta-heuristics simulated annealing, tabu
search, genetic algorithms,
ant colony methods,
grammatical evolution,
genetic programming, and
swarm intelligence
methods) are applied for
nonfunctional testing of
execution time, quality of
service, security, usability,
and safety.
Wen et al. 2012 Effort estimation, 84 Yes Narrative • Models predicted using
machine learning synthesis, ML methods is close to
tables, acceptable level.
pie • Accuracy of ML models is
charts, better than non-ML
box plots models.
• Case-based reasoning and
artificial NN methods are
more accurate than
decision trees.
Catal 2011 Fault prediction, 90 No Theoretical • Most of the studies used
machine method-level metrics.
learning, and • Most studies used ML
statistical-based techniques.
approaches • Naïve Bayes is a robust
machine-learning
algorithm.
Radjenović et al. 2013 Fault prediction, 106 Yes Line chart, • OO metrics were used
software metrics bubble nearly twice as often as
chart traditional source code
metrics and process
metrics.
• OO metrics predict better
models as compared to
size and complexity
metrics.
Ding et al. 2014 Software 60 Yes Tables, line • Knowledge capture and
documentation, graph, bar representation is the
knowledge- charts, widely used approach in
based approach bubble software documentation.
chart • Knowledge retrieval and
knowledge recovery
approaches are useful but
still need to be evaluated.
Malhotra 2015 Fault prediction, 64 Yes Tables, line • ML techniques show
ML technique charts, bar acceptable prediction
charts capability for estimating
software fault proneness.
• ML techniques outperformed the logistic regression technique for software fault prediction models.
• Random forest was
superior as compared to all
the other ML techniques

Exercises
2.1 What is an SR? Why do we need to perform an SR?
2.2 a. Discuss the advantages of SRs.
b. Differentiate between a survey and an SR.
2.3 Explain the characteristics and importance of SRs.

TABLE 2.12.1
Contingency Table from Study on change
Prediction
Change Not Change
Prone Prone Total

Coupled 14 12 26
Not coupled 16 22 38
Total 30 34 64

2.4 a. What are the search strategies available for selecting primary studies? How will
you select the digital portals for searching primary studies?
b. What are the criteria for forming a search string?
2.5 What are the criteria for determining the number of researchers for conducting the same steps in an SR?
2.6 What is the purpose of quality assessment criteria? How will you construct the
quality assessment questions?
2.7 Why is identification of the need for an SR considered the most important step in planning the review?
2.8 How will you decide on the tools and techniques to be used during the data
synthesis?
2.9 What is publication bias? Explain the purpose of funnel plots in detecting publication bias.
2.10 Explain the steps in SRs with the help of an example case study.
2.11 Define the following terms:
a. RR
b. OR
c. Risk difference
d. Standardized mean difference
e. Mean difference
2.12 Given the contingency table for all classes that are coupled or not coupled in a
software with respect to a dichotomous variable change proneness, calculate the
RR, OR, and risk difference (Table 2.12.1).

Further Readings
A classic study that describes empirical results in software engineering is given by:

L. M. Pickard, B. A. Kitchenham, and P. W. Jones, “Combining empirical results in soft-


ware engineering,” Information and Software Technology, vol. 40, no. 14, pp. 811–821,
1998.

A detailed survey that summarizes approaches that mine software repositories in the
context of software evolution is given in:

H. Kagdi, M. L. Collard, and J. I. Maletic, “A survey and taxonomy of approaches for


mining software repositories in the context of software evolution,” Journal of Software Maintenance and Evolution: Research and Practice, vol. 19, no. 2, pp. 77–131, 2007.

The guidelines for preparing the review protocols are given in:

“Guidelines for preparation of review protocols,” The Campbell Collaboration, http://


www.campbellcollaboration.org.

A review on the research synthesis performed in SRs is given in:

D. S. Cruzes, and T. Dybå, “Research synthesis in software engineering: A tertiary


study,” Information and Software Technology, vol. 53, no. 5, pp. 440–455, 2011.

For details on meta-analysis, see the following publications:

M. Borenstein, L. V. Hedges, J. P. T. Higgins, and H. R. Rothstein, Introduction to Meta-


Analysis, Wiley, Chichester, 2009.
R. DerSimonian, and N. Laird, “Meta-analysis in clinical trials,” Controlled Clinical
Trials, vol. 7, no. 3, pp. 177–188, 1986.
J. P. T. Higgins, and S. Green, Cochrane Handbook for Systematic Reviews of Interventions
Version 5.1.0, The Cochrane Collaboration, 2011. Available from www.cochrane-
handbook.org.
J. P. Higgins, S. G. Thompson, J. J. Deeks, and D. G. Altman, “Measuring inconsis-
tency in meta-analyses,” British Medical Journal, vol. 327, no. 7414, pp. 557–560, 2003.
N. Mantel, and W. Haenszel, “Statistical aspects of the analysis of data from retrospective studies of disease,” Journal of the National Cancer Institute, vol. 22, no. 4, pp. 719–748, 1959.
A. Petrie, J. S. Bulman, and J. F. Osborn, “Further statistics in dentistry. Part 8:
Systematic reviews and meta-analyses,” British Dental Journal, vol. 194, no. 2,
pp. 73–78, 2003.
K. Ried, “Interpreting and understanding meta-analysis graphs: A practical guide,”
Australian Family Physician, vol. 35, no. 8, pp. 635–638, 2006.

For further understanding on forest and funnel plots, see the following publications:

J. Anzures-Cabrera, and J. P. T. Higgins, “Graphical displays for meta-analysis: An


overview with suggestions for practice,” Research Synthesis Methods, vol. 1, no. 1,
pp. 66–80, 2010.
A. G. Lalkhen, and A. McCluskey, “Statistics V: Introduction to clinical trials and
systematic reviews,” Continuing Education in Anaesthesia, Critical Care and Pain,
vol. 18, no. 4, pp. 143–146, 2008.
R. J. Light, and D. B. Pillemer, Summing Up: The Science of Reviewing Research, Harvard
University Press, Cambridge, 1984.

J. L. Neyeloff, S. C. Fuchs, and L. B. Moreira, “Meta-analyses and Forest plots using a Microsoft excel spreadsheet: Step-by-step guide focusing on descriptive data analysis,” BMC Research Notes, vol. 5, no. 52, pp. 1–6, 2012.

An effective meta-analysis of a number of high-quality defect prediction studies is


provided in:

M. Shepperd, D. Bowes, and T. Hall, “Researcher bias: The use of machine learning
in software defect prediction,” IEEE Transactions on Software Engineering, vol. 40,
no. 6, pp. 603–616, 2014.
3
Software Metrics

Software metrics are used to assess the quality of the product or process used to build it.
The metrics allow project managers to gain insight about the progress of software and
assess the quality of the various artifacts produced during software development. The
software analysts can check whether the requirements are verifiable or not. The metrics
allow management to obtain an estimate of cost and time for software development. The
metrics can also be used to measure customer satisfaction. The software testers can mea-
sure the faults corrected in the system, and this decides when to stop testing.
Hence, the software metrics are required to capture various software attributes at differ-
ent phases of the software development. Object-oriented (OO) concepts such as coupling,
cohesion, inheritance, and polymorphism can be measured using software metrics. In this
chapter, we describe the measurement basics, software quality metrics, OO metrics, and
dynamic metrics. We also provide practical applications of metrics so that good-quality
systems can be developed.

3.1 Introduction
Software metrics can be used to adequately measure various elements of the software
development life cycle. The metrics can be used to provide feedback on a process or tech-
nique so that better or improved strategies can be developed for future projects. The qual-
ity of the software can be improved using the measurements collected by analyzing and
assessing the processes and techniques being used.
The metrics can be used to answer the following questions during software development:

1. What is the size of the program?


2. What is the estimated cost and duration of the software?
3. Is the requirement testable?
4. When is the right time to stop testing?
5. What is the effort expended during maintenance phase?
6. How many defects have been corrected that are reported during maintenance
phase?
7. How many defects have been detected using a given activity such as inspections?
8. What is the complexity of a given module?
9. What is the estimated cost of correcting a given defect?
10. Which technique or process is more effective than the other?
11. What is the productivity of persons working on a project?
12. Is there any requirement to improve a given process, method, or technique?


The above questions can be addressed by gathering information using metrics. The infor-
mation will allow software developer, project manager, or management to assess, improve,
and control software processes and products during the software development life cycle.

3.1.1 What Are Software Metrics?


Software metrics are used for monitoring and improving various processes and products
in software engineering. The rationale arises from the notion that “you cannot control
what you cannot measure” (DeMarco 1982). The most essential and critical issues involved
in monitoring and controlling various artifacts during software development can be
addressed by using software metrics. Goodman (1993) defined software metrics as:

The continuous application of measurement based techniques to the software development


process and its products to supply meaningful and timely management information,
together with the use of those techniques to improve that process and its products.

The above definition provides all the relevant details. Software metrics should be collected
from the initial phases of software development to measure the cost, size, and effort of the
project. Software metrics can be used to ascertain and monitor the progress of the soft-
ware throughout the software development life cycle.

3.1.2 Application Areas of Metrics


Software metrics can be used in various domains. One of the key applications of software
metrics is estimation of cost and effort. The cost and effort estimation models can be derived
using the historical data and can be applied in the early phases of software development.
Software metrics can be used to measure the effectiveness of various activities or pro-
cesses such as inspections and audits. For example, the project managers can use the num-
ber of defects detected by inspection technique to assess the effectiveness of the technique.
The processes can be improved and controlled by analyzing the values of metrics. The
graphs and reports provide indications to the software developers and they can decide in
which direction to move.
Various software constructs such as size, coupling, cohesion, or inheritance can be mea-
sured using software metrics. The alarming values (thresholds) of the software metrics can be computed, and based on these values, the required corrective actions can be taken by the software developers to improve the quality of the software.
One of the most important areas of application of software metrics is the prediction of
software quality attributes. There are many quality attributes proposed in the literature
such as maintainability, testability, usability, and reliability. The benefits of developing
the quality models is that they can be used by software developers, project managers, and
management personnel in the early phases of software development for resource alloca-
tion and identification of problematic areas.
Testing metrics can be used to measure the effectiveness of the test suite. These metrics
include the number of statements, percentage of statement coverage, number of paths cov-
ered in a program graph, number of independent paths in a program graph, and percent-
age of branches covered.
Software metrics can also be used to provide meaningful and timely information to
the management. The software quality, process efficiency, and people productivity can
be computed using the metrics. Hence, this information will help the management in

making effective decisions. The effective application of metrics can improve the quality
of the software and produce software within the budget and on time. The contributions of
software metrics in building good-quality system are provided in Section 3.9.1.

3.1.3 Characteristics of Software Metrics


A metric is only relevant if it is easily understood, calculated, valid, and economical:

1. Quantitative: The metrics should be expressible in values.


2. Understandable: The way of computing the metric must be easy to understand.
3. Validatable: The metric should capture the same attribute that it is designed for.
4. Economical: It should be economical to measure a metric.
5. Repeatable: The values should be the same if measured repeatedly, that is, they can be consistently reproduced.
6. Language independent: The metrics should not depend on any language.
7. Applicability: The metric should be applicable in the early phases of software
development.
8. Comparable: The metric should correlate with another metric capturing the same
feature or concept.

3.2 Measurement Basics


Software metrics should preserve the empirical relations corresponding to numerical rela-
tions for real-life entities. For example, for “taller than” empirical relation, “>” would be an
appropriate numeric relation. Figure 3.1 shows the steps of defining measures. In the first
step, the characteristics for representing real-life entities should be identified. In the
third step, the empirical relations for these characteristics are identified. The third step

[Figure 3.1 shows five steps in sequence: identify characteristics for real-life entities; identify empirical relations for the characteristics; determine numerical relations for the empirical relations; map real-world entities to numbers; check whether the numeric relations preserve the empirical relations.]

FIGURE 3.1
Steps in software measurement.

determines the numerical relations corresponding to the empirical relations. In the next step, real-world entities are mapped to numbers, and in the last step, we determine whether the numeric relations preserve the empirical relations.

3.2.1 Product and Process Metrics


The entities in software engineering can be divided into two different categories:

1. Process: The process is defined as the way in which the product is developed.
2. Product: The final outcome of following a given process or a set of processes is
known as a product. The product includes documents, source codes, or artifacts
that are produced during the software development life cycle.

The process uses the product produced by an activity, and a process produces products that
can be used by another activity. For example, the software design document is an artifact
produced from the design phase, and it serves as an input to the implementation phase. The
effectiveness of the processes followed during software development is measured using the
process metrics. The metrics related to products are known as product metrics. The effi-
ciency of the products is measured using the product metrics.
The process metrics can be used to

1. Measure the cost and duration of an activity.


2. Measure the effectiveness of a process.
3. Compare the performance of various processes.
4. Improve the processes and guide the selection of future processes.

For example, the effectiveness of the inspection activity can be measured by computing the costs and resources spent on it and the number of defects detected during the inspection activity. By assessing whether the benefit of the faults found outweighs the costs incurred during the inspection activity, the project managers can decide about the effectiveness of the inspection activity.
The product metrics are used to measure the effectiveness of deliverables produced dur-
ing the software development life cycle. For example, size, cost, and effort of the deliver-
ables can be measured. Similarly, documents produced during the software development
(SRS, test plans, user guides) can be assessed for readability, usability, understandability,
and maintainability.
The process and product metrics can further be classified as internal or external attributes. The internal attributes concern the internal structure of the process or product. The common internal attributes are size, coupling, and complexity. The external attributes concern the behavioral aspects of the process or product. External attributes such as testability, understandability, maintainability, and reliability can be measured using the process or product metrics.
The difference between attributes and metrics is that metrics are used to measure a
given attribute. For example, size is an attribute that can be measured through lines of
source code (LOC) metric.
The internal attributes of a process or product can be measured without executing the
source code. For instance, the examples of internal attributes are number of paths, number
of branches, coupling, and cohesion. External attributes include quality attributes of the
system. They can be measured by executing the source code such as the number of failures,

[Figure 3.2 is a tree: software metrics are divided into process and product metrics, and each of these into internal and external attributes. Examples: process internal attributes (failure rate found in reviews, number of issues), process external attributes (effectiveness of a method), product internal attributes (size, inheritance, coupling), and product external attributes (reliability, maintainability, usability).]

FIGURE 3.2
Categories of software metrics.

response time, and navigation easiness of an item. Figure 3.2 presents the categories of
software metrics with examples at the lowest level in the hierarchy.

3.2.2 Measurement Scale


The data can be classified into two types—metric (continuous) and nonmetric (categorical).
Metric data is of continuous type and represents the amount or magnitude of a given entity, for example, the number of faults in a class or the number of LOC added or deleted during the maintenance phase. Table 3.1 shows the LOC added and deleted for the classes A, B, and C.
Nonmetric data is of discrete or categorical type and is represented in the form of categories or classes. For example, weather is sunny, cloudy, or rainy. Metric data can be measured on an interval, ratio, or absolute scale. The interval scale is used when the interpretation of the difference between values is the same; for example, the difference between 40°C and 50°C is the same as that between 70°C and 80°C. On an interval scale, one value cannot be represented as a multiple of another value, as the scale does not have an absolute (true) zero point. For example, a temperature of 20°C cannot be said to be twice as hot as 10°C; on the Fahrenheit scale, 10°C is 50°F and 20°C is 68°F. Hence, ratios cannot be computed on measures with an interval scale.
Ratio scales provide more precision as they have absolute zero points and one value can be expressed as a multiple of another. For example, an object A with weight 200 pounds is twice as

TABLE 3.1
Example of Metrics Having Continuous Scale
Class# LOC Added LOC Deleted

A 34 5
B 42 10
C 17 9

heavy as an object B with weight 100 pounds. Simple counts are represented by the absolute scale; examples of simple counts are the number of faults, LOC, and the number of methods. For the absolute type of scale, descriptive statistics such as mean, median, and standard deviation can be applied to summarize the data.
Nonmetric data can be measured on nominal or ordinal scales. The nominal scale divides a metric into classes, categories, or levels without considering any order or rank between these classes. For example, change is either present or not present in a class:

Change = {0, no change present; 1, change present}

Another example of a nominal scale is programming language, used as a label for different categories. On an ordinal scale, one category can be compared with another category in terms of a "higher than," "greater than," or "lower than" relationship. For example, the overall navigational capability of a web page can be ranked into various categories as shown below:

What is the overall navigational capability of a webpage? = {1, excellent; 2, good; 3, medium; 4, bad; 5, worst}

Table 3.2 summarizes the differences between measurement scales with examples.

TABLE 3.2
Summary of Measurement Scales
Scale: Interval. Characteristics: =, <, >; ratios not allowed; arbitrary zero point. Statistics: mode, mean, median, interquartile range, variance, standard deviation. Operations: addition and subtraction. Transformation: M = xM′ + y. Examples: temperatures, date, and time.
Scale: Ratio. Characteristics: absolute zero point. Statistics: mode, mean, median, interquartile range, variance, standard deviation. Operations: all arithmetic operations. Transformation: M = xM′. Examples: weight, height, and length.
Scale: Absolute. Characteristics: simple count values. Statistics: mode, mean, median, interquartile range, variance, standard deviation. Operations: all arithmetic operations. Transformation: M = M′. Examples: LOC.
Scale: Nominal. Characteristics: order not considered. Statistics: frequencies. Operations: none. Transformation: one-to-one mapping. Examples: fault proneness (0—not present, 1—present).
Scale: Ordinal. Characteristics: order or rank considered; monotonic increasing function (=, <, >). Statistics: mode, median, interquartile range. Operations: none. Transformation: increasing function M(x) > M(y). Examples: programmer capability levels (high, medium, low), severity levels (critical, high, medium, low).

Example 3.1
Consider the count of number of faults detected during inspection activity:
1. What is the measurement scale for this definition?
2. What is the measurement scale if number of faults is classified between 1 and
5, where 1 means very high, 2 means high, 3 means medium, 4 means low, and
5 means very low?

Solution:
1. The measurement scale of the number of faults is absolute as it is a simple count
of values.
2. Now, the measurement scale is ordinal since the variable has been converted
to be categorical (consists of classes), involving ranking or ordering among
categories.

3.3 Measuring Size


The purpose of size metrics is to measure the size of the software that can be taken as
input by the empirical models to further estimate the cost and effort during the software
development life cycle. Hence, the measurement of size is very important and crucial to
the success of the project. The LOC metric is the most popular size metric used in the
literature for estimation and prediction purposes during the software development. The
LOC metric can be counted in various ways. The source code consists of executable lines
and nonexecutable lines in the form of blank and comment lines. The comment lines are used to increase the understandability and readability of the source code.
The researchers may measure only the executable lines, whereas some may like to mea-
sure the LOC with comment lines to analyze the understandability of the software. Hence,
the researcher must be careful while selecting the method for counting LOC. Consider the function to find the greatest among three numbers given in Figure 3.3.
The function “find_maximum” in Figure 3.3 consists of 20 LOC if we simply count every line.
Most researchers and programmers exclude blank lines and comment lines as these
lines do not consume any effort and only give the illusion of high productivity of the
staff that is measured in terms of LOC/person month (LOC/PM). The LOC count for the function shown in Figure 3.3 is 16 and is computed after excluding the blank and comment lines. The value is computed following the definition of LOC given by Conte
et al. (1986):

A line of code is any line of program text that is not a comment or blank line, regardless
of the number of statements or fragments of statements on the line. This specifically
includes all lines containing program headers, declarations, and executable and non-
executable statements.
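A simple counter that follows this definition, ignoring blank lines and comment lines, can be written in a few lines. The sketch below (Python, illustrative; the file name is hypothetical) handles only whole-line // and /* ... */ comments, which is enough for the function in Figure 3.3; a production counter would need a proper parser:

def count_loc(lines):
    # Count lines of code, skipping blank lines and whole-line comments
    loc = 0
    in_block_comment = False
    for line in lines:
        stripped = line.strip()
        if in_block_comment:
            if stripped.endswith("*/"):
                in_block_comment = False
            continue
        if not stripped or stripped.startswith("//"):
            continue
        if stripped.startswith("/*"):
            if not stripped.endswith("*/"):
                in_block_comment = True
            continue
        loc += 1
    return loc

with open("find_maximum.c") as source:        # hypothetical file holding Figure 3.3
    print(count_loc(source.readlines()))      # 16 for the function in Figure 3.3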

In OO software development, the size of software can be calculated in terms of classes and
the attributes and functions included in the classes. The details of OO size metrics can be
found in Section 3.5.6.

/* This function finds the greatest among three numbers */
int find_maximum(int i, int j, int k)
{
    int max;
    /* compute the greatest */
    if (i > j)
    {
        if (i > k)
            max = i;
        else
            max = k;
    }
    else if (j > k)
        max = j;
    else
        max = k;

    /* return the greatest number */
    return (max);
}

FIGURE 3.3
Operation to find greatest among three numbers.

3.4 Measuring Software Quality


Maintaining software quality is an essential part of the software development and thus all
aspects of software quality should be measured. Measuring quality attributes will guide
the software professionals about the quality of the software. Software quality must be
measured throughout the software development life cycle phases.

3.4.1 Software Quality Metrics Based on Defects


A defect is defined by IEEE/ANSI as “an accidental condition that causes a unit of the system to fail to function as required” (IEEE/ANSI Standard 982.2). A failure occurs when a fault executes, and more than one failure may be associated with a given fault. The defect-based metrics can be classified at the product and process levels. The difference between the terms fault and defect is unclear from these definitions; in practice, the difference is not significant, and the terms are used interchangeably. The commonly used product metrics for measuring defects are defect density and defect rate. In the subsequent chapters, we will use the terms fault and defect interchangeably.

3.4.1.1 Defect Density


Defect density metric can be defined as the ratio of the number of defects to the size of the
software. Size of the software is usually measured in terms of thousands of lines of code
(KLOC) and is given as:

Defect density = Number of defects/KLOC

The number of defects measure counts the defects detected during testing or by using any
verification technique.
Defect rate can be measured as the defects encountered over a period of time, for instance
per month. The defect rate may be useful in predicting the cost and resources that will be
utilized in the maintenance phase of software development. Defect density during testing
is another effective metric that can be used during formal testing. It measures the defect
density during the formal testing after completion of the source code and addition of the
source code to the software library. If the value of defect density metric during testing is
high, then the tester should ask the following questions:

1. Is the software well designed and developed?

2. Is the testing technique effective in defect detection?

If the high number of defects is because the software is not well designed or developed, then the software should be thoroughly tested so that the defects can be detected and removed. However, if it is because the testing technique is highly effective, it implies that the quality of the system is good, as fewer defects will remain undetected.

3.4.1.2 Phase-Based Defect Density


It is an extension of the defect density metric where, instead of calculating defect density at the system level, it is calculated at various phases of the software development life cycle, including verification activities such as reviews, walkthroughs, inspections, and audits carried out before the validation testing begins. This metric provides an insight into the procedures and standards being used during the software development. Some organizations even set “alarming values” for these metrics so that the quality of the software can be assessed and monitored, and appropriate remedial actions can be taken.

3.4.1.3 Defect Removal Effectiveness


Defect removal effectiveness (DRE) is defined as:

DRE = Defects removed in a given life cycle phase/Latent defects

For a given phase in the software development life cycle, the number of latent defects is not known. Thus, it is estimated as the sum of the defects removed during the phase and the defects detected later. The higher the value of DRE, the more efficient and effective is the process followed in a particular phase. The ideal value of DRE is 1. The DRE of a product can also be calculated as:

DRE = DB/(DB + DA)

where:
DB depicts the defects encountered before software delivery
DA depicts the defects encountered after software delivery
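Both defect density and DRE are simple ratios over the defect counts collected during development. A minimal sketch (Python, illustrative; the example values are hypothetical):

def defect_density(defects, kloc):
    # Defects per thousand lines of code
    return defects / kloc

def defect_removal_effectiveness(defects_before_delivery, defects_after_delivery):
    # DRE = DB / (DB + DA)
    return defects_before_delivery / (defects_before_delivery + defects_after_delivery)

print(defect_density(120, 40))                    # 3.0 defects per KLOC
print(defect_removal_effectiveness(80, 20))       # 0.8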

3.4.2 Usability Metrics


Usability covers the ease of use, user-friendliness, learnability, and user satisfaction of a given software. Bevan (1995) used the MUSIC project to measure usability attributes. A number of performance measures were proposed in this project, and the metrics are defined on the basis of these measures. Task effectiveness is defined as follows:

Task effectiveness = 1/100 × (quantity × quality)%
where:
Quantity is defined as the amount of task completed by a user
Quality is defined as the degree to which the output produced by the user satisfies the
targets of a given task

Quantity and quality measures are expressed in percentages. For example, consider a
problem of proofreading an eight-page document. Quantity is defined as the percentage of
proofread words, and quality is defined as the percentage of the correctly proofread docu-
ment. Suppose quantity is 90% and quality is 70%, then task effectiveness is 63%.
The other measures of usability defined in the MUSIC project are (Bevan 1995):

Temporal efficiency = Effectiveness/Task time

Productive period = [(Task time − Unproductive time)/Task time] × 100

Relative user efficiency = (User efficiency/Expert efficiency) × 100
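These usability measures are again simple ratios, as the sketch below shows (Python, illustrative); the last line reproduces the proofreading example, where a quantity of 90% and a quality of 70% give a task effectiveness of 63%:

def task_effectiveness(quantity, quality):
    # quantity and quality are percentages; the result is a percentage
    return (quantity * quality) / 100

def temporal_efficiency(effectiveness, task_time):
    return effectiveness / task_time

def productive_period(task_time, unproductive_time):
    return (task_time - unproductive_time) / task_time * 100

def relative_user_efficiency(user_efficiency, expert_efficiency):
    return user_efficiency / expert_efficiency * 100

print(task_effectiveness(90, 70))  # 63.0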

There are various measures that can be used to measure the usability aspect of the system
and are defined below:

1. Time for learning a system


2. Productivity increase by using the system
3. Response time

In testing web-based applications, usability can be measured by conducting a questionnaire-based survey to measure customer satisfaction. A domain expert must develop the questionnaire, and the sample size should be sufficient to build confidence in the survey results. The results are rated on a scale; for example, the difficulty level of the following questions may be measured in terms of very easy, easy, difficult, and very difficult. The following questions may be asked in the survey:

• How the user is able to easily learn the interface paths in a webpage?
• Are the interface titles understandable?
• Whether the topics can be found in the ‘help’ easily or not?

The charts, such as bar charts, pie charts, scatter plots, and line charts, can be used to
depict and assess the satisfaction level of the customer. The satisfaction level of the cus-
tomer must be continuously monitored over time.

3.4.3 Testing Metrics


Testing metrics are used to capture the progress and level of testing for a given software.
The amount of testing done is measured by using the test coverage metrics. These metrics
can be used to measure the various levels of coverage, such as statement, path, condition,
and branch, and are given below:

1. The percentage of statements covered while testing is defined by the statement
   coverage metric.
2. The percentage of branches covered while testing the source code is defined by the
   branch coverage metric.
3. The percentage of operations covered while testing the source code is defined by
   the operation coverage metric.
4. The percentage of conditions covered (both for true and false) is evaluated using
   the condition coverage metric.
5. The percentage of paths covered in a control flow graph is evaluated using the
   path coverage metric.
6. The percentage of loops covered while testing a program is evaluated using the
   loop coverage metric.
7. All the possible combinations of conditions are covered by the multiple condition
   coverage metric.

NASA developed a test focus (TF) metric defined as the ratio of the amount of effort spent
in finding and removing "real" faults in the software to the total number of faults reported
in the software. The TF metric is given as (Stark et al. 1992):

	TF = Number of STRs fixed and closed / Total number of STRs

where:
	STR is a software trouble report

The fault coverage metric (FCM) is given as:

	FCM = (Number of faults addressed × severity of faults) / (Total number of faults × severity of faults)
Some of the basic process metrics used to measure testing are given below:

1. Number of test cases designed


2. Number of test cases executed
3. Number of test cases passed
4. Number of test cases failed
5. Test case execution time
6. Total execution time
7. Time spent for the development of a test case
8. Testing effort
9. Total time spent for the development of test cases

On the basis of the above direct measures, the following additional testing-related metrics
can be computed to derive more useful information from the basic metrics (a small
computational sketch is given after the list).

1. Percentage of test cases executed


2. Percentage of test cases passed
3. Percentage of test cases failed
4. Actual execution time of a test case/estimated execution time of a test case
5. Average execution time of a test case
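The sketch below (illustrative; the counts are hypothetical) derives some of these metrics from the basic test-execution counts.

#include <iostream>

// Derived testing metrics computed from basic process measures (hypothetical values).
int main() {
    int designed = 120, executed = 100, passed = 85, failed = 15;
    double totalExecutionTime = 250.0;   // minutes spent executing all test cases
    double estimatedTimePerCase = 3.0;   // minutes per test case, as planned
    double actualTimePerCase = totalExecutionTime / executed;

    std::cout << "Percentage of test cases executed: " << 100.0 * executed / designed << "%\n"
              << "Percentage of test cases passed: "   << 100.0 * passed / executed << "%\n"
              << "Percentage of test cases failed: "   << 100.0 * failed / executed << "%\n"
              << "Actual/estimated execution time per test case: "
              << actualTimePerCase / estimatedTimePerCase << "\n"
              << "Average execution time of a test case: " << actualTimePerCase << " minutes\n";
    return 0;
}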

3.5 OO Metrics
Because of the growing size and complexity of software systems in the market, OO analysis
and design principles are being used by organizations to produce better designed, high-
quality, and maintainable software. As systems are being developed using OO software
engineering principles, the need for measuring various OO constructs is increasing.
Features of OO paradigm (programming languages, tools, methods, and processes) pro-
vide support for many quality attributes. The key concepts of OO paradigm are: classes,
objects, attributes, methods, modularity, encapsulation, inheritance, and polymorphism
(Malhotra 2009). An object is made up of three basic components: an identity, a state, and a
behavior (Booch 1994). The identity distinguishes two objects with same state and behav-
ior. The state of the object represents the different possible internal conditions that the
object may experience during its lifetime. The behavior of the object is the way the object
will respond to a set of received messages.
A class is a template consisting of a number of attributes and methods. Every object
is the instance of a class. The attributes in a class define the possible states in which an
instance of that class may be. The behavior of an object depends on the class methods and
the state of the object as methods may respond differently to input messages depending on
the current state. Attributes and methods are said to be encapsulated into a single entity.
Encapsulation and data hiding are key features of OO languages.
The main advantage of encapsulation is that the values of attributes remain private,
unless the methods are written to pass that information outside of the object. The internal
working of each object is decoupled from the other parts of the software thus achieving
modularity. Once a class has been written and tested, it can be distributed to other pro-
grammers for reuse in their own software. This is known as reusability. The objects can
be maintained separately leading to easier location and fixation of errors. This process is
called maintainability.
The most powerful technique associated with OO methods is the inheritance relationship.
If a class B is derived from a class A, then class A is said to be a base (or super) class and
class B is said to be a derived (or sub) class. A derived class inherits all the behavior of its
base class and is allowed to add its own behavior.
Polymorphism (another useful OO concept) describes multiple possible states for a
single property. Polymorphism allows programs to be written based only on the abstract
interfaces of the objects, which will be manipulated. This means that future extension
in the form of new types of objects is easy, if the new objects conform to the original
interface.

Nowadays, software organizations are focusing on software process improvement. This
demand has led to new and improved approaches in the software development area, with
perhaps the most promising being the OO approach. The earlier software metrics (Halstead,
McCabe, LOC) were aimed at procedure-oriented languages, whereas the OO paradigm
introduces new concepts. Therefore, a number of OO metrics that capture the key concepts
of the OO paradigm have been proposed in the literature over the last two decades.

3.5.1 Popular OO Metric Suites


There are a number of OO metric suites proposed in the literature. These metric suites are
summarized below. Chidamber and Kemerer (1994) defined a suite of six popular metrics.
This suite has received widest attention for predicting quality attributes in literature. The
metrics summary along with the construct they are capturing is provided in Table 3.3.
Li and Henry (1993) assessed the Chidamber and Kemerer metrics given in Table 3.3 and
provided a metric suite given in Table 3.4.
Bieman and Kang (1995) proposed two cohesion metrics loose class cohesion (LCC) and
tight class cohesion (TCC).
Lorenz and Kidd (1994) proposed a suite of 11 metrics. These metrics address size, cou-
pling, inheritance, and so on and are summarized in Table 3.5.
Briand et al. (1997) proposed a suite of 18 coupling metrics. These metrics are summa-
rized in Table 3.6. Similarly, Tegarden et al. (1995) have proposed a large suite of met-
rics based on variable, object, method and system level. The detailed list can be found in
Henderson-Sellers (1996). Lee et al. (1995) have given four metrics, one for measuring cohe-
sion and three for measuring coupling (see Table 3.7).
The system-level polymorphism metrics are measured by Benlarbi and Melo (1999).
These metrics are used to measure static and dynamic polymorphism and are summa-
rized in Table 3.8.

TABLE 3.3
Chidamber and Kemerer Metric Suites
Metric Definition Construct Being Measured

CBO It counts the number of other classes to which a class is linked. Coupling
WMC It counts the number of methods weighted by complexity in a class. Size
RFC It counts the number of external and internal methods in a class. Coupling
LCOM Lack of cohesion in methods Cohesion
NOC It counts the number of immediate subclasses of a given class. Inheritance
DIT It counts the number of steps from the leaf to the root node. Inheritance

TABLE 3.4
Li and Henry Metric Suites
Metric Definition Construct Being Measured

DAC It counts the number of abstract data types in a class. Coupling
MPC It counts the number of unique send statements from a class to another class. Coupling
NOM It counts the number of methods in a given class. Size
SIZE1 It counts the number of semicolons. Size
SIZE2 It is the sum of the number of attributes and methods in a class. Size

TABLE 3.5
Lorenz and Kidd Metric Suite for Measuring Inheritance
Metric Definition
NOP It counts the number of immediate parents of a given class.
NOD It counts the number of indirect and direct subclasses of a given class.
NMO It counts the number of methods overridden in a class.
NMI It counts the number of methods inherited in a class.
NMA It counts the number of new methods added in a class.
SIX Specialization index

TABLE 3.6
Briand et al. Metric Suite

Metrics: IFCAIC, ACAIC, OCAIC, FCAEC, DCAEC, OCAEC, IFCMIC, ACMIC, DCMIC, FCMEC,
DCMEC, OCMEC, IFMMIC, AMMIC, OMMIC, FMMEC, DMMEC, OMMEC

These coupling metrics count the number of interactions between classes. The metrics
distinguish the relationship between the classes (friendship, inheritance, none), the different
types of interactions, and the locus of impact of the interaction. The acronym of each metric
indicates what interactions are counted:

• The first one or two characters indicate the type of coupling relationship between classes:
  A: ancestors, D: descendants, F: friend classes, IF: inverse friends (classes that declare a
  given class A as their friend), O: others, that is, none of the other relationships.
• The next two characters indicate the type of interaction:
  CA: There is a class–attribute interaction if class x has an attribute of type class y.
  CM: There is a class–method interaction if class x has a method with a parameter of type class y.
  MM: There is a method–method interaction if class x calls a method of another class y,
  or class x has a method of class y as a parameter.
• The last two characters indicate the locus of impact:
  IC: Import coupling, counts the number of other classes called by class x.
  EC: Export coupling, counts the number of other classes using class y.

TABLE 3.7
Lee et al. Metric Suites
Metric Definition Construct Being Measured

ICP Information flow-based coupling Coupling


IHICP Information flow-based inheritance coupling Coupling
NIHICP Information flow-based noninheritance coupling Coupling
ICH Information-based cohesion Cohesion

Yap and Henderson-Sellers (1993) have proposed a suite of metrics to measure cohesion
and reuse in OO systems. Aggarwal et al. (2005) defined two reusability metrics namely
function template factor (FTF) and class template factor (CTF) that are used to mea-
sure reuse in OO systems. The relevant metrics summarized in tables are explained in
subsequent sections.

TABLE 3.8
Benlarbi and Melo Polymorphism Metrics
Metric Definition
SPA It measures static polymorphism in ancestors.
DPA It measures dynamic polymorphism in ancestors.
SP It is the sum of SPA and SPD metrics.
DP It is the sum of DPA and DPD metrics.
NIP It measures polymorphism in noninheritance relations.
OVO It measures overloading in stand-alone classes.
SPD It measures static polymorphism in descendants.
DPD It measures dynamic polymorphism in descendants.

3.5.2 Coupling Metrics


Coupling is defined as the degree of interdependence between modules or classes. It is
measured by counting the number of other classes called by a class, during the software
analysis and design phases. Coupling increases complexity and decreases the maintainability,
reusability, and understandability of the system. Thus, interclass coupling must be kept
to a minimum. Coupling also increases the amount of testing effort required to test classes
(Henderson-Sellers 1996). Thus, the aim of the developer should be to keep the coupling
between two classes as low as possible.
Information flow metrics represent the amount of coupling in the classes. Fan-in and
fan-out metrics indicate the number of classes collaborating with the other classes:

1. Fan-in: It counts the number of other classes calling class X.


2. Fan-out: It counts the number of classes called by class X.

Figure 3.4 depicts the values of the fan-in and fan-out metrics for classes A, B, C, D, E, and F of
an example system. The value of fan-out should be as low as possible, because a high fan-out
increases the complexity and the maintenance effort of the software.

FIGURE 3.4
Fan-in and fan-out metrics of an example system (class A: fan-out = 4; class B: fan-in = 1,
fan-out = 1; class C: fan-in = 2, fan-out = 1; class D: fan-in = 1, fan-out = 1; class E: fan-in = 1,
fan-out = 0; class F: fan-in = 2, fan-out = 0).
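A minimal sketch of how fan-in and fan-out could be computed from a class dependency list is shown below; the call relationships are hypothetical, chosen only so that the resulting counts match Figure 3.4.

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // calls[X] = classes called by X (fan-out of X); hypothetical dependencies.
    std::map<std::string, std::vector<std::string>> calls = {
        {"A", {"B", "C", "D", "E"}},
        {"B", {"C"}},
        {"C", {"F"}},
        {"D", {"F"}},
        {"E", {}},
        {"F", {}}
    };

    std::map<std::string, int> fanIn, fanOut;
    for (const auto& [cls, callees] : calls) {
        fanOut[cls] = static_cast<int>(callees.size());
        for (const auto& callee : callees)
            ++fanIn[callee];            // each call contributes to the callee's fan-in
    }

    for (const auto& [cls, callees] : calls)
        std::cout << "Class " << cls << ": fan-in = " << fanIn[cls]
                  << ", fan-out = " << fanOut[cls] << '\n';
    return 0;
}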

Chidamber and Kemerer (1994) defined coupling as:


Two classes are coupled when methods declared in one class use methods or instance
variables of the other classes.

This definition also includes coupling based on inheritance. Chidamber and Kemerer
(1994) defined coupling between objects (CBO) as “the count of number of other classes
to which a class is coupled.” The CBO definition given in 1994 includes inheritance-based
coupling. For example, consider Figure 3.5, three variables of other classes (class B, class C,
and class D) are used in class A, hence, the value of CBO for class A is 3. Similarly, classes
D, F, G, and H have the value of CBO metric as zero.
Li and Henry (1993) used data abstraction technique for defining coupling. Data abstrac-
tion provides the ability to create user-defined data types called abstract data types (ADTs).
Li and Henry defined data abstraction coupling (DAC) as:
DAC = number of ADTs defined in a class
In Figure 3.5, class A has three ADTs (i.e., three nonsimple attributes). Li and Henry defined
another coupling metric known as message passing coupling (MPC) as “number of unique
send statements in a class.” Hence, if three different methods in class B access the same
method in class A, then MPC is 3 for class B, as shown in Figure 3.6.
Chidamber and Kemerer (1994) defined response for a class (RFC) metric as a set of
methods defined in a class and called by a class. It is given by RFC = |RS|, where RS, the
response set of the class, is given by:

	RS = {M} ∪ (for all i) {Ri}

where:
	{M} = set of all methods in the class
	{Ri} = set of methods called by method Mi

FIGURE 3.5
Values of CBO metric for a small program (class A: fan-out = 3, CBO = 3; class B: fan-out = 2,
CBO = 2; class C: fan-out = 1, CBO = 1; the remaining classes have a CBO value of zero).

FIGURE 3.6
Example of MPC metric (three different methods of class B call the same method of class A,
giving MPC = 3 for class B).

FIGURE 3.7
Example of RFC metric (class A has methods MethodA1(), MethodA2(), and MethodA3(), and
calls MethodB1() and MethodB2() of class B and one method of class C, giving RFC = 6 for
class A).

For example, in Figure 3.7, RFC value for class A is 6, as class A has three methods of its
own and calls 2 other methods of class B and one of class C.
A number of coupling metrics with respect to OO software have been proposed by
Briand et al. (1997). These metrics take into account the different OO design mechanisms
provided by the C++ language: friendship, classes, specialization, and aggregation. These
metrics may be used to guide software developers about which type of coupling affects
the maintenance cost and reduces reusability. Briand et al. (1997) observed that the cou-
pling between classes could be divided into different facets:

1. Relationship: It signifies the type of relationship between classes—friendship,


inheritance, or other.
2. Export or import coupling (EC/IC): It determines the number of classes calling
class A (export) and the number of classes called by class A (import).
3. Type of interaction: There are three types of interactions between classes—class–
attribute (CA), class–method (CM), and method–method (MM).
i. CA interaction: If there are nonsimple attributes declared in a class, the type
of interaction is CA. For example, consider Figure 3.8, there are two nonsimple

attributes in class A, B1 of type class B and C1 of type class C. Hence, any


changes in class B or class C may affect class A.
ii. CM interaction: If the object of class A is passed as parameter to method of
class B, then the type of interaction is said to be CM. For example, as shown in
Figure 3.8, object of class B, B1, is passed as parameter to method M1 of class
A, thus the interaction is of CM type.
iii. MM interaction: If a method Mi of class Ki calls method Mj of class Kj or if the
reference of method Mi of class Ki is passed as an argument to method Mj of
class Kj, then there is MM type of interaction between class Ki and class Kj. For
example, as shown in Figure 3.8, the method M2 of class B calls method M1 of
class A; hence, there is an MM interaction between class B and class A. Similarly,
a method of class B is passed as a reference to method M3 of class C.

The metrics for CM interaction type are IFCMIC, ACMIC, OCMIC, FCMEC, DCMEC, and
OCMEC. In these metrics, the first one/two letters denote the type of relationship (IF denotes
inverse friendship, A denotes ancestors, D denotes descendant, F denotes friendship, and O
denotes others). The next two letters denote the type of interaction (CA, CM, MM) between
classes. Finally, the last two letters denote the type of coupling (IC or EC).
Lee et al. (1995) acknowledged the need to differentiate between inheritance-based and
noninheritance-based coupling by proposing the corresponding measures: noninheritance
information flow-based coupling (NIH-ICP) and information flow-based inheritance coupling
(IH-ICP). Information flow-based coupling (ICP) metric is defined as the sum of NIH-ICP and
IH-ICP metrics and is based on method invocations, taking polymorphism into account.

class A;                       // forward declaration, as B::M2 below creates an object of class A

class B
{
public:
    void M2();                 // defined after class A
};

class C
{
    void M3(void (B::*m)())    // Method of class B passed as parameter: MM interaction
    {
    }
};

class A
{
    B B1;                      // Nonsimple attributes: CA interactions with classes B and C
    C C1;
public:
    void M1(B b)               // Object of class B passed as parameter: CM interaction
    {
    }
};

void B::M2()
{
    A A1;
    A1.M1(*this);              // Method of class A called: MM interaction
}

FIGURE 3.8
Example for computing type of interaction.

3.5.3 Cohesion Metrics


Cohesion is a measure of the degree to which the elements of a module are functionally
related to each other. The cohesion measure requires information about attribute usage
and method invocations within a class. A class that is less cohesive is more complex and is
likely to contain more number of faults in the software development life cycle. Chidamber
and Kemerer (1994) proposed lack of cohesion in methods (LCOM) metric in 1994. The
LCOM metric is used to measure the dissimilarity of methods in a class by taking into
account the attributes commonly used by the methods.
The LCOM metric calculates the difference between the number of method pairs that have
similarity zero and the number of method pairs that have similarity greater than zero. In LCOM,
similarity represents whether or not there is common attribute usage in a pair of methods.
The greater the similarity between methods, the more cohesive is the class. For
example, consider a class consisting of four attributes (A1, A2, A3, and A4). The method
usage of the class is given in Figure 3.9.
There are few problems related to LCOM metric, proposed by Chidamber and Kemerer
(1994), which were addressed by Henderson-Sellers (1996) as given below:
1. The value of the LCOM metric was zero in a number of real examples, even though
   the methods of these classes differed considerably in their similarity. Hence,
   although a high value of the LCOM metric suggests low cohesion, a zero value does
   not necessarily suggest high cohesion.
2. Chidamber and Kemerer (1994) gave no guideline for the interpretation of the value
   of LCOM. Thus, Henderson-Sellers (1996) revised the LCOM metric. Consider m
   methods accessing a set of attributes Di (i = 1,…,n). Let µ(Di) be the number of
   methods that access each datum. The revised LCOM1 metric is given as follows:

	LCOM1 = ((1/n) Σ(i=1 to n) µ(Di) − m) / (1 − m)

M1 = {A1, A2, A3, A4}
M2 = {A1, A2}
M3 = {A3}
M4 = {A3, A4}
M5 = {A2}

Similarity of a pair = 1 if the two methods share at least one attribute, 0 otherwise:
M1, M2 = 1    M1, M3 = 1    M1, M4 = 1    M1, M5 = 1
M2, M3 = 0    M2, M4 = 0    M2, M5 = 1
M3, M4 = 1    M3, M5 = 0
M4, M5 = 0

LCOM = (pairs with similarity 0) − (pairs with similarity > 0) = 4 − 6 < 0. Hence, LCOM = 0.

FIGURE 3.9
Example of LCOM metric.
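The sketch below (illustrative only) computes the Chidamber and Kemerer LCOM value and the Henderson-Sellers LCOM1 value for the attribute-usage sets of Figure 3.9.

#include <iostream>
#include <set>
#include <vector>

int main() {
    // Attribute sets used by methods M1..M5 of Figure 3.9 (attributes A1..A4 numbered 1..4).
    std::vector<std::set<int>> methods = {
        {1, 2, 3, 4}, {1, 2}, {3}, {3, 4}, {2}
    };

    // CK LCOM: pairs with no shared attribute minus pairs with a shared attribute (floored at 0).
    int disjointPairs = 0, sharingPairs = 0;
    for (size_t i = 0; i < methods.size(); ++i)
        for (size_t j = i + 1; j < methods.size(); ++j) {
            bool share = false;
            for (int a : methods[i])
                if (methods[j].count(a)) { share = true; break; }
            (share ? sharingPairs : disjointPairs)++;
        }
    int lcom = disjointPairs > sharingPairs ? disjointPairs - sharingPairs : 0;

    // Henderson-Sellers LCOM1 = ((1/n) * sum of mu(Di) - m) / (1 - m),
    // where mu(Di) = number of methods accessing attribute Di and m = number of methods.
    int n = 4, m = static_cast<int>(methods.size());
    double sumMu = 0;
    for (int a = 1; a <= n; ++a)
        for (const auto& s : methods)
            if (s.count(a)) sumMu += 1;
    double lcom1 = (sumMu / n - m) / (1.0 - m);

    std::cout << "LCOM = " << lcom << ", LCOM1 = " << lcom1 << '\n';
    return 0;
}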

FIGURE 3.10
Stack class (class name: Stack; attributes: top: Integer, a: Integer; methods: push(a, n), pop(),
getsize(), empty(), display()).

The approach by Bieman and Kang (1995) to measure cohesion was based on that of
Chidamber and Kemerer (1994). They proposed two cohesion measures—TCC and LCC.
TCC metric is defined as the percentage of pairs of directly connected public methods
of the class with common attribute usage. LCC is the same as TCC, except that it also
considers indirectly connected methods. A method M1 is indirectly connected with
method M3, if method M1 is connected to method M2 and method M2 is connected
to method M3. Hence, transitive closure of directly connected methods is represented by
indirectly connected methods. Consider the class stack shown in Figure 3.10.
Figure 3.11 shows the attribute usage of methods. The pair of public functions with com-
mon attribute usage is given below:

{(empty, push), (empty, pop), (empty, display), (getsize, push), (getsize, pop), (push, pop),
(push, display), (pop, display)}

Thus, TCC for the stack class is as given below:

	TCC(Stack) = (8/10) × 100 = 80%

The methods "empty" and "getsize" are indirectly connected, since "empty" is connected
to "push" and "getsize" is also connected to "push." Thus, by transitivity, "empty" is con-
nected to "getsize." Similarly, "getsize" is indirectly connected to "display."
LCC for the stack class is as given below:

	LCC(Stack) = (10/10) × 100 = 100%

FIGURE 3.11
Attribute usage of methods of class stack (push: {top, a, n}; pop: {top, a, n}; getsize: {n};
empty: {top}; display: {top, a}).
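A minimal sketch of the TCC/LCC computation for this attribute usage (illustrative only) is shown below; it counts directly connected pairs and then takes the transitive closure of the connections for LCC.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Attribute usage of the public methods of the stack class (Figure 3.11).
    std::map<std::string, std::set<std::string>> use = {
        {"push", {"top", "a", "n"}}, {"pop", {"top", "a", "n"}},
        {"getsize", {"n"}}, {"empty", {"top"}}, {"display", {"top", "a"}}
    };

    std::vector<std::string> m;
    for (const auto& [name, attrs] : use) m.push_back(name);
    int n = static_cast<int>(m.size()), totalPairs = n * (n - 1) / 2;

    // Direct connection: two methods share at least one attribute.
    std::vector<std::vector<bool>> conn(n, std::vector<bool>(n, false));
    int directPairs = 0;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            for (const auto& a : use[m[i]])
                if (use[m[j]].count(a)) { conn[i][j] = conn[j][i] = true; ++directPairs; break; }

    // Transitive closure (Floyd-Warshall style) adds the indirect connections needed for LCC.
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (conn[i][k] && conn[k][j]) conn[i][j] = true;

    int connectedPairs = 0;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (conn[i][j]) ++connectedPairs;

    std::cout << "TCC = " << 100.0 * directPairs / totalPairs << "%\n"
              << "LCC = " << 100.0 * connectedPairs / totalPairs << "%\n";
    return 0;
}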

Lee et al. (1995) proposed information flow-based cohesion (ICH) metric. ICH for a class is
defined as the weighted sum of the number of invocations of other methods of the same
class, weighted by the number of parameters of the invoked method.

3.5.4 Inheritance Metrics


The inheritance represents parent–child relationship and is measured in terms of num-
ber of subclasses, base classes, and depth of inheritance hierarchy by many authors in
the literature. Inheritance represents form of reusability. Chidamber and Kemerer (1994)
defined depth of inheritance tree (DIT) metric as maximum number of steps from class to
root node in a tree. Thus, in the case of multiple inheritance, the DIT is counted as the
maximum length from the class to the root of the tree. Considering Figure 3.12, the DIT for
class D and class F is 2.
The average inheritance depth (AID) is calculated as (Yap and Henderson-Sellers 1993):

	AID = (Σ depth of each class) / Total number of classes

where the depth of a class with multiple parents is the average of the path lengths through
each parent. In Figure 3.12, the depth of subclass D is 2 ([2 + 2]/2). The sum of the depths of
all the classes in the inheritance structure is: 0(A) + 1(B) + 1(C) + 2(D) + 0(E) + 1.5(F) +
0(G) = 5.5. Finally, dividing by the total number of classes (7), we get 5.5/7 = 0.79.
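A small sketch of this computation is given below. The parent relationships are hypothetical, chosen only so that the class depths match the example above; the depth of a multiply inherited class is taken as the average of its parents' depths plus one.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Depth of a class: 0 for a root; otherwise the average of its parents' depths plus one.
double depth(const std::string& cls,
             const std::map<std::string, std::vector<std::string>>& parents,
             std::map<std::string, double>& memo) {
    auto it = memo.find(cls);
    if (it != memo.end()) return it->second;
    const auto& p = parents.at(cls);
    double d = 0.0;
    if (!p.empty()) {
        for (const auto& parent : p) d += depth(parent, parents, memo) + 1.0;
        d /= p.size();
    }
    return memo[cls] = d;
}

int main() {
    // Hypothetical parent lists consistent with the depths used in the AID example above.
    std::map<std::string, std::vector<std::string>> parents = {
        {"A", {}}, {"E", {}}, {"G", {}},
        {"B", {"A"}}, {"C", {"A", "E"}},
        {"D", {"B", "C"}}, {"F", {"C", "G"}}
    };

    std::map<std::string, double> memo;
    double sum = 0.0;
    for (const auto& [cls, p] : parents) sum += depth(cls, parents, memo);
    std::cout << "AID = " << sum / parents.size() << '\n';  // 5.5 / 7, approximately 0.79
    return 0;
}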
Chidamber and Kemerer (1994) proposed yet another metric, number of children (NOC),
which counts the number of immediate subclasses of a given class in an inheritance hier-
archy. A class with a higher NOC requires more testing. In Figure 3.12, class B has one
subclass and class C has two subclasses. Lorenz and Kidd (1994) proposed the number of
parents (NOP) metric that
counts the number of direct parent classes for a given class in inheritance hierarchy. For
example, class D has NOP value of 2. Similarly, Lorenz and Kidd (1994) also developed
number of descendants (NOD) metric. The NOD metric defines the number of direct and
indirect subclasses of a class. In Figure 3.12, class E has NOD value of 3 (C, D, and F).
Tegarden et al. (1992) define number of ancestors (NA) as the number of indirect and direct
parent classes of a given class. Hence, as given in Figure 3.12, NA(D) = 4 (A, B, C, and E).
Other inheritance metrics defined by Lorenz and Kidd include the number of methods
added (NMA), number of methods overridden (NMO), and number of methods inherited
(NMI). NMO counts number of methods in a class with same name and signature as in its

FIGURE 3.12
Inheritance hierarchy (classes A and E at the top, classes B, C, and G at the middle level, and
classes D and F at the bottom of the hierarchy diagram).

parent class. NMA counts the number of new methods (neither overridden nor inherited)
added in a class. NMI counts number of methods inherited by a class from its parent class.
Finally, Lorenz and Kidd (1994) defined specialization index (SIX) using DIT, NMO, NMA,
and NMI metrics as given below:

	SIX = (NMO × DIT) / (NMO + NMA + NMI)

Consider the class diagram given in Figure 3.13. The class Employee inherits the class Person.
The class Employee overrides two functions, addDetails() and display(). Thus, the value of the
NMO metric for class Employee is 2. Two new methods are added in this class (getSalary() and
compSalary()). Hence, the value of the NMA metric is 2.
Thus, for class Employee, the value of NMO is 2, NMA is 2, and NMI is 1 (getEmail()).
For the class Employee, the value of SIX is:

	SIX = (2 × 1) / (2 + 2 + 1) = 2/5 = 0.4
The maximum number of levels in the inheritance hierarchy that are below the class are
measured through class to leaf depth (CLD). The value of CLD for class Person is 1.

3.5.5 Reuse Metrics


An OO development environment supports design and code reuse, the most straight-
forward type of reuse being the use of a library class (of code), which perfectly suits the

FIGURE 3.13
Example of inheritance relationship: the class Person (attributes: name, phone, addr, email;
methods: addDetails(), display(), getEmail()) is inherited by the class Employee (attributes:
Emp_id, basic, da, hra; methods: addDetails(), display(), getSalary(), compSalary()).

requirements. Yap and Henderson-Sellers (1993) discuss two measures designed to evaluate
the level of reuse possible within classes. The reuse ratio (U) is defined as:

	U = Number of superclasses / Total number of classes

Considering Figure 3.13, the value of U is 1/2. Another metric is the specialization ratio (S),
which is given as:

	S = Number of subclasses / Number of superclasses

In Figure 3.13, Employee is the subclass and Person is the parent class. Thus, S = 1.
Aggarwal et al. (2005) proposed another set of metrics for measuring reuse by using
generic programming in the form of templates. The metric FTF is defined as ratio of num-
ber of functions using function templates to total number of functions as shown below:

	FTF = Number of functions using function templates / Total number of functions

Consider a system with functions F1,…,Fn. Then,

	FTF = (Σ(i=1 to n) uses_FT(Fi)) / n

where:

	uses_FT(Fi) = 1, if function Fi uses a function template
	              0, otherwise

In Figure 3.14, the value of metric FTF = (1/3).


The metric CTF is defined as the ratio of number of classes using class templates to total
number of classes as shown below:

	CTF = Number of classes using class templates / Total number of classes

void method1() {
    // ...
}

template <class U>
void method2(U &a, U &b) {
    // ...
}

void method3() {
    // ...
}

FIGURE 3.14
Source code for calculation of FTF metric.

class X {
    // ...
};

template <class U, int size>
class Y {
    U ar1[size];
    // ...
};

FIGURE 3.15
Source code for calculating metric CTF.

Consider a system with classes C1,…,Cn. Then,

	CTF = (Σ(i=1 to n) uses_CT(Ci)) / n

where:

	uses_CT(Ci) = 1, if class Ci uses a class template
	              0, otherwise

In Figure 3.15, the value of metric CTF = 1/2.

3.5.6 Size Metrics


There are various conventional metrics applicable to OO systems. The traditional LOC
metric measures the size of a class (refer Section 3.3). However, the OO paradigm defines
many concepts that require additional metrics that can measure them. Keeping this in
view, many OO metrics have been proposed in the literature. Chidamber and Kemerer
(1994) developed weighted methods per class (WMC) metric as count of number of meth-
ods weighted by complexities and is given as:
	WMC = Σ(i=1 to n) Ci

where:
	M1,…,Mn are the methods defined in class K1
	C1,…,Cn are the complexities of the methods

Lorenz and Kidd defined number of attributes (NOA) metric given as the sum of number
of instance variables and number of class variables. Li and Henry (1993) defined number
of methods (NOM) as the number of local methods defined in a given class. They also
defined two other size metrics—namely, SIZE1 and SIZE2. These metrics are defined
below:

SIZE1 = number of semicolons in a class


SIZE2 = sum of NOA and NOM

3.6 Dynamic Software Metrics


The dynamic behavior of the software is captured through dynamic metrics. Dynamic
metrics related to coupling, cohesion, and complexity have been proposed in the literature.
The difference between static and dynamic metrics is presented in Table 3.9 (Chhabra and
Gupta 2010).

3.6.1 Dynamic Coupling Metrics


Yacoub et al. (1999) developed a set of metrics for measuring dynamic coupling—namely,
export object coupling (EOC) and import object coupling (IOC). These metrics are based on
executable code. The EOC metric calculates, as a percentage, the ratio of the number of messages
sent from one object o1 to another object o2 to the total number of messages exchanged between
o1 and o2 during the execution of a scenario. The IOC metric calculates, as a percentage, the ratio
of the number of messages sent from object o2 to o1 to the total number of messages exchanged
between o1 and o2 during the execution of a scenario. For example, if four messages are sent from
object o1 to object o2 and three messages are sent from object o2 to object o1, then
EOC(o1) = 4/7 × 100 ≈ 57% and IOC(o1) = 3/7 × 100 ≈ 43%.
Arisholm et al. (2004) proposed a suite of 12 dynamic IC and EC metrics. There are six
metrics defined at object level and six defined at class level. The first two letters of the
metric describe the type of coupling—import or export. The EC represents the number of
other classes calling a given class. IC represents number of other classes called by a class.
The third letter signifies object or class, and the last letter signifies the strength of coupling
(D: dynamic messages, M: distinct methods, C: distinct classes). Mitchell and Power
developed a dynamic coupling metric suite, summarized in Table 3.10.

3.6.2 Dynamic Cohesion Metrics


Mitchell and Power (2003, 2004) proposed extension of Chidamber and Kemerer’s LCOM
metric as dynamic LCOM. They proposed two variations of LCOM metric: runtime simple
LCOM (RLCOM) and runtime call-weighted LCOM (RWLCOM). RLCOM considers
instance variables accessed at runtime. RWLCOM assigns weights to each instance vari-
able by the number of times it is accessed at runtime.

TABLE 3.9
Difference between static and dynamic metrics
S. No.  Static Metrics                                        Dynamic Metrics

1       Collected without execution of the program            Collected at runtime
2       Easy to collect                                       Difficult to collect
3       Available in the early phases of software development Available in the later phases of software development
4       Less accurate as compared to dynamic metrics          More accurate
5       Inefficient in dealing with dead code and OO concepts Efficient in dealing with all OO concepts
        such as polymorphism and dynamic binding

TABLE 3.10
Mitchell and Power Dynamic Coupling Metric Suite
Metric                                              Definition

Dynamic coupling between objects                    This metric is the same as Chidamber and Kemerer's CBO
                                                    metric, but defined at runtime.
Degree of dynamic coupling between two classes      It is the percentage ratio of the number of times a class A
at runtime                                          accesses the methods or instance variables of another class B
                                                    to the total number of accesses of class A.
Degree of dynamic coupling within a given set       The metric extends the concept given by the above metric to
of classes                                          indicate the level of dynamic coupling within a given set of
                                                    classes.
Runtime import coupling between objects             Number of classes accessed by a given class at runtime.
Runtime export coupling between objects             Number of classes that access a given class at runtime.
Runtime import degree of coupling                   Ratio of the number of classes accessed by a given class at
                                                    runtime to the total number of accesses made.
Runtime export degree of coupling                   Ratio of the number of classes that access a given class at
                                                    runtime to the total number of accesses made.

3.6.3 Dynamic Complexity Metrics


Determining the complexity of a program is important to analyze its testability and main-
tainability. The complexity of the program may depend on the execution environment.
Munson and Khoshgoftaar (1993) proposed a dynamic complexity metric.

3.7 System Evolution and Evolutionary Metrics


Software evolution aims at incorporating and revalidating the probable significant changes
to the system without being able to predict a priori how user requirements will evolve. The
current system release or version can never be said to be complete and continues to evolve.
As it evolves, the complexity of the system will grow unless there is a better solution avail-
able to solve these issues.
The main objectives of software evolution are ensuring the reliability and flexibility
of the system. During the past 20 years, the life span of a system could be on average
6–10 years. However, it was recently found that a system should be evolved once every few
months to ensure it is adapted to the real-world environment. This is because of the rapid
growth of World Wide Web and Internet resources that make it easier for users to find
related information.
The idea of software evolution has led to open source development, as anybody can
download and, hence, modify the source code. The positive impact is that a number of new
ideas are discovered and generated that aim to improve the quality of the system and offer
a variety of choices. However, the negative impact is that there is no copyright protection
once a software product has been published as open source.
Over time, software systems, programs, and applications continue to develop. These
changes require new laws and theories to be created and justified, and some models require
additional aspects to be considered when developing future programs. Innovations,
improvements, and additions lead to unexpected forms of software effort in the maintenance
phase, and the maintenance issues themselves change to adapt to the evolution of future
software.
Software development is an ongoing process with a never-ending cycle. Even after
learning and refinement, the efficiency and effectiveness of the programs remain arguable
issues.
A software system may be analyzed by the following evolutionary and change metrics
(suggested by Moser et al. 2008), which may prove helpful in understanding the evolution
and release history of a software system.

3.7.1 Revisions, Refactorings, and Bug-Fixes


The metrics related to refactoring and bug-fixes are defined below:

• Revisions: Number of revisions of a software repository file


• Refactorings: Number of times a software repository file has been refactored
• Bug-fixes: Number of times a file has been associated with bug-fixing
• Authors: Number of distinct authors who have committed or checked in revisions of
a file to the software repository

3.7.2 LOC Based


The LOC-based evolution metrics are described as:

• LOC added: Sum total of all the lines of code added to a file for all of its revisions
in the repository
• Max LOC added: Maximum number of lines of code added to a file for all of its
revisions in the repository
• Average LOC added: Average number of lines of code added to a file for all of its
revisions in the repository
• LOC deleted: Sum total of all the lines of code deleted from a file for all of its revi-
sions in the repository
• Max LOC deleted: Maximum number of lines of code deleted from a file for all of
its revisions in the repository
• Average LOC deleted: Average number of lines of code deleted from a file for all of
its revisions in the repository

3.7.3 Code Churn Based


• Code churn: Sum total of (difference between added lines of code and deleted lines
of code) for a file, considering all of its revisions in the repository
• Max code churn: Maximum code churn for all of the revisions of a file in the
repository
• Average code churn: Average code churn for all of the revisions of a file in the
repository

3.7.4 Miscellaneous
The other related evolution metrics are:

• Max change set: Maximum number of files that are committed or checked in
together in a repository
• Average change set: Average number of files that are committed or checked in
together in a repository
• Age: Age of repository file, measured in weeks by counting backward from a given
release of a software system
• Weighted Age: Weighted Age of a repository file is given as (a small computational
  sketch follows this list):

	Weighted Age = (Σ(i=1 to N) Age(i) × LOC added(i)) / (Σ(i=1 to N) LOC added(i))

  where:
  i is a revision of a repository file and N is the total number of revisions for that
  file
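The sketch below (with hypothetical revision records) shows how a few of these evolution metrics, namely code churn, maximum and average code churn, and weighted age, could be computed for one repository file.

#include <algorithm>
#include <iostream>
#include <vector>

// One revision (check-in) of a repository file; the values used below are hypothetical.
struct Revision {
    double ageInWeeks;  // age of the revision, counted backward from a given release
    int locAdded;
    int locDeleted;
};

int main() {
    std::vector<Revision> revisions = {
        {40.0, 120, 10}, {25.0, 30, 45}, {10.0, 60, 5}
    };

    int codeChurn = 0, maxChurn = 0, totalAdded = 0;
    double weightedAgeNumerator = 0.0;
    for (const auto& r : revisions) {
        int churn = r.locAdded - r.locDeleted;          // churn of a single revision
        codeChurn += churn;                             // total code churn of the file
        maxChurn = std::max(maxChurn, churn);
        totalAdded += r.locAdded;
        weightedAgeNumerator += r.ageInWeeks * r.locAdded;
    }
    double avgChurn = static_cast<double>(codeChurn) / revisions.size();
    double weightedAge = weightedAgeNumerator / totalAdded;

    std::cout << "Code churn = " << codeChurn
              << ", max code churn = " << maxChurn
              << ", average code churn = " << avgChurn
              << ", weighted age = " << weightedAge << " weeks\n";
    return 0;
}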

3.8 Validation of Metrics


Several researchers recommend properties that software metrics should possess to increase
their usefulness. For instance, Basili and Reiter suggest that metrics should be sensitive to
externally observable differences in the development environment, and must correspond
to notions about the differences between the software artifacts being measured (Basili and
Reiter 1979). However, most recommended properties tend to be informal in the evaluation
of metrics. It is always desirable to have a formal set of criteria with which the proposed
metrics can be evaluated. Weyuker (1988) has developed a formal list of properties for software
metrics and has evaluated a number of existing software metrics against these properties.
Although many authors (Zuse 1991, Briand et al. 1999b) have criticized this approach,
it is still a widely known formal, analytical approach.
Weyuker’s (1988) first four properties address how sensitive and discriminative the
metric is. The fifth property requires that when two classes are combined their metric
value should be greater than the metric value of each individual class. The sixth property
addresses the interaction between two programs/classes. It implies that the interaction of
program/class A with program/class B may differ from the interaction of program/class C
with program/class B, even when program/class A and program/class C have the same
metric value. The seventh property requires that a measure be sensitive to state-
ment order within a program/class. The eighth property requires that renaming of vari-
ables does not affect the value of a measure. The last property states that the sum of the metric
values of the parts of a program/class could be less than the metric value of the program/class
when considered as a whole (Henderson-Sellers 1996). Only the properties applicable to
OO metrics are given below:
Let u be the metric of program/class P and Q

Property 1: This property states that

	(∃P)(∃Q): u(P) ≠ u(Q)

It ensures that no measure rates all programs/classes to be of the same metric value.
Property 2: Let c be a nonnegative number. Then, there are only finitely many
programs/classes with metric value c. This property ensures that there is sufficient
resolution in the measurement scale to be useful.
Property 3: There are distinct programs/classes P and Q such that u(P) = u(Q).
Property 4: For an OO system, two programs/classes having the same functionality
could have different metric values.

	(∃P)(∃Q): P ≡ Q and u(P) ≠ u(Q)

Property 5: When two programs/classes are concatenated, their metric should be
greater than the metrics of each of the parts.

	(∀P)(∀Q): u(P) ≤ u(P + Q) and u(Q) ≤ u(P + Q)

Property 6: This property suggests nonequivalence of interaction. There exist two
program/class bodies of equal metric value which, when separately concatenated
to the same third program/class, yield programs/classes of different metric value.
For programs/classes P, Q, and R:

	(∃P)(∃Q)(∃R): u(P) = u(Q) and u(P + R) ≠ u(Q + R)

Property 7: This property is not applicable for OO metrics (Chidamber and Kemerer 1994).
Property 8: It specifies that "if P is a renaming of Q, then u(P) = u(Q)."
Property 9: This property is not applicable for OO metrics (Chidamber and Kemerer
1994).

3.9 Practical Relevance


Empirical assessment of software metrics is important to ensure their practical relevance in
the software organizations. Such analysis is of high practical relevance and especially ben-
eficial for large-scale systems, where the experts need to focus their attention and resources
to problem areas in the system under development. In the subsequent section, we describe
the role of metrics in research and industry. We also provide the approach for calculating
metric thresholds.

3.9.1 Designing a Good Quality System


During the entire life cycle of a project, it is very important to maintain the quality and
to ensure that it does not deteriorate as a project progresses through its life cycle. Thus,
the project manager must monitor quality of the system on a continuous basis. To plan

and control quality, it is very important to understand how the quality can be measured.
Software metrics are widely used for measuring, monitoring, and evaluating the quality
of a project. Various software metrics have been proposed in the literature to assess the
software quality attributes such as change proneness, fault proneness, maintainability of
a class or module, and so on. A large portion of empirical research has been involved with
the development and evaluation of the quality models for procedural and OO software.
Software metrics have found a wide range of applications in various fields of software engi-
neering. As discussed, some of the familiar and common uses of software metrics are sched-
uling the time required by a project, estimating the budget or cost of a project, estimating the
size of the project, and so on. These parameters can be estimated at the early phases of soft-
ware development life cycle, and thus help software managers to make judicious allocation
of resources. For example, once the schedule and budget has been decided upon, managers
can plan in advance the amount of person-hours (effort) required. Besides this, the design of
software can be assessed in the industry by identifying the out of range values of the software
metrics. One way to improve the quality of the system is to relate structural attribute mea-
sures intended to quantify important concepts of a given software, such as the following:

• Encapsulation
• Coupling
• Cohesion
• Inheritance
• Polymorphism

to external quality attributes such as the following:

• Fault proneness
• Maintainability
• Testing effort
• Rework effort
• Reusability
• Development effort

The ability to assess quality of software in the early phases of the software life cycle is the
main aim of researchers so that structural attribute measures can be used for predicting exter-
nal attribute measures. This would greatly facilitate technology assessment and comparisons.
Researchers are working hard to investigate the properties of software measures to
understand the effectiveness and applicability of the underlying measures. Hence, we need
to understand what these measures are really capturing, whether they are really different,
and whether they are useful indicators of the quality attributes of interest. This will build
a body of evidence and present commonalities and differences across various studies.
Finally, these empirical studies will contribute largely to building good quality systems.

3.9.2 Which Software Metrics to Select?


The selection of software metrics (independent variables) in the research is a crucial
decision. The researcher must first decide on the domain of the metrics. After deciding
the domain, the researcher must decide the attributes to capture in the domain. Then,

the popular and widely used software metrics suite available to measure the constructs
is identified from the literature. Finally, a decision on the selection must be made on soft-
ware metrics. The criterion that can be used to select software metrics is that the selected
software metrics must capture all the constructs, be widely used in the literature, easily
understood, fast to compute, and computationally less expensive. The choice of metric
suite heavily depends on the goals of the research. For instance, in quality model pre-
diction, OO metrics proposed by Chidamber and Kemerer (1994) are widely used in the
empirical studies.
In cases where multiple software metrics are used, the attribute reduction techniques
given in Section 6.2 must be applied to reduce them, if model prediction is being conducted.

3.9.3 Computing Thresholds


As seen in previous sections, there are a number of metrics proposed and there are numer-
ous tools to measure them (see Section 5.8.3). Metrics are widely used in the field of soft-
ware engineering to identify problematic parts of the software that need focused and
careful attention. A researcher can also keep a track of the metric values, which will allow
to identify benchmarks across organizations. The products can be compared or rated,
which will allow to assess their quality. In addition to this, threshold values can be defined
for the metrics, which will allow the metrics to be used for decision making. Bender (1999)
defined threshold as “Breakpoints that are used to identify the acceptable risk in classes.”
In other words, a threshold can be defined as a benchmark or an upper bound such that
the values greater than a threshold value are considered to be problematic, whereas the
values lower are considered to be acceptable.
During the initial years, many authors have derived threshold values based on their
experience and, thus, those values are not universally accepted. For example, McCabe
(1976) defined a value of 10 as threshold for the cyclomatic complexity metric. Similarly, for
the maintainability index metric, 65 and 85 are defined as thresholds (Coleman et al. 1995).
Since these values are based on intuition or experience, it is not possible to generalize
results using these values. Besides the thresholds based on intuition, some authors defined
thresholds using mean (µ) and standard deviation (σ). For example, Erni and Lewerentz
(1996) defined the maximum and minimum values of threshold as T = µ + σ and T = µ − σ,
respectively. However, this methodology did not gain popularity as it used the assump-
tion that the metrics should be normally distributed, which is not applicable always.
French (1999) used Chebyshev’s inequality theorem (not restricted to normal distribution)
in addition to mean (µ) and standard deviation (σ) to derive threshold values. According
to French, a threshold can be defined as T = µ + k × σ (k = number of standard deviations).
However, this methodology was also not used much as it was restricted to only two-tailed
symmetric distributions, which is not justified.
A statistical model (based on logistic regression) to calculate the threshold values was
suggested by Ulm (1991). Benlarbi et al. (2000) and El Emam et al. (2000b) estimated the
threshold values of a number of OO metrics using this model. However, they found that
there was no statistical difference between the two models: the model built using the
thresholds and the model built without using the thresholds. Bender (1999) working in the
epidemiological field found that the proposed threshold model by Ulm (1991) has some
drawbacks. The model assumed that the probability of fault in a class is constant when a
metric value is below the threshold, and the fault probability increases according to the
logistic function, otherwise. Bender (1999) redefined the threshold effects as an acceptable
risk level. The proposed threshold methodology was recently used by Shatnawi (2010)

to identify the threshold values of various OO metrics. Besides this, Shatnawi et al. (2010)
also investigated the use of receiver operating characteristics (ROCs) method to identify
threshold values. The detailed explanation of the above two methodologies is provided
in the subsections below (Shatnawi 2006). Malhotra and Bansal (2014a) evaluated the
threshold approach proposed by Bender (1999) for fault prediction.

3.9.3.1 Statistical Model to Compute Threshold


The Bender (1999) method known as value of an acceptable risk level (VARL) is used to
compute the threshold values, where the acceptable risk level is given by a probability
Po (e.g., Po = 0.05 or 0.01). For the classes with metrics values below VARL, the risk of
a fault occurrence is lower than the probability (Po). In other words, Bender (1999) has
suggested that the value of Po can be any probability, which can be considered as the
acceptable risk level.
The VARL value is given by the following formula (Bender 1999):

	VARL = (1/β) × (ln(Po / (1 − Po)) − α)

where:
	α is a constant
	β is the estimated coefficient
	Po is the acceptable risk level

In this formula, α and β are obtained using the standard logistic regression formula
(refer Section 7.2.1). This formula is used for each metric individually to find its
threshold value.
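A minimal sketch of this computation is given below; it uses the α and β values reported for the WMC metric in Table 3.12 and treats the acceptable risk level Po as an input.

#include <cmath>
#include <iostream>

// VARL = (1/beta) * (ln(Po / (1 - Po)) - alpha)   (Bender 1999)
double computeVARL(double alpha, double beta, double po) {
    return (std::log(po / (1.0 - po)) - alpha) / beta;
}

int main() {
    // Constant and coefficient for the WMC metric taken from Table 3.12.
    double alpha = -2.034, beta = 0.06;
    for (double po : {0.01, 0.05, 0.08})
        std::cout << "VARL at Po = " << po << ": " << computeVARL(alpha, beta, po) << '\n';
    return 0;
}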
For example, consider the following data set (Table A.8 in Appendix I) consisting of
the metrics (independent variables): LOC, DIT, NOC, CBO, LCOM, WMC, and RFC. The
dependent variable is fault proneness. We calculate the threshold values of all the metrics
using the following steps:

Step 1: Apply univariate logistic regression to identify significant metrics.


The formula for univariate logistic regression is:

	P = e^g(x) / (1 + e^g(x))

where:
	g(x) = α + βx

where:
	x is the independent variable, that is, an OO metric
	α is the Y-intercept or constant
	β is the slope or estimated coefficient

Table 3.11 shows the statistical significance (sig.) for each metric. The “sig.” parame-
ter provides the association between each metric and fault proneness. If the “sig.”

TABLE 3.11
Statistical Significance of Metrics
Metric Significance

WMC 0.013
CBO 0.01
RFC 0.003
LOC 0.001
DIT 0.296
NOC 0.779
LCOM 0.026

value is below or at the significance threshold of 0.05, then the metric is said to
be significant in predicting fault proneness (shown in bold). The threshold values are
calculated only for the significant metrics. It can be observed from Table 3.11 that
DIT and NOC metrics are insignificant, and thus are not considered for further
analysis.
Step 2: Calculate the values of constant and coefficient for significant metrics.
For significant metrics, the values of constant (α) and coefficient (β) using univariate
logistic regression are calculated. These values of constant and coefficient will be
used in the computation of threshold values. The coefficient shows the impact of
the independent variable, and its sign shows whether the impact is positive or
negative. Table 3.12 shows the values of constant (α) and coefficient (β) of all the
significant metrics.
Step 3: Computation of threshold values.
We have calculated the threshold values (VARL) for the metrics that are found to be
significant using the formula given above. The VARL values are calculated for
different values of Po, that is, at different levels of risks (between Po = 0.01 and
Po = 0.1). The threshold values at different values of Po (0.01, 0.05, 0.08, and 0.1)
for all the significant metrics are shown in Table 3.13. It can be observed that the
threshold values of all the metrics change significantly as Po changes. This shows
that Po plays a significant role in calculating threshold values. Table 3.13 shows
that at risk level 0.01 and 0.05, VARL values are out of range (i.e., negative values)
for all of the metrics. At Po = 0.1, the threshold values are within the observation
range of all the metrics. Hence, in this example, we say that Po = 0.1 is the appro-
priate risk level and the threshold values (at Po = 0.1) of WMC, CBO, RFC, LOC,
and LOCM are 17.99, 14.46, 52.37, 423.44, and 176.94, respectively.

TABLE 3.12
Constant (α) and Coefficient (β) of Significant Metrics
Metric Coefficient (β) Constant (α)

WMC 0.06 −2.034


CBO 0.114 −2.603
RFC 0.032 −2.629
LOC 0.004 −2.648
LCOM 0.004 −1.662

TABLE 3.13
Threshold Values on the basis of Logistic Regression Method
Metrics VARL at 0.01 VARL at 0.05 VARL at 0.08 VARL at 0.1
WMC −42.69 −15.17 −6.81 17.99
CBO −17.48 −2.99 1.41 14.46
RFC −61.41 −9.83 5.86 52.37
LOC −486.78 −74.11 51.41 423.44
LCOM −733.28 −320.61 −195.09 176.94

3.9.3.2 Usage of ROC Curve to Calculate the Threshold Values


Shatnawi et al. (2010) calculated threshold values of OO metrics using ROC curve. To plot
the ROC curve, we need to define two variables: one binary (i.e., 0 or 1) and another con-
tinuous. Usually, the binary variable is the actual dependent variable (e.g., fault proneness
or change proneness) and the continuous variable is the predicted result of a test. When
the results of a test fall into one of the two obvious categories, such as change prone or
not change prone, then the result is a binary variable (1 if the class is change prone, 0 if
the class is not change prone) and we have only one pair of sensitivity and specificity. But,
in many situations, making a decision in binary is not possible and, thus, the decision or
result is given in probability (i.e., probability of correct prediction). Thus, the result is a
continuous variable. In this scenario, different cutoff points are selected that make each
predicted value (probability) as 0 or 1. In other words, different cutoff points are used to
change the continuous variable into binary. If the predicted probability is more than the
cutoff then the probability is 1, otherwise it is 0. In other words, if the predicted probability
is more than the cutoff then the class is classified as change prone, otherwise it is classified
as not change prone.
The procedure of ROC curves is explained in detail in Section 7.5.6; however, we sum-
marize it here to explain the concept. This procedure is carried out for various cutoff points,
and the values of sensitivity and 1-specificity are noted at each cutoff point. Thus, using the
(sensitivity, 1-specificity) pairs, the ROC curve is constructed. In other words, ROC curves
display the relationship between sensitivity (true-positive rate) and 1-specificity (false-
positive rate) across all possible cutoff values. We find an optimal cutoff point, the cutoff
point where balance between sensitivity and specificity is provided. This optimal cutoff
point is considered as the threshold value for that metric. Thus, threshold value (optimal
cutoff point) is obtained for each metric.
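As a rough illustration, the sketch below (with hypothetical sensitivity/specificity pairs) picks the cutoff at which sensitivity and specificity are closest to each other, which is one common way of choosing the optimal cutoff point.

#include <cmath>
#include <iostream>
#include <vector>

struct CutoffPoint {
    double cutoff;       // candidate metric value used as the cutoff
    double sensitivity;  // true-positive rate at this cutoff
    double specificity;  // true-negative rate at this cutoff
};

int main() {
    // Hypothetical (cutoff, sensitivity, specificity) triples for one metric.
    std::vector<CutoffPoint> points = {
        {20, 0.95, 0.40}, {40, 0.88, 0.62}, {62, 0.78, 0.77},
        {80, 0.60, 0.85}, {100, 0.45, 0.93}
    };

    const CutoffPoint* best = &points.front();
    for (const auto& p : points)
        if (std::fabs(p.sensitivity - p.specificity) <
            std::fabs(best->sensitivity - best->specificity))
            best = &p;  // keep the point where sensitivity and specificity are most balanced

    std::cout << "Optimal cutoff (threshold) = " << best->cutoff
              << " (sensitivity = " << best->sensitivity
              << ", specificity = " << best->specificity << ")\n";
    return 0;
}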
For example, consider the data set shown in Table A.8 (given in Appendix I). We need
to calculate the threshold values for all the metrics with the help of ROC curve. As dis-
cussed, to plot ROC curve, we need a continuous variable and a binary variable. In this
example, the continuous variable will be the corresponding metric and the binary vari-
able will be “fault.” Once ROC curve is constructed, the optimal cutoff point where sen-
sitivity equals specificity is found. This cutoff point is the threshold of that metric. The
thresholds (cutoff points) of all the metrics are given in Table 3.14. When the ROC curve for
LOC is constructed, the optimal cutoff point is found to be 62. Thus, the threshold value of LOC
is 62. This means that if a class has LOC value >62, it is more prone to faults (as our
dependent variable in this example is fault proneness) as compared to other classes.
Thus, focused attention can be laid on such classes and judicious allocation of resources
can be planned.

TABLE 3.14
Threshold Values on the basis of
ROC Curve Method
Metric Threshold Value
WMC 7.5
DIT 1.5
NOC 0.5
CBO 8.5
RFC 43
LCOM 20.5
LOC 304.5

3.9.4 Practical Relevance and Use of Software Metrics in Research


From the research point of view, the software metrics have a wide range of applications,
which help to design a better and much improved quality system:

1. Using software metrics, the researcher can identify change/fault-prone classes, which:
a. Enables software developers to take focused preventive actions that can reduce
maintenance costs and improve quality.
b. Helps software managers to allocate resources more effectively. For example, if
we have 26% of the testing resources available, we can use them to test the top
26% of classes predicted to be faulty/change prone.
2. Among a large set of software metrics (independent variables), we can find a suit-
able subset of metrics using techniques such as correlation-based feature
selection, univariate analysis, and so on. These techniques help in reducing the
number of independent variables (termed “data dimensionality reduction”).
Only the metrics that are significant in predicting the dependent variable are retained.
Once the metrics that are significant in detecting faulty/change-prone
classes are identified, software developers can use them in the early phases of
software development to measure the quality of the system (a small illustrative
sketch of such a selection is given after this list).
3. Another important application is that once the metrics captured by the models
are known, such metrics can be used as quality benchmarks to assess
and compare products.
4. Metrics also provide an insight into the software, as well as the processes used to
develop and maintain it.
5. There are various metrics that calculate the complexity of a program. For exam-
ple, the McCabe metric helps in assessing code complexity, the Halstead metrics help
in calculating different measurable properties of software (programming effort,
program vocabulary, program length, etc.), fan-in and fan-out metrics estimate
maintenance complexity, and so on. Once the complexity is known, more complex
programs can be given focused attention.
6. As explained in Section 3.9.3, we can calculate the threshold values of different
software metrics. By using threshold values of the metrics, we can identify and
focus on the classes that fall outside the acceptable risk level. Hence, during the
project development and progress, we can scrutinize the classes and prepare
alternative design structures wherever necessary.
7. Evolutionary algorithms such as genetic algorithms help in solving the optimiza-
tion problems and require the fitness function to be defined. Software metrics help
in defining the fitness function (Harman and Clark 2004) in these algorithms.
8. Last, but not the least, new software metrics that help to improve the quality
of the system can be defined in addition to the metrics proposed in
the literature.
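
As a small illustrative sketch of the data dimensionality reduction mentioned in point 2, the univariate filter below retains only the metrics that show a statistically significant association with the binary outcome. The file name, column names, and the 0.05 significance level are assumptions for illustration only.

import pandas as pd
from scipy.stats import pointbiserialr

data = pd.read_csv("table_a8.csv")          # hypothetical data file
metrics = ["WMC", "DIT", "NOC", "CBO", "RFC", "LCOM", "LOC"]

selected = []
for metric in metrics:
    # Point-biserial correlation between a continuous metric and the 0/1 outcome.
    correlation, p_value = pointbiserialr(data["fault"], data[metric])
    if p_value < 0.05:
        selected.append(metric)
print("Metrics retained as independent variables:", selected)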

3.9.5 Industrial Relevance of Software Metrics


The software design measurements can be used by the software industry in multiple ways:
(1) Software designers can use them to obtain quality benchmarks to assess and compare
various software products (Aggarwal et al. 2009). (2) Managers can use software metrics in
controlling and auditing the quality of the software during the software development life
cycle. (3) Software developers can use the software metrics to identify problematic areas and
use source code refactoring to improve the internal quality of the software. (4) Software testers
can use the software metrics in effective planning and allocation of testing and maintenance
resources (Aggarwal et al. 2009). In addition to this, various companies can maintain a large
database of software metrics, which allow them to compare a specific company’s application
software with the rest of the industry. This gives an opportunity to relatively measure that
software against its competitors. The progress of the software can be assessed by comparing the
planned or projected resource consumption, code completion, defect rates, and milestone
completions against the actual values as the work progresses. If there are large deviations
from the expectation, the managers can take corrective actions before it is too late. Also,
the process productivity (which can be derived from size, schedule time, and effort in
person-months) of projects completed in a given year can be compared against that of
projects completed in previous years by using the software metrics collected on those
projects. Thus, it can be seen that software metrics contribute in a great way to the software industry.

Exercises
3.1 What are software metrics? Discuss the various applications of metrics.
3.2 Discuss categories of software metrics with the help of examples of each category.
3.3 What are categorical metric scales? Differentiate between nominal scale and ordinal
scale in the measurements and also discuss both the concepts with examples.
3.4 What is the role and significance of Weyuker’s properties in software metrics?
3.5 Define the role of fan-in and fan-out in information flow metrics.
3.6 What are various software quality metrics? Discuss them with examples.
3.7 Define usability. What are the various usability metrics? What is the role of cus-
tomer satisfaction?
3.8 Define the following metrics:
a. Statement coverage metric
b. Defect density
c. FCMs
3.9 Define coupling. Explain Chidamber and Kemerer metrics with examples.
3.10 Define cohesion. Explain some cohesion metrics with examples.
3.11 How do we measure inheritance? Explain inheritance metrics with examples.
3.12 Define the following metrics:
a. CLD
b. AID
c. NOC
d. DIT
e. NOD
f. NOA
g. NOP
h. SIX
3.13 What is the purpose and significance of computing the threshold of software
metrics?
3.14 How can metrics be used to improve software quality?
3.15 Consider that the threshold value of the CBO metric is 8 and that of the WMC metric is 15.
What do these values signify? What possible corrective actions, according to you,
must a developer take if the values of CBO and WMC exceed these thresholds?
3.16 What are the practical applications of software metrics? How can the metrics be
helpful to software organizations?
3.17 What are the five measurement scales? Explain their properties with the help of
examples.
3.18 How are the external and internal attributes related to process and product metrics?
3.19 What is the difference between process and product metrics?
3.20 What is the relevance of software metrics in research?

Further Readings
An in-depth study of eighteen different categories of software complexity metrics was
provided by Zuse, where he tried to give basic definition for metrics in each category:

H. Zuse, Software Complexity: Measures and Methods, Walter de Gruyter, Berlin,
Germany, 1991.

Fenton’s book on software metrics is a classic and useful reference as it provides in-depth
discussions on measurement and key concepts related to metrics:

N. Fenton, and S. Pfleeger, Software Metrics: A Rigorous & Practical Approach, PWS
Publishing Company, Boston, MA, 1997.

The traditional Software Science metrics proposed by Halstead are listed in:

H. Halstead, Elements of Software Science, Elsevier North-Holland, Amsterdam,
the Netherlands, 1977.

Chidamber and Kemerer (1991) proposed the first significant OO design metrics. Then,
another paper by Chidamber and Kemerer defined and validated the OO metrics suite
in 1994. This metrics suite is widely used and has received the widest attention in empirical
studies:

S. Chidamber, and C. Kemerer, “A metrics suite for object-oriented design,” IEEE
Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

Detailed description on OO metrics can be obtained from:

B. Henderson-Sellers, Object Oriented Metrics: Measures of Complexity, Prentice Hall,
Englewood Cliffs, NJ, 1996.
M. Lorenz, and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood
Cliffs, NJ, 1994.

The following paper explains various OO metric suites with real-life examples:

K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, “Empirical study of object-
oriented metrics,” Journal of Object Technology, vol. 5, no. 8, pp. 149–173, 2006.

Other relevant publications on OO metrics can be obtained from:

www.acis.pamplin.vt.edu/faculty/tegarden/wrk-pap/ooMETBIB.PDF

Complete list of bibliography on OO metrics is provided at:

“Object-oriented metrics: An annotated bibliography,” http://dec.bournemouth.ac.uk/ESERG/bibliography.html.
4
Experimental Design

After the problem is defined, the experimental design process begins. The study must be
carefully planned and designed to draw useful conclusions from it. The formation of a
research question (RQ), selection of variables, hypothesis formation, data collection, and
selection of data analysis techniques are important steps that must be carefully carried
out to produce meaningful and generalized conclusions. This would also facilitate the
opportunities for repeated and replicated studies.
The empirical study involves creation of a hypothesis that is tested using statistical
techniques based on the data collected. The model may be developed using multivariate
statistical techniques or machine learning techniques. The steps involved in the experi-
mental design are presented to ensure that proper steps are followed for conducting an
empirical study. In the absence of a planned analysis, a researcher may not be able to draw
well-formed and valid conclusions. All the activities involved in empirical design are
explained in detail in this chapter.

4.1 Overview of Experimental Design


Experimental design is a very important activity, which involves laying down the back-
ground of the experiment in detail. This includes understanding the problem, identifying
goals, developing various RQs, and identifying the environment. The experimental design
phase includes eight basic steps as shown in Figure 4.1. In this phase, an extensive sur-
vey is conducted to have a complete overview of all the work done in literature till date.
Besides this, the research hypothesis is formally stated, including a null hypothesis and an alternative
hypothesis. The next step in the design phase is to determine and define the variables. The
variables are of two types: dependent and independent variables. In this step, the variables
are identified and defined. The measurement scale should also be defined. This imposes
restrictions on the type of data analysis method to be used. The environment in which the
experiment will be conducted is also determined, for example, whether the experiment
will use data obtained from industry, open source, or university. The procedure for mining
data from software repositories is given in Chapter 5. Finally, the data analysis methods to
be used for performing the analysis are selected.

4.2 Case Study: Fault Prediction Systems


FIGURE 4.1
Steps in experimental design: identify goals and develop research questions (understanding the problem); search, explore, and gather papers and critically analyze them (exhaustive literature survey); and hypothesis formulation, variable selection, empirical data collection, and data analysis method selection (creating the solution to the problem).

An example of an empirical study is taken to illustrate the experimental process and various
empirical concepts. The study will continue in Chapters 6 through 8, wherever required, to
help in explaining the concepts. The empirical study is based on predicting severity levels
of fault and has been published in Singh et al. (2010). Hereafter, the study will be referred
to as fault prediction system (FPS). The objective, motivation, and context of the study are
described below.

4.2.1 Objective of the Study


The aim of the work is to find the relationship between object-oriented (OO) metrics and
fault proneness at different severity levels of faults.

4.2.2 Motivation
The study predicts an important quality attribute, fault proneness during the early phases
of software development. Software metrics are used for predicting fault proneness. The
important contribution of this study is taking into account the severity of faults dur-
ing fault prediction. The value of severity quantifies the impact of the fault on the soft-
ware operation. The IEEE standard (1044–1993, IEEE 1994) states, “Identifying the severity
of an anomaly is a mandatory category as is identifying the project schedule, and project
cost impacts of any possible solution for the anomaly.” All the failures are not of the same
type; they may vary in the impact that they may cause. For example, a failure caused by a
fault may lead to a whole system crash or an inability to open a file (El Emam et al. 1999;
Aggarwal et al. 2009). In this example, it can be seen that the former failure is more severe
than the latter. Lack of determination of severity of faults is one of the main criticisms of
the approaches to fault prediction in the study by Fenton and Neil (1999). Therefore, there
is a need to develop prediction models that can be used to identify classes that are prone to
have serious faults. The software practitioners can use the model predicted with respect to
high severity of faults to focus the testing on those parts of the system that are likely to cause
serious failures. In this study, the faults are categorized with respect to all the severity levels
given in the NASA data set to improve the effectiveness of the categorization and provide
meaningful, correct, and detailed analysis of fault data. Categorizing the faults according to
different severity levels helps prioritize the fixing of faults (Afzal 2007). Thus, the software
practitioners can deal with the faults that are at higher priority first, before dealing with the
faults that are comparatively of lower priority. This would allow the resources to be judi-
ciously allocated based on the different severity levels of faults. In this work, the faults are
categorized into three levels: high severity, medium severity, and low severity.
Several regression (such as linear and logistic regression [LR]) and machine learning
techniques (such as decision tree [DT] and artificial neural network [ANN]) have been pro-
posed in the literature. There are few studies that use machine learning techniques
for fault prediction using OO metrics. Most of the prediction models in the literature are
built using statistical techniques. There are many machine learning techniques, and there
is a need to compare the results of various machine learning techniques as they give dif-
ferent results. ANN and DT methods have seen an explosion of interest over the years and
are being successfully applied across a wide range of problem domains such as finance,
medicine, engineering, geology, and physics. Indeed, these methods are being introduced
to solve the problems of prediction, classification, or control (Porter 1990; Eftekhar 2005;
Duman 2006; Marini 2008). It is natural for software practitioners and potential users to
wonder, “Which classification technique is best?,” or more realistically, “What methods
tend to work well for a given type of data set?” More data-based empirical studies, which
are capable of being verified by observation, or experiments are needed. Today, the evi-
dence gathered through these empirical studies is considered to be the most powerful
support possible for testing a given hypothesis (Aggarwal et al. 2009). Hence, conduct-
ing empirical studies to compare regression and machine learning techniques is necessary
to build an adequate body of knowledge to draw strong conclusions leading to widely
accepted and well-formed theories.

4.2.3 Study Context


This study uses the public domain data set KC1 obtained from the NASA metrics data pro-
gram (MDP) (NASA 2004; PROMISE 2007). The independent variables used in the study
are various OO metrics proposed by Chidamber and Kemerer (1994), and the dependent
variable is fault proneness. The performance of the predicted models is evaluated using
receiver operating characteristic (ROC) analysis.

4.2.4 Results
The results show that the area under the curve (measured from the ROC analysis) of mod-
els predicted using high-severity faults is low compared with the area under the curve of
the model predicted with respect to medium- and low-severity faults.

4.3 Research Questions


The first step in the experimental design is to formulate the RQs. This step states the
problem in the form of questions, and identifies the main concepts and relations to be
explored.

4.3.1 How to Form RQ?


The most essential aspect of research is formulating an RQ that is clear, simple, and easy to
understand. In other words, the scientific process begins after defining the RQs. The first ques-
tion that comes to mind when doing research is “What is the need to conduct research
(about a particular topic)?” The existing literature can provide answers to questions of
researchers. If the questions are not yet answered, the researcher intends to answer those
questions and carry forward the research. Thus, this fills the required “gap” by finding the
solution to the problem.
A research problem can be defined as a condition that can be studied or investigated
through the collection and analysis of data having theoretical or practical significance.
A research problem is that part of the research about which the researcher is continuously
thinking and for which he or she wants to find a solution. The RQs are extracted from the problem,
and the researcher may ask the following questions before framing the RQs:

• What issues need to be addressed in the study?


• Who can benefit from the analysis?
• How can the problem be mapped to realistic terms and measures?
• How can the problem be quantified?
• What measures should be taken to control the problem?
• Are there any unique scenarios for the problem?
• Is there any expected relationship between the causes and outcomes?

Hence, the RQs must fill the gap between existing literature and current work and must
give some new perspective to the problem. Figure 4.2 depicts the context of the RQs.

FIGURE 4.2
Context of research questions: What is the existing relation to the literature? What methods, data-collection techniques, and data analysis methods must be used to answer the research questions? What is the new contribution in the area?

The RQ may be formed according to the research types given below:

1. Causal relationships: It determines the causal relationships between entities. Does


coupling cause an increase in fault proneness?
2. Exploratory research: This research type is used to establish new concepts and
theories. What are the experiences of programmers using a unified modeling
language (UML) tool?
3. Explanatory research: This research type provides an explanation of given
theories. Why do developers fail to develop a good requirements document?
4. Descriptive research: It describes underlying mechanisms and events. How does
the inspection technique actually work?

Some examples of RQs are as follows:

• Is the inspection technique more effective than the walkthrough method in detecting


faults?
• Which software development life cycle model is more successful in the software
industry?
• Is the new approach effective in reducing testing effort?
• Can search-based techniques be applied to software engineering problems?
• What is the best approach for testing a software system?
• Which test data generation technique is effective in the industry?
• What are the important attributes that affect the maintainability of the software?
• Is effort dependent on the programming language, developer’s experience, or size
of the software?
• Which metrics can be used to predict software faults at the early phases of soft-
ware development?

4.3.2 Characteristics of an RQ
The following are the characteristics of a good RQ:

1. Clear: The reader who may not be an expert in the given topic should understand
the RQs. The questions should be clearly defined.
2. Unambiguous: The use of vague statements that can be interpreted in multiple ways
should be avoided while framing RQs. For example, consider the following RQ:
Are OO metrics significant in predicting various quality attributes?
The above statement is very vague and can lead to multiple interpretations. This is
because a number of quality attributes are present in the literature. It is not clear
which quality attribute one wants to consider. Thus, the above vague statement
can be redefined in the following way. In addition, the OO metrics can also be
specified.
Are OO metrics significant in predicting fault proneness?
3. Empirical focus: This property requires generating data to answer the RQs.
4. Important: This characteristic requires that answering an RQ adds significant
contribution to the research and that there will be beneficiaries.

5. Manageable: The RQ should be answerable, that is, it should be feasible to answer.


6. Practical use: What is the practical application of answering the RQ? The RQ must
be of practical importance to the software industry and researchers.
7. Related to literature: The RQ should relate to the existing literature. It should fill
gaps in the existing literature.
8. Ethically neutral: The RQ should be ethically neutral. The problem statement
should not contain the words “should” or “ought”. Consider the following example:
Should the techniques, peer reviews, and walkthroughs be used for verification in
contrast to using inspection?
The above statement is not ethically neutral, as it appears that the
researcher is favoring the peer review and walkthrough techniques over
inspection. This should not be the situation, and the question should appear
neutral by all means.
It could be restated scientifically as follows:
What are the strengths and weaknesses of various techniques available for veri-
fication, that is, peer review, walkthrough, and inspection? Which technique is
more suitable as compared to the others in a given scenario?

Finally, the research problem must be stated in either a declarative or interrogative form.
The examples of both the forms are given below:

Declarative form: The present study focuses on predicting change-prone parts of the
software at the early stages of the software development life cycle. Early prediction of
change-prone classes will save a large amount of resources in terms of money, man-
power, and time. For this, the study considers the well-known Chidamber and Kemerer
metrics suite and determines the relationship between the metrics and change proneness.
Interrogative form: What are the consequences of predicting the change-prone parts
of the software at the early stages of software development life cycle? What is the
relationship between Chidamber and Kemerer metrics and change proneness?

4.3.3 Example: RQs Related to FPS


The empirical study given in Section 4.2 addresses some RQs, which it intends to answer.
The formulation of such RQs will help the authors to have a clear understanding of the
problem and also help the readers to have a clear idea of what the study intends to dis-
cover. The RQs are stated below:

• RQ1: Which OO metrics are related to fault proneness of classes with regard to
high-severity faults?
• RQ2: Which OO metrics are related to fault proneness of classes with regard to
medium-severity faults?
• RQ3: Which OO metrics are related to fault proneness of classes with regard to
low-severity faults?
• RQ4: Is the performance of machine learning techniques better than the LR method?

4.4 Reviewing the Literature


Once the research problem is clearly understood and stated, the next step in the initial
phases of the experiment design is to conduct an extensive literature review. A literature
review identifies the related and relevant research and determines the position of the work
being carried out in the specified field.

4.4.1 What Is a Literature Review?


According to Bloomberg and Volpe (2008), literature review is defined as:
An imaginative approach to searching and reviewing the literature includes having
a broad view of the topic; being open to new ideas, methods and arguments; “playing”
with different ideas to see whether you can make new linkages; and following ideas to
see where they may lead.

The main aim of the research is to contribute toward a better understanding of the con-
cerned field. A literature review analyzes a body of literature related to a research topic
to have a clear understanding of the topic, what has already been done on the topic, and
what are the key issues that need to be addressed. It provides a complete overview of the
existing work in the field. Figure 4.3 depicts various questions that can be answered while
conducting a literature review.
The literature review involves the collection of research publications (articles, conference
papers, technical reports, book chapters, journal papers) on a particular topic. The aim
is to gather ideas, views, information, and evidence on the topic under investigation.

FIGURE 4.3
Key questions while conducting a review: What are the key theories, concepts, and ideas? What are the academic terminologies and information sources? What are the key areas where knowledge gaps exist? What are the major issues and controversies about the topic? What are the key questions that have been addressed till date? What have been the various methodologies used, and what is their quality? What are the areas in which different authors have different views? What are the areas on which further research can be done?

The purpose of the literature review is to effectively perform analysis and evaluation of
literature in relation to the area being explored. The major benefit of the literature review
is that the researcher becomes familiar with the current research before commencing his/
her own research in the same area.
The literature review can be carried out from two aspects. Research students perform
the review to gain an idea about the relevant materials related to their research so that they
can identify the areas where more work is required. The literature review carried out as
a part of the experimental design is related to the second aspect. The aim is to examine
whether the research area being explored is worthwhile or not. For example, search-based
techniques have shown the predictive capabilities in various areas where classification
problem was of complex nature. But till date, mostly statistical techniques have been
explored in software engineering-related problems. Thus, it may be worthwhile to explore
the performance capability of search-based techniques in software engineering-related
problems. The second aspect of the literature review concerns searching and analyz-
ing the literature after selecting a research topic. The aim is to gather an idea about the current
work being carried out by the researcher and to examine whether it creates new knowledge and adds
value to the existing research. This type of literature review supports the following claims
made by the researcher:

• The research topic is essential.


• The researcher has added some new knowledge to the existing literature.
• The empirical research supports or contradicts the existing results in the literature.

The goals of conducting a literature review are stated as follows:

1. Increase in familiarity with the previous relevant research and prevention from
duplication of the work that has already been done.
2. Critical evaluation of the work.
3. Facilitation of development of new ideas and thoughts.
4. Highlighting key findings, proposed methodologies, and research techniques.
5. Identification of inconsistencies, gaps, and contradictions in the literature.
6. Extraction of areas where attention is required.

4.4.2 Steps in a Literature Review


There are four main steps that need to be followed in a literature review. These steps
involve identifying digital portals for searching, conducting the search, analyzing the
most relevant research, and using the results in the current research.
The four basic steps in the literature review are as follows:

1. Develop search strategy: This step involves identification of digital portals,


research journals, and formation of search string. This involves survey of scholarly
journal articles, conference articles, proceeding articles, books, technical reports,
and Internet resources in various research-related digital portals such as:
a. IEEE
b. Springer
c. ScienceDirect/Elsevier
d. Wiley
e. ACM
f. Google Scholar
Before searching in digital portals, the researchers need to identify the most
credible research journals in the related areas. For example, in the area of soft-
ware engineering, some of the important journals in which search can be done
are: Software: Practice and Experience, Software Quality Journal, IEEE Transactions on
Software Engineering, Information and Software Technology, Journal of Computer Science
and Technology, ACM Transactions on Software Engineering Methodology, Empirical
Software Engineering, IEEE Software Maintenance, Journal of Systems and Software, and
Software Maintenance and Evolution.
Besides searching the journals and portals, various educational books, scientific
monograms, government documents and publications, dissertations, gray litera-
ture, and so on that are relevant to the concerned topic or area of research should
be explored. Most importantly, the bibliographies and reference lists of the materi-
als that are read need to be searched. These will give the pointers to more articles
and can also be a good estimate about how much have been read on the selected
topic of research.
After the digital portals and Internet resources have been identified, the next step
is to form the search string. The search string is formed by using the key terms
from the selected topic of research and is used to search the literature in the
digital portals (an illustrative search string is given after these steps).
2. Conduct the search: This step involves searching the identified sources by using
the formed search string. The abstracts and/or full texts of the research papers
should be obtained for reading and analysis.
3. Analyze the literature: Once the research papers relevant to the research topic
have been obtained, the abstract should be read, followed by the introduction
and conclusion sections. The relevant sections can be identified and read by the
section headings. In case of books, the index must be scanned to obtain an idea
about the relevant topics. The materials that are highly relevant in terms of mak-
ing the greatest contribution in the related research or the material that seems the
most convincing can be separated. Finally, a decision about reading the necessary
content must be made.
The strengths, drawbacks, and omissions in the literature review must be iden-
tified on the basis of the evidence present in the papers. After thoroughly and
critically analyzing the literature, the differences of the proposed work from the
literature must be highlighted.
4. Use the results: The results obtained from the literature review must then be
summarized for later comparison with the results obtained from the current
work.
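
As an illustration of forming a search string (step 1 above), a query for the FPS study described in Section 4.2 might combine the key terms with Boolean operators, for example: ("object-oriented metrics" OR "CK metrics") AND ("fault proneness" OR "fault prediction"). The exact terms here are only an assumption for illustration; in practice, they should be derived from the key concepts of the selected research topic.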

4.4.3 Guidelines for Writing a Literature Review


A literature review should have an introduction section, followed by the main body and
the conclusion section. The “introduction” section explains and establishes the importance
of the subject under concern. It discusses the kind of work that is done on the concerned
topic of research, along with any controversies that may have been encountered by
different authors. The “body” contains and focuses on the main idea behind each paper in
the review. The relevance of the papers cited should be clearly stated in this section of the
review. It is not important to simply restate what the other authors have said, but instead
our main aim should be to critically evaluate each paper. Then, the conclusion should be
provided that summarizes what the literature says. The conclusion summarizes all the
evidence presented and shows its significance. If the review is an introduction to our own
research, it indicates how the previous research has led to our own research, focusing on and
highlighting the gaps in the previous research (Bell 2005). The following points must be
covered while writing a literature review:

• Identify the topics that are similar in multiple papers to compare and contrast
different authors’ view.
• Group authors who draw similar conclusions.
• Group authors who are in disagreement with each other on certain topics.
• Compare and contrast the methodologies proposed by different authors.
• Show how the study is related to the previous studies in terms of the similarities
and the differences.
• Highlight exemplary studies and gaps in the research.

The above-mentioned points will help to carry out effective and meaningful literature
review.

4.4.4 Example: Literature Review in FPS


A summary of studies in the literature is presented in Table 4.1. The studies closest to the
FPS study are discussed below with key differences.
Zhou and Leung (2006) validated the same data set as in this study to predict fault
proneness of models with respect to two categories of faults: high and low. They cat-
egorized faults with severity rating 1 as high-severity faults and faults with other sever-
ity levels as low-severity faults. They did not consider the faults that originated from
design and commercial off-the-shelf (COTS)/operating system (OS). The approach in
FPS differs from Zhou and Leung (2006) as this work categorized faults into three sever-
ity levels: high, medium, and low. The medium-severity level of faults is more severe
than low-severity level of faults. Hence, the classes having faults of medium-severity
level must be given more attention compared with the classes with low-severity level
of faults. In the study conducted by Zhou and Leung, the classes were not categorized
into medium- and low-severity level of faults. Further, the faults produced from the
design were not taken into account. The FPS study also analyzed two different machine
learning techniques (ANN and DT) for predicting fault proneness of models and evalu-
ated the performance of these models using ROC analysis. Pai and Bechta Dugan (2007)
used the same data set using a Bayesian approach to find the relationship of software
product metrics to fault content and fault proneness. They did not categorize faults at
different severity levels and mentioned that a natural extension to their analysis is sever-
ity classification using Bayesian network models. Hence, their work is not comparable
with the work in the FPS study.
TABLE 4.1
Literature Review of Fault Prediction Studies
Empirical Data Collection Statistical Techniques
Univariate Multivariate Predicted Model
Studies Language Environment Independent Variables Analysis Analysis Evaluation

Basili et al. C++ University environment, C&K metrics, 3 code metrics LR LR Contingency table,
(1996) 180 classes correctness,
completeness
Abreu and Melo C++ University environment, MOOD metrics Pearson Linear least square R2
(1996) UMD: 8 systems correlation
Binkley and C++, Java 4 case studies, CCS: CDM, DIT, NOC, NOD, NCIM, Spearman – –
Schach (1998) 113 classes, 82K SLOC, NSSR, CBO rank
29 classes, 6K SLOC correlation
Harrison et al. C++ University environment, DIT, NOC, NMI, NMO Spearman – –
(1999) SEG1: 16 classes SEG2: rho
22 classes SEG3:
27 classes
Benlarbi and C++ LALO: 85 classes, OVO, SPA, SPD, DPA, DPD, LR LR –
Melo (1999) 40K SLOC CHNL, C&K metrics, part of
coupling metrics
El Emam et al. Java V0.5: 69 classes, V0.6: Coupling metrics, C&K metrics LR LR R2, leave one-out
(2001a) 42 classes cross-validation
El Emam et al. C++ Telecommunication Coupling metrics, DIT LR LR R2
(2001b) framework: 174 classes
Tang et al. C++ System A: 20 classes C&K metrics (without LCOM) LR – –
(1999) System B: 45 classes
System C: 27 classes
Briand et al. C++ University environment, Suite of coupling metrics, LR LR R2, 10 cross-validation,
(2000) UMD: 180 classes 49 metrics correctness,
completeness

Glasberg et al. Java 145 classes NOC, DIT ACAIC, OCAIC, LR LR R2, leave one-out
(2000) DCAEC, OCAEC cross-validation, ROC
curve, cost-saving
model
El Emam et al. C++, Java Telecommunication C&K metrics, NOM, NOA LR – –
(2000a) framework: 174 classes,
83 classes, 69 classes of
Java system
Briand et al. C++ Commercial system, Suite of coupling metrics, OVO, LR LR R2, 10 cross-validation,
(2001) LALO: 90 classes, SPA, SPD, DPA, DPD, NIP, SP, correctness,
40K SLOC DP, 49 metrics completeness

Cartwright and C++ 32 classes, 133K SLOC ATTRIB, STATES, EVENT, Linear Linear regression –
Shepperd (2000) READS, WRITES, DIT, NOC regression
Briand and Java Commercial system, Polymorphism metrics, C&K LR LR, Mars 10 cross-validation,
Wüst (2002) XPOSE & JWRITER: correctness,
144 classes completeness
Yu et al. (2002) Java 123 classes, 34K SLOC C&K metrics, Fan-in, WMC OLS+LDA – –
Subramanyam C++, Java C&K metrics OLS OLS
and Krishnan
(2003)
Gyimothy et al. C++ Mozilla v1.6: C&K metrics, LCOMN, LOC LR, linear LR, linear 10 cross-validation,
(2005) 3,192 classes regression, regression, NN, correctness,
NN, DT DT completeness
Aggarwal et al. Java University environment, Suite of coupling metrics LR LR 10 cross-validation,
(2006a, 2006b) 136 classes sensitivity, specificity
Arisholm and Java XRadar and JHawk C&K metrics LR LR
Briand (2006)



Yuming and C++ NASA data set, C&K metrics LR, ML LR, ML Correctness,
Hareton (2006) 145 classes completeness


Kanmani et al. C++ Library management C&K and Briand metrics: Total 64 PC method LDA, LR, NN (BPN Type I and II error, corr-
(2007) S/w system developed (10 cohesion, 18 inheritance, and PNN) ectness, completeness,
by students, 1,185classes 29 coupling, and 7 size) efficiency, effectiveness
Aggarwal et al. Java 12 systems developed by 52 metrics (26 coupling, LR LR 9 cross-validation
(2009) undergraduate at the 7 cohesion, 11 inheritance, and
University School of 8 size) by C&K (1991, 1994),
Information Li and Henry (1993), Lee et al.
Technology (USIT) (1995), Briand et al. (1999a), Hitz
and Montazeri (1995), Bieman
and Kang (1995), Tegarden et al.
(1995), Henderson-Sellers (1996),
Lorenz and Kidd (1994), Lake
and Cook (1994)
Singh et al. C++ Public domain data set C&K metrics, LOC LR, ML (NN, LR, ML (NN,DT) Sensitivity, specificity,
(2010) KC1 from NASA MDP, DT) precision,
145 classes, completeness
2,107 methods, 40K LOC
Singh et al. C++ Public domain data set C&K metrics, LOC ML (SVM) ML (SVM) Sensitivity, specificity,
(2009) KC1 from the NASA precision,
MDP, 145 classes, completeness
2107 methods, 40K LOC
Malhotra et al. C++ Public domain data set C&K metrics, LOC ML (SVM) ML (SVM) Sensitivity, specificity,
(2010) KC1 from the NASA precision,
MDP, 145 classes, completeness
2,107 methods, 40K LOC
Zhou et al. Java Three major releases, 2.0, 10 metrics by C&K, Michura and LR and ML LR and ML (NB, Accuracy, sensitivity,
(2010) 2.1, and 3.0, with sizes Capretz (2005), Etzkorn et al. (1999), (NB, KStar, KStar, Adtree) specificity, precision,
796, 988, and 1,306K Olague et al. (2008), Lorenz and Adtree) F-measure
SLOC, respectively Kidd (1994), Briand et al. (2001)

Di Martino et al. Java Versions 4.0, 4.2, and C&K, NPM, LOC – Combination of Precision, accuracy,
(2011) 4.3 of the jEdit system GA+SVM, LR, recall, F-measure
C4.5, NB, MLP,
KNN, and RF
Azar and Java 8 open source software 22 metrics by Henderson-Sellers – ACO, C4.5, random Accuracy
Vybihad (2011) systems (2007), Barnes and Swim (1993), guessing
Coppick and Cheatham (1992),
C&K
Malhotra and – Open source data set C&K and QMOOD metrics LR LR and ML (ANN, Sensitivity, specificity,
Singh (2011) Arc, 234 classes RF, LB, AB, NB, precision
KStar, Bagging)
Malhotra and Java Apache POI, 422 classes MOOD, QMOOD, C&K LR LR, MLP (RF, Sensitivity, specificity,
Jain (2012) (19 metrics) Bagging, MLP, precision
SVM, genetic
algorithm)
Source: Compiled from multiple sources.
– implies that the feature was not examined.
LR: logistic regression, LDA: linear discriminant analysis, ML: machine learning, OLS: ordinary least square linear regression, PC: principal component analysis, NN:
neural network, BPN: back propagation neural network, PNN: probabilistic neural network, DT: decision tree, MLP: multilayer perceptron, SVM: support vector
machine, RF: random forest, GA+SVM: combination of genetic algorithm and support vector machine, NB: naïve Bayes, KNN: k-nearest neighbor, C4.5: decision tree,
ACO: ant colony optimization, Adtree: alternating decision tree, AB: adaboost, LB: logitboost, CHNL: class hierarchy nesting level, NCIM: number of classes inheriting
a method, NSSR: number of subsystems–system relationship, NPM: number of public methods, LCOMN: lack of cohesion in methods allowing negative values.
Related to metrics: C&K: Chidamber and Kemerer, MOOD: metrics for OO design, QMOOD: quality metrics for OO design.

4.5 Research Variables


Before the detailed experiment design begins, the relevant independent and dependent
variables have to be selected.

4.5.1 Independent and Dependent Variables


The variables used in an experiment can be divided into two types: dependent variable
and independent variable. While conducting a research, the dependent and the indepen-
dent variables that are used in the study need to be defined.
In an empirical study, the independent variable is a variable that can be changed or
varied to see its effect on the dependent variable. In other words, the independent variable
is the variable that is controlled, and the dependent variable is the variable that is affected by it. Thus,
the dependent variable is “dependent” on the independent variable. As the experimenter
changes the independent variable, the change in the dependent variable is observed. The
selection of variables also involves selecting the measurement scale. Any of these vari-
ables may be discrete or continuous. A binary variable has only two values. For example,
whether a component is faulty or is not faulty. A continuous variable has many values
(refer to Chapter 3 for details). In software product metric validation, continuous variables
are usually counts. A count is characterized by being a non-negative integer, and hence
is a continuous variable. Usually, in empirical studies in software engineering, there is
one dependent variable. Figure 4.4 depicts the relationship between the dependent and
independent variable. Table 4.2 states the key differences between the independent and
dependent variables.

FIGURE 4.4
Relationship between dependent and independent variables: independent variables 1 through N are input to the process, which produces the dependent variable as output.

TABLE 4.2
Differences between Dependent and Independent Variables
Independent Variable Dependent Variable

Variable that is varied, changed, or manipulated. It is not manipulated. The response or outcome that is
measured when the independent variable is varied.
It is the presumed cause. It is the presumed effect.
Independent variable is the antecedent. Dependent variable is the consequent.
Independent variable refers to the status of the Dependent variable refers to the status of the
“cause,” which leads to the changes in the status of “outcome” in which the researcher is interested.
the dependent variable.
Also known as explanatory or predictor variable. Also known as response, outcome, or target
variable.
For example, various metrics that can be used to For example, whether a module is faulty or not.
measure various software constructs.

4.5.2 Selection of Variables


The selection of appropriate variables is not easy and generally based on the domain knowl-
edge of the researcher. In research, the variables are identified according to the things
being measured. When exploring new topics, for selection of variables a researcher must
carefully analyze the research problem and identify the variables affecting the dependent
variable. However, in case of explored topics, the literature review can also help in identi-
fication of variables. Hence, the selection is based on the researcher’s experience, informa-
tion obtained from existing published research, and judgment or advice obtained from the
experts in the related areas. In fact, the independent variables are selected most of the time
on the basis of information obtained from the published empirical work of a researcher’s
own as well as from other researchers. Hence, the research problem must be thoroughly
and carefully analyzed to identify the variables.

4.5.3 Variables Used in Software Engineering


The independent variables can be different software metrics proposed in existing studies.
There are different types of metrics, that is, product-related metrics and process-related
metrics. Under product-related metrics, there are class level, method level, component
level, and file level metrics. All these metrics can be utilized as independent variables.
For example, software metrics such as volume, lines of code (LOC), cyclomatic complex-
ity, and branch count can be used as independent variables.
The variable describing the quality attributes of classes to be predicted is called depen-
dent variable. A variable used to explain a dependent variable is called independent
variable. The binary dependent variables of the models can be fault proneness, change
proneness, and so on, whereas, the continuous dependent variables can be testing effort,
maintenance effort, and so on.
Fault proneness is defined as the probability of fault detection in a class. Change prone-
ness is defined as the probability of a class being changed in future. Testing effort is defined
as LOC changed or added throughout the life cycle of the defect per class. Maintenance
effort is defined as LOC changed per class in its maintenance history. The quality attributes
are somewhat interrelated, for example, as the fault proneness of a class increases, so will the
testing effort required to correct the faults in the class.

4.5.4 Example: Variables Used in the FPS


The independent variables are various OO metrics proposed by Chidamber and Kemerer
(1994). These include coupling between objects (CBO), response for a class (RFC), number of
children (NOC), depth of inheritance (DIT), lack of cohesion in methods (LCOM), weighted
methods per class (WMC), and LOC. The definitions of these metrics can be found in Chapter 3.
The binary dependent variable is fault proneness. Fault proneness is defined as the
probability of fault detection in a class (Briand et al. 2000; Pai and Bechta Dugan 2007;
Aggarwal et al. 2009). The dependent variable will be predicted based on the faults found
during the software development life cycle.
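
To make the roles of these variables concrete, the following minimal sketch builds a fault proneness model from the above independent variables using logistic regression and evaluates it with 10-fold cross-validation and the area under the ROC curve. The file name and column names are assumptions for illustration, and logistic regression is used here only as one of the possible techniques.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("kc1_classes.csv")        # hypothetical export of the KC1 data
X = data[["WMC", "DIT", "NOC", "CBO", "RFC", "LCOM", "LOC"]]  # independent variables
y = data["fault"]                            # binary dependent variable (fault proneness)

model = LogisticRegression(max_iter=1000)
auc_scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print("Mean area under the ROC curve:", auc_scores.mean())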

4.6 Terminology Used in Study Types


The choice of the empirical process depends on the type of the study. There are two types
of processes that can be followed based on the study type:

• Hypothesis testing without model prediction


• Hypothesis testing after model prediction

For example, if the researcher wants to find whether a UML tool is better than a traditional
tool and the effectiveness of the tool is measured in terms of productivity of the persons
using the tool, then hypothesis testing can be used directly using the data given in Table 4.3.
Consider another instance where the researcher wants to compare two machine learn-
ing techniques to find the effect of software metrics on the probability of occurrence of faults.
In this problem, first the model is predicted using two machine learning techniques. In the
next step, the model is validated and performance is measured in terms of performance
evaluation metrics (refer Chapter 7). Finally, hypothesis testing is applied on the results
obtained in the previous step for verifying whether the performance of one technique is
better than the other technique.
Figure 4.5 shows that the terms independent and dependent variables are used in both
experimental studies and multivariate analysis. In multivariate analysis, the independent
and dependent variables are used in model prediction. The independent variables are used
as predictor variables to predict the dependent variable. In experimental studies, factors
for a statistical test are also termed as independent variables that may have one or more

TABLE 4.3
Productivity for Tools
UML Tool Traditional Tool

14 52
67 61
13 14
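
As a minimal sketch of hypothesis testing without model prediction, the productivity values of Table 4.3 can be fed directly to a statistical test. The choice of an independent-samples t-test here is only for illustration (Chapter 6 discusses how to select the appropriate test), and with only three observations per tool the result would not be meaningful in practice.

from scipy.stats import ttest_ind

uml_tool = [14, 67, 13]           # productivity values from Table 4.3
traditional_tool = [52, 61, 14]

# Null hypothesis: the mean productivity is the same for both tools.
t_statistic, p_value = ttest_ind(uml_tool, traditional_tool)
print("t =", t_statistic, ", p =", p_value)
# A p-value below the chosen significance level would lead to rejecting the null hypothesis.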

FIGURE 4.5
Terminology used in experimental studies and multivariate analysis studies. In multivariate analysis (model prediction), the independent variables are software metrics such as fan-in and cyclomatic complexity, and the dependent variable is a quality attribute such as fault proneness. In experimental studies (hypothesis testing), the independent variables or factors are techniques and methods such as machine learning techniques, the treatments are the values of the factors such as decision tree and neural network, and the dependent variable is a performance measure such as accuracy.

levels called treatments or samples as suitable for a specific statistical test. For example, a
researcher may wish to test whether the mean of two samples is equal or not such as in the
case when a researcher wants to explore different software attributes like coupling before
and after a specific treatment like refactoring. Another scenario could be when a researcher
wants to explore the performance of two or more learning algorithms or whether two treat-
ments give uniform results. Thus, the dependent variable in experimental study refers to
the behavior measures of a treatment. In software engineering research, in some cases,
these may be the performance measures. Similarly, one may refer to performances on dif-
ferent data sets as data instances or subjects, which are exposed to these treatments.
In software engineering research, the performance measures on data instances are
termed as the outcome or the dependent variable in case of hypothesis testing in experi-
mental studies. For example, technique A when applied on a data set may give an accuracy
(performance measure, defined as percentage of correct predictions) value of 80%. Here,
technique A is the treatment and the accuracy value of 80% is the outcome or the dependent
variable. However, in multivariate analysis or model prediction, the independent variables
are software metrics and the dependent variable may be, for example, a quality attribute.
To avoid confusion, in this book, we use terminology related to multivariate analysis
unless and until specifically mentioned.

4.7 Hypothesis Formulation


After the variables have been identified, the next step is to formulate the hypothesis in the
research. This is one of the important steps in empirical research.

4.7.1 Experiment Design Types


In this section, we discuss the experimental design types used in experimental studies.
The selection of appropriate statistical test for testing hypothesis depends on the type of
experimental design. There are four experimental design types that can be used for design-
ing a given case study. Factor is the technique or method used in an empirical study such
as machine learning technique or verification method. Treatment is the type of techniques
such as DT is a machine learning technique and inspection is a verification technique. The
types of experiment design are summarized below.

Case 1: One factor, one treatment—In this case, there is one technique under obser-
vation. For example, if the distribution of the data needs to be checked for a given
variable, then this design type can be used. Consider a scenario where 25 students
developed the same program. The cyclomatic complexity values of the programs
can be evaluated using the chi-square test.
Case 2: One factor, two treatments—This type of design may be purely randomized
or paired design. For example, a researcher wants to compare the performance
of two verification techniques such as walkthroughs and inspections. Another
instance is when a researcher wants to compare the performance of two machine
learning techniques, naïve Bayes and DT, on a given or over multiple data sets. In
these two examples, factor is one (verification method or machine learning tech-
nique) but treatments are two. Paired t-test or Wilcoxon test can be used in these
cases. Chapter 6 provides examples for these tests.

TABLE 4.4
Factors and Levels of Example
Factor Level 1 Level 2

Paradigm type Structural OO


Software complexity Difficult Simple

Case 3: One factor, more than two treatments—In this case, the technique that is to
be analyzed contains multiple values. For example, a researcher wants to compare
multiple search-based techniques such as genetic algorithm, particle swarm opti-
mization, genetic programming, and so on. Friedman test can be used to solve this
example. Section 6.4.13 provides solution for this example.
Case 4: Multiple factors and multiple treatments—In this case, more than one factor
is considered with multiple treatments. For instance, consider an example where
a researcher wants to compare paradigm types, such as the structured paradigm with
the OO paradigm. In conjunction with the paradigm type, the researcher also wants to
check whether the complexity of the software is difficult or simple. This example is
shown in Table 4.4 along with the factors and levels. The ANOVA test can be used to
solve such examples (a small sketch illustrating tests for these cases is given after this list).
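
The following minimal sketch illustrates how the tests for cases 2 and 3 could be applied; the performance values are placeholders for illustration only, and the tests themselves are discussed in Chapter 6.

from scipy import stats

# Placeholder accuracy values of three treatments (e.g., techniques) on five data sets.
technique_a = [0.81, 0.78, 0.85, 0.77, 0.80]
technique_b = [0.76, 0.74, 0.79, 0.75, 0.73]
technique_c = [0.70, 0.69, 0.74, 0.68, 0.66]

# Case 2 - one factor, two treatments (paired design): Wilcoxon signed-rank test.
print(stats.wilcoxon(technique_a, technique_b))

# Case 3 - one factor, more than two treatments: Friedman test.
print(stats.friedmanchisquare(technique_a, technique_b, technique_c))

# Case 4 - multiple factors and multiple treatments would typically be analyzed
# with an ANOVA on a suitable experimental layout.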

The examples of the above experimental design types are given in Section 6.4. After deter-
mining the appropriate experiment design type, the hypothesis needs to be formed in an
empirical study.

4.7.2 What Is Hypothesis?


The main objective of an experiment usually is to evaluate a given relationship or hypoth-
esis formed between the cause and the effect. Many authors understand the definition of
hypothesis differently:
A hypothesis may be precisely defined as a tentative proposition suggested as a solution
to a problem or as an explanation of some phenomenon. (Ary et al. 1984)
Hypothesis is a formal statement that presents the expected relationship between an
independent and dependent variable. (Creswell 1994a)

Hence, hypothesis can be defined as a mechanism to formally establish the relationship


between variables in the research. The things that a researcher intends to investigate are for-
mulated in the form of a hypothesis. By formulating a hypothesis, the research objectives or
the key concepts involved in the research are defined more specifically. Each hypothesis can
be tested for its verifiability or falsifiability. Figure 4.6 shows the process of generation of
hypothesis in a research. As shown in the figure, research questions can either be generated
from the problem statement or from well-formed ideas extracted from the literature survey.
After the development of the research questions, the research hypothesis can be formed.

4.7.3 Purpose and Importance of Hypotheses in an Empirical Research


Aquino (1992) defined the importance of formation of hypothesis in an empirical study.
The key advantages of hypothesis formation are given below:

• It provides the researcher with a relational statement that can be directly tested in
a research study.

FIGURE 4.6
Generation of hypothesis in a research: primary thoughts (not fully formed), primary observations, and exploring, searching data, and conducting surveys lead to the problem statement; the problem statement and thought-through, well-formed ideas lead to the research questions, from which the research hypothesis is derived.

• It helps in formulation of conclusions of the research.


• It helps in forming a tentative or an educated guess about any phenomena in a
research.
• It provides direction to the collection of data for validation of hypothesis and thus
helps in carrying the research forward.
• Even if the hypothesis is proven to be false, it leads to a specific conclusion.

4.7.4 How to Form a Hypothesis?


Once the RQs are developed or research problem is clearly defined, hypothesis can be
derived from the RQs or research problem by identifying key variables and identifying
the relationship between the identified variables. The steps that are followed to form the
hypothesis are given below:

1. Understand the problem/situation: Clearly understanding the problem is very
important. This can be done by breaking down the problem into smaller parts and
understanding each part separately. The problem can be restated in words to have
a clear understanding of the problem. The meaning of all the words used in stating
the problem should be clear and unambiguous. The remaining steps are based on
the problem definition. Hence, understanding the problem becomes a crucial step.
2. Identify the key variables required to measure the problem/situation: The key
variables used in the hypothesis testing must be selected from the independent
and dependent variables identified in Section 4.5. The effect of independent vari-
able on the dependent variable needs to be identified and analyzed.
3. Make an educated guess about the relationship between the variables:
An “educated guess” is a statement based on the available RQs or the given problem
and will eventually be tested. Generally, the relationship established between the
independent and dependent variable is stated as an “educated guess.”

TABLE 4.5
Transition from RQ to Hypothesis
RQ                                                   Corresponding Hypothesis
Is X related to Y?                                   If X, then Y.
How are X and Y related to Z?                        If X and Y, then Z.
How is X related to Y and Z?                         If X, then Y and Z.
How is X related to Y under conditions Z and W?      If X, then Y under conditions Z and W.

4. Write down the hypotheses in a format that is testable through scientific research:
There are two types of hypothesis—null and alternative hypotheses. Correct for-
mation of null and alternative hypotheses is the most important step in hypoth-
esis testing. The null hypothesis is also known as hypothesis of no difference and
denoted as H0. The null hypothesis is the proposition that implies that there is no
statistically significant relationship within a given set of parameters. It denotes the
reverse of what the researcher in his experiment would actually expect or predict.
Alternative hypothesis is denoted as Ha. The alternative hypothesis reflects that a
statistically significant relationship does exist within a given set of parameters. It
is the opposite of null hypothesis and is only reached if H0 is rejected. The detailed
explanation of null and alternative hypothesis is stated in the next Section 4.7.5.
Table 4.5 presents corresponding hypothesis to given RQs.

Some of the examples to show the transition from an RQ to a hypothesis are stated below:

RQ: What is the relation of coupling between classes and maintenance effort?
Hypothesis: Coupling between classes and maintenance effort are positively related
to each other.

RQ: Are walkthroughs more effective in finding faults than inspections?


Hypothesis: Walkthroughs are more effective in finding faults than inspections.

Example 4.1:
There are various factors that may have an impact on the amount of effort required to
maintain a software. The programming language in which the software is developed
can be one of the factors affecting the maintenance effort. There are various program-
ming languages available such as Java, C++, C#, C, Python, and so on. There is a
need to identify whether these languages have a positive, negative, or neutral effect
on the maintenance effort. It is believed that programming languages have a positive
impact on the maintenance effort. However, this needs to be tested and confirmed
scientifically.

Solution:
The problem and hypothesis derived from it is given below:

1. Problem: Need to identify the relationship between the programming lan-
guage used in a software and the maintenance effort.
2. RQ: Is there a relation between programming language and maintenance effort?
3. Key variables: Programming language and maintenance effort

4. Educated guess: Programming language is related to effort and has a positive
impact on the effort.
5. Hypothesis: Programming language and maintenance effort are positively
related to each other.

4.7.5 Steps in Hypothesis Testing


The hypothesis testing involves a series of steps. Figure 4.7 depicts the steps in hypoth-
esis testing. Hypothesis testing begins with the assumption that the null hypothesis is
correct, and the aim is to show that this assumption of no difference is not consistent with
the observed data. For example, if we strongly believe that technique A is better than
technique B, despite our strong belief, we begin by assuming that the belief is not true,
and we then look for evidence that allows the null hypothesis to be rejected.
The various steps involved in hypothesis testing are described below.

4.7.5.1 Step 1: State the Null and Alternative Hypothesis


The null hypothesis is formulated because it is expected to be rejected, that is, it can be
shown to be false, which then implies that a relationship exists in the observed data.
One needs to be specific about what it means if the null hypothesis is not rejected.
It only means that there is no sufficient evidence present against null hypothesis (H0),
which is in favor of alternative hypothesis (Ha). There might actually be a difference, but
on the basis of the sample result such a difference has not been detected. This is analo-
gous to a legal scenario where if a person is declared “not guilty,” it does not mean that
he is innocent.

FIGURE 4.7
Steps in hypothesis testing: (1) define the hypothesis (null and alternative); (2) select the appropriate statistical test and check its assumptions; (3) apply the test and calculate the p-value; (4) define the significance level; (5) derive conclusions by checking the statistical significance of the results.

The null hypothesis can be written in mathematical form, depending on the particular
descriptive statistic using which the hypothesis is made. For example, if the descriptive
statistic is used as population mean, then the general form of null hypothesis is,

H0: µ = X

where:
µ is the mean
X is the predefined value

In this example, whether the population mean equals X or not is being tested.
There are two possible scenarios through which the value of X can be derived. This
depends on two different types of RQs. In other words, the population parameter (mean in
the above example) can be assigned a value in two different ways. The first is that the
predetermined value is selected for practical or proven reasons. For example, a software
company decides that 7 is its predetermined quality parameter for mean coupling. Hence,
all the departments will be informed that the modules must have a value of <7 for coupling
to ensure less complexity and high maintainability. Similarly, the company may decide
that it will devote all the testing resources to those faults that have a mean rating above 3.
The testers will therefore want to test specifically all those faults that have mean rating >3.
Another situation is where a population under investigation is compared with another
population whose parameter value is known. For example, from the past data it is known
that average productivity of employees is 30 for project A. We want to see whether the
average productivity of employees is also 30 for project B. Thus, we want to make an
inference whether the unknown average productivity for project B is equal to the known
average productivity for project A.
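As a small illustration of testing H0: µ = X, the sketch below (assuming Python with SciPy; the coupling values are invented for illustration) tests whether the mean coupling of a set of modules equals the predetermined value of 7 discussed above.

from scipy import stats

# Illustrative coupling values for a sample of modules (not real data)
coupling = [5, 8, 6, 9, 4, 7, 10, 6]

# One-sample t-test of H0: mu = 7 against the two-sided alternative Ha: mu != 7
t_stat, p_value = stats.ttest_1samp(coupling, popmean=7)
print(t_stat, p_value)   # a large p-value gives no evidence against H0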
The general form of alternative hypothesis when the descriptive parameter is taken as
mean (µ) is,
Ha: µ ≠ X

where:
µ is the mean
X is the predefined value

The above hypothesis represents a nondirectional hypothesis as it just denotes that there
will be a difference between the two groups, without discussing how the two groups differ.
The example is stated in terms of two popularly used methods to measure the size of soft-
ware, that is, (1) LOC and (2) function point analysis (FPA). The nondirectional hypothesis
can be stated as, “The size of software as measured by the two techniques is different.”
In contrast, when the hypothesis shows the relationship between the two groups rather
than simply comparing them, it is known as a directional hypothesis. Comparison terms
such as “greater than” and “less than” are used in its formulation. In other words, it
specifies how the two groups differ. For
example, “The size of software as measured by FPA is more accurate than LOC.” Thus, the
direction of difference is mentioned. The same concept is represented by one-tailed and
two-tailed tests in statistical testing and is explained in Section 6.4.3.
One important point to note is that the potential outcome that a researcher is expecting
from his/her experiment is denoted in terms of alternative hypothesis. What is believed
to be the theoretical expectation or concept is written in terms of alternative hypothesis.

Thus, sometimes the alternative hypothesis is referred to as the research hypothesis. Now,
if the alternative hypothesis represents the theoretical expectation or concept, then what
is the reason for performing the hypothesis testing? This is done to check whether the
formed or assumed concepts are actually significant or true. Thus, the main aim is to check
the validity of the alternative hypothesis. If the null hypothesis cannot be rejected, it signifies
that the research idea or concept is not supported by the evidence.

4.7.5.2 Step 2: Choose the Test of Significance


There are a number of tests available to assess the null hypothesis. The choice among them is
to be made to check which test is applicable in a particular situation. The four important fac-
tors that are based on assumptions of statistical tests and help in test selection are as follows:

• Type of distribution—Whether data is normally distributed or not?


• Sample size—What is the sample size of the data set?
• Type of variables—What is the measurement scale of variables?
• Number of independent variables—What is the number of factors or variables in
the study?

There are various tests available in research for verifying hypothesis and are given as
follows:

1. t-test for the equality of two means


2. ANOVA for equality of means
3. Paired t-test
4. Chi-square test for goodness-of-fit
5. Friedman test
6. Mann–Whitney test
7. Kruskal–Wallis test
8. Wilcoxon signed-rank test

The details of all the tests can be found in Section 6.4.
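The distribution check listed as the first factor above is usually the deciding one between a parametric test (such as the t-test) and a nonparametric alternative (such as the Wilcoxon signed-rank test). A minimal sketch, assuming Python with SciPy, is given below; the sample reuses the CBO values for faulty classes shown later in Table 4.6.

from scipy import stats

cbo_faulty = [45, 56, 34, 71, 23, 9]   # CBO values for faulty classes (Table 4.6)

# Shapiro-Wilk test of the normality assumption
w_stat, p_value = stats.shapiro(cbo_faulty)
if p_value > 0.05:
    print("No evidence against normality; a parametric test such as the t-test may be used")
else:
    print("Normality is doubtful; a nonparametric test such as the Wilcoxon signed-rank test is safer")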

4.7.5.3 Step 3: Compute the Test Statistic and Associated p-Value


In this step, the descriptive statistic is calculated, which is specified by the null hypoth-
esis. There can be many statistical tests as discussed above that can be applied in practice.
But the statistic that is actually calculated depends on the statistic used in hypothesis.
For example, if the null hypothesis were defined by the statistic µ, then the statistics
computed on the data set would be the mean and the standard deviation. Usually, the
calculated statistic does not exactly match the value given by the null hypothesis, but this
by itself is not a cause for concern. What is actually needed is the probability of obtaining
a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
This is called the significance of the test statistic, known as the p-value. The p-value is
compared with the significance level determined in the next step. This step is carried out
in the result execution phase of an empirical study.

FIGURE 4.8
Critical region (the shaded regions of rejection at the two ends of the normal curve).

4.7.5.4 Step 4: Define Significance Level


The critical value or significance value (typically known as α) is determined at this step.
The level of significance or α-value is a threshold. When a researcher plans to perform
a significance test in an empirical study, a decision has to be made on the maximum risk
of wrongly rejecting a true null hypothesis (Type I error) that will be tolerated.
Figure 4.8 depicts the critical region in a normal curve as shaded portions at two ends.
Generally, this significance value is taken as 0.05 or 0.01. The critical value signifies the
critical region or region of rejection. The critical region or region of rejection specifies the
range of values for which the null hypothesis is rejected. Using the significance
value, the researcher determines the region of rejection and region of acceptance for the
null hypothesis.

4.7.5.5 Step 5: Derive Conclusions


The conclusions about the acceptance or rejection of the formed hypothesis are made in
this step. Using the decided significance value, the region of rejection is determined. The
significance value is used to decide whether or not to reject the null hypothesis. The lower
the observed p-value, the stronger the evidence against the null hypothesis. If the
computed p-value is less than the defined significance threshold then the null hypothesis
is rejected and the alternative hypothesis is accepted. In other words, if the p-value lies in
the rejection region then the null hypothesis is rejected. Figure 4.9 shows the significance
levels of p-value.
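The bands in Figure 4.9 can be captured in a small helper, sketched below in Python; the function name is illustrative.

def significance_band(p_value):
    # Maps a p-value to the bands shown in Figure 4.9
    if p_value <= 0.01:
        return "significant at 0.01"
    if p_value <= 0.05:
        return "significant at 0.05"
    return "not significant"

print(significance_band(0.032))   # prints "significant at 0.05"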
The meaning or inference from the results must be determined in this step rather than
just repeating the statistics. This step is part of the execution phase of empirical study.
Consider the data set given in Table 4.6. The data consists of six data points. In this exam-
ple, the coupling aspect for faulty and nonfaulty classes is to be compared. The coupling for
faulty classes and coupling of nonfaulty classes for a given software is shown in Table 4.6.

FIGURE 4.9
Significance levels: p-value ≤ 0.01 is significant at 0.01; 0.01 < p-value ≤ 0.05 is significant at 0.05; p-value > 0.05 is not significant.

TABLE 4.6
A Sample Data Set

S. No.    CBO for Faulty Modules    CBO for Nonfaulty Modules
1         45                        9
2         56                        9
3         34                        9
4         71                        7
5         23                        10
6         9                         15
Mean      39.67                     9.83

Step 1: RQ from problem statement


RQ: Is there a difference in coupling values for faulty classes and coupling values for
nonfaulty classes?
Step 2: Deriving hypothesis from RQ
In the first step, the hypothesis is derived from the RQ:
H0: There is no statistical difference between the coupling for faulty classes and
coupling for nonfaulty classes.
Ha: The coupling for faulty classes is more than the coupling for nonfaulty classes.
Mathematically,

H0: µ(CBO_faulty) = µ(CBO_nonfaulty)

Ha: µ(CBO_faulty) > µ(CBO_nonfaulty) or µ(CBO_faulty) < µ(CBO_nonfaulty)

Step 3: Determining the appropriate test to apply


As the problem is of comparing means of two dependent samples (collected from
same software), the paired t-test is used. In Chapter 6, the conditions for selecting
appropriate tests are given.
Step 4: Calculating the value of test statistic
Table 4.7 shows the intermediary calculations of t-test.
The t-statistic is given as:

t = (µ1 − µ2) / (σd / √n)

where:
µ1 is the mean of the first population
µ2 is the mean of the second population

σd = √{ [Σd² − (Σd)²/n] / (n − 1) }

where:
n represents the number of pairs and not the total number of samples
d is the difference between the values of the two samples

TABLE 4.7
t-Test Calculations

CBO for Faulty Modules    CBO for Nonfaulty Modules    Difference (d)    d²
45                        9                            36                1,296
56                        9                            47                2,209
34                        9                            25                625
71                        7                            64                4,096
23                        10                           13                169
9                         15                           –6                36

Substituting the values of mean, variance, and sample size in the above formulas, the
t-score is obtained as:

σd = √{ [8431 − (179)²/6] / 5 } = 24.86

t = (39.67 − 9.83) / (24.86/√6) ≈ 2.94

As the alternative hypothesis is of the form Ha: µ1 > µ2 or µ1 < µ2, the test is
nondirectional (two-tailed). Let us take the level of significance (α) for the two-
tailed test as 0.05.
Step 5: Determine the significance value
The p-value at significance level of 0.05 (two-tailed test) is considered and df as 5.
From the t-distribution table, it is observed that the p-value is 0.032 (refer to Section
6.4.6 for computation of p-value).
Step 6: Deriving conclusions
Now, to decide whether to accept or reject the null hypothesis, this p-value is compared
with the level of significance. As the p-value (0.032) is less than the level of significance
(0.05), the H0 is rejected. In other words, the alternative hypothesis is accepted. Thus,
it is concluded that there is statistical difference between the average of coupling
metrics for faulty classes and the average of coupling metrics for nonfaulty classes.
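The hand calculation above can be cross-checked with a few lines of Python, assuming SciPy is available; ttest_rel performs the paired t-test and reports a two-tailed p-value. Minor differences from the hand-rounded figures are expected.

from scipy import stats

cbo_faulty = [45, 56, 34, 71, 23, 9]      # CBO for faulty classes (Table 4.6)
cbo_nonfaulty = [9, 9, 9, 7, 10, 15]      # CBO for nonfaulty classes (Table 4.6)

# Paired (dependent-samples) t-test with a two-tailed p-value
t_stat, p_value = stats.ttest_rel(cbo_faulty, cbo_nonfaulty)
print(round(t_stat, 2), round(p_value, 3))   # approximately 2.94 and 0.032; H0 is rejected at the 0.05 level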

4.7.6 Example: Hypothesis Formulation in FPS


There are a few RQs that the study intends to answer (stated in Section 4.3.3). Based on these
RQs, the study builds some hypotheses that are tested. There are two sets of hypotheses,
“Hypothesis Set A” and “Hypothesis Set B.” Hypothesis Set A focuses on the hypotheses
related to the relationship between OO metrics and fault proneness; whereas hypothe-
sis set B focuses on the comparison in the performance of machine learning techniques
and LR method. Thus, hypothesis in set A deals with the RQs 1, 2, and 3, whereas hypoth-
esis in set B deals with the RQ 4.

4.7.6.1 Hypothesis Set A


There are a number of OO metrics used as independent variables in the study. These are
CBO, RFC, LCOM, NOC, DIT, WMC, and source LOC (SLOC). The hypotheses given below
are tested to find the individual effect of each OO metric on fault proneness at different
severity levels of faults:

CBO hypothesis—H0: There is no statistical difference between a class having high
import or export coupling and a class having less import or export coupling.
Ha: A class with high import or export coupling is more likely to be fault prone than
a class with less import or export coupling.
RFC hypothesis—H0: There is no statistical difference between a class having a high
number of methods implemented within a class and the number of methods acces-
sible to an object class because of inheritance, and a class with a low number of
methods implemented within a class and the number of methods accessible to an
object class because of inheritance.
Ha: A class with a high number of methods implemented within a class and the
number of methods accessible to an object class because of inheritance is more
likely to be fault prone than a class with a low number of methods implemented
within a class and the number of methods accessible to an object class because of
inheritance.
LCOM hypothesis—H0: There is no statistical difference between a class having less
cohesion and a class having high cohesion.
Ha: A class with less cohesion is more likely to be fault prone than a class with high
cohesion.
NOC hypothesis—H0: There is no statistical difference between a class having greater
number of descendants and a class having fewer descendants.
Ha: A class with a greater number of descendants is more likely to be fault prone than
a class with fewer descendants.
DIT hypothesis—H0: There is no statistical difference between a class having large
depth in inheritance tree and a class having small depth in inheritance tree.
Ha: A class with a large depth in inheritance tree is more likely to be fault prone than
a class with a small depth in inheritance tree.
WMC hypothesis—H0: There is no statistical difference between a class having a large
number of methods weighted by complexities and a class having a smaller number of
methods weighted by complexities.
Ha: A class with a large number of methods weighted by complexities is more likely
to be fault prone than a class with a smaller number of methods weighted by
complexities.

4.7.6.2 Hypothesis Set B


The study constructs various fault proneness prediction models using a statistical tech-
nique and two machine learning techniques. The statistical technique used is the LR and
the machine learning techniques used are DT and ANN. The hypotheses given below

are tested to compare the performance of regression and machine learning techniques at
different severity levels of faults:

1. H0: LR models do not outperform models predicted using DT.


Ha: LR models do outperform models predicted using DT.
2. H0: LR models do not outperform models predicted using ANN.
Ha: LR models do outperform models predicted using ANN.
3. H0: ANN models do not outperform models predicted using DT.
Ha: ANN models do outperform models predicted using DT.

4.8 Data Collection


Empirical research involves collecting and analyzing data. The data collection needs to be
planned and the source (people or repository) from which the data is to be collected needs
to be decided.

4.8.1 Data-Collection Strategies


The data collected for research should be accurate and reliable. There are various data-
collection techniques that can be used for collection of data. Lethbridge et al. (2005) divide
the data-collection techniques into the following three levels:

First degree: The researcher is in direct contact or involvement with the subjects
under concern. The researcher or software engineer may collect data in real-time.
For example, under this category, the various methods are brainstorming, inter-
views, questionnaires, think-aloud protocols, and so on. There are various other
methods as depicted in Figure 4.10.
Second degree: There is no direct contact of the researcher with the subjects during
data collection. The researcher collects the raw data without any interaction with
the subjects. For example, observations through video recording and fly on the wall
(participants taping their work) are the two methods that come under this category.
Third degree: There is access only to the work artifacts. In this, already avail-
able and compiled data is used. For example, analysis of various documents
produced from an organization such as the requirement specifications, fail-
ure reports, document change logs, and so on come under this category. There
are various reports that can be generated using different repositories such
as change report, defect report, effort data, and so on. All these reports play
an important role while conducting a research. But the accessibility of these
reports from the industry or any private organization is not an easy task. This
is discussed in the next subsection, and the detailed collection methods are
presented in Chapter 5.

The main advantage of the first and second degree methods is that the researcher has
control over the data to a large extent. Hence, the researcher needs to formulate and decide

on data-collection methods in the experimental design phase. The methods under these
categories require effort from both the researcher and the subject. For this reason, first
degree methods are more expensive than the second or third degree methods. Third
degree methods are the least expensive, but the control over the data is minimal. This
compromises the quality of the data, as the correctness of the data is not under the direct
control of the researcher.

FIGURE 4.10
Various data-collection strategies. First degree (direct involvement of software engineers): inquisitive techniques (brainstorming and focus groups, interviews, questionnaires, conceptual modeling) and observational techniques (work diaries, think-aloud protocols, shadowing and observation, synchronized shadowing, participant observation by joining the team). Second degree (indirect involvement of software engineers): instrumenting systems, and fly on the wall (participants taping their work). Third degree (study of work artifacts only): analysis of electronic databases of work performed, analysis of tool use logs, documentation analysis, and static and dynamic analysis of a system.
Under the first degree category, interviews and questionnaires are the easiest and most
straightforward methods. In interview-based data collection, the researcher pre-
pares a list of questions about the areas of interest. Then, an interview session takes
place between the researcher and the subject(s), wherein the researcher can ask vari-
ous research-related questions. Questions can be either open, inviting multiple and
broad range of answers, or closed, offering a limited set of answers. The drawback of
collecting data from interviews and questionnaires is that they produce typically an
incomplete picture. For example, if one wants to know the number of LOC in a soft-
ware program. Conducting interviews and questionnaires will only provide us general
opinions and evidence, but the accurate information is not provided. Methods such as
think-aloud protocols and work diaries can be used for this strategy of data collection.
Second degree requires access to the environment in which participants or subject(s)
work, but without having direct contact with the participants. Finally, the third degree
requires access only to work artifacts, such as source code or bugs database or docu-
mentation (Wohlin 2012).

4.8.2 Data Collection from Repositories


The empirical study is based on the data that is often collected from software reposito-
ries. In general, it is seen in the literature that data collected is either from academic or
university systems, industrial or commercial systems, and public or open source software.
The academic data is the data that is developed by the students of some university.
Industrial data is the proprietary data belonging to some private organization or a
company. Public data sets are available freely to everyone for use and do not require any
payment from the user. The differences between them are stated in Table 4.8.

TABLE 4.8
Differences between the Types of Data Sets

1. Source: Academic data sets are obtained from projects made by the students of some university; industrial data sets from projects developed by experienced and qualified programmers; open source data sets from projects developed by experienced developers located at different geographical locations.
2. Ease of access: Academic data sets are easy to obtain; industrial data sets are difficult to obtain; open source data sets are easy to obtain.
3. Maintenance history: Academic data sets are obtained from data that is not necessarily maintained over a long period of time; industrial and open source data sets are obtained from data maintained over a long period of time.
4. Reliability of results: Results on academic data sets are not reliable and acceptable; results on industrial data sets are highly reliable and acceptable; results on open source data sets may be reliable and acceptable.
5. Availability: Academic data sets are freely available; industrial data sets may or may not be freely available; open source data sets are generally freely available.
6. Development approach: Academic projects use an ad hoc approach; industrial projects use a very well planned approach; open source projects use a well planned and mature approach.
7. Code availability: Academic code may be available; industrial code is not available; open source code is easily available.
8. Examples: Academic: software developed in a university such as LALO (Briand et al. 2001), UMD (Briand et al. 2000), USIT (Aggarwal et al. 2009). Industrial: performance management traffic recording (Lindvall 1998), a commercial OO system implemented in C++ (Bieman et al. 2003), UIMS (Li and Henry 1993), QUES (Li and Henry 1993). Open source: Android, Apache Tomcat, Eclipse, Firefox, and so on.
It is relatively easy to obtain the academic data as it is free from confidentiality concerns
and, hence, gaining access to such data is easier. However, the accuracy and reliability of
the academic data is questionable while conducting research. This is because the university
software is developed by a small number of inexperienced programmers and is typically
not applicable in real-life scenarios. Besides the university data sets, there is public or open
source software that is widely used for conducting empirical research in the area of soft-
ware engineering. The use of open source software allows the researchers to access vast
repositories of reasonable quality, large-sized software. The most important type of data is
the proprietary/industrial data that is usually owned by a corporation/organization and
is not publically available.
The usage of open source software has been on the rise, with products such as Android
and Firefox becoming household names. However, the majority of the software devel-
oped across the world, especially the high-quality software, still remains proprietary
software. This is because of the fact that given the voluntary nature of developers for
open source software, the attention of the developers might shift elsewhere leading to
lack of understanding and poor quality of the end product. For the same reason, there
are also challenges with timeliness of the product development, rigor in testing and
documentation, as well as characteristic lack of usage support and updates. As opposed
to this, the proprietary software is typically developed by an organization with clearly

demarcated manpower for design, development, and testing of the software. This allows
for committed, structured development of software for a well-defined end use, based on
robust requirement gathering. Therefore, it is imperative that the empirical studies in
software engineering be validated over data from proprietary systems, because the devel-
opers of such proprietary software would be the key users of the research. Additionally,
industrial data is better suited for empirical research because the development follows
a structured methodology, and each step in the development is monitored and docu-
mented along with its performance measurement. This leads to development of code that
follows rigorous standards and robustly captures the data sets required by the academia
for conducting their empirical research.
At the same time, access to the proprietary software code is not easily obtained. For most
of the software development organizations, the software constitutes their key intellectual
asset and they undertake multiple steps to guard the privacy of the code. The world’s most
valuable products, such as Microsoft Windows and Google search, are built around their
closely held patented software to guard against competition and safeguard their products
developed with an investment of billions of dollars. Even if there is appreciation of the role
and need of the academia to access the software, the enterprises typically hesitate to share
the data sets, leading to roadblocks in the progress of empirical research.
It is crucial for the industry to appreciate that the needs of the empirical research do not
impinge on their considerations of software security. The data sets required by the academia
are the metrics data or the data from the development/testing process, and do not
compromise the security of the source code, which is the primary concern of the industry. For
example, assume an organization uses commercial code management system/test manage-
ment system such as HP Quality Center or HP Application Lifecycle Management. Behind
the scenes, a database would be used to store information about all modules, including all
the code and its versions, all development activity in full detail, and the test cases and their
results. In such a scenario, the researcher does not need access to the data/code stored in the
database, which the organization would certainly be unwilling to share, but rather specific
reports corresponding to the problem he wishes to address. As an illustration, for a defect
prediction study, only a list of classes with corresponding metrics and defect count would
be required, which would not compromise the interests of the organization. Therefore, with
mutual dialogue and understanding, appropriate data sets could be shared by the industry,
which would create a win-win situation and lead to betterment of the process. The key chal-
lenge, which needs to be overcome, is to address the fear of the enterprises regarding the
type of data sets required and the potential hazards. A constructive dialogue to identify the
right reports would go a long way towards enabling the partnership because access to the
wider database with source code would certainly be impossible.
Once the agreement with the industry has been reached and the right data sets have been
received, the attention can be shifted to actual conducting of the empirical research with
the more appropriate industrial data sets. The benefits of using the industrial database
would be apparent in the thoroughness of the data sets available and the consistency of
the software system. This would lead to more accurate findings for the empirical research.

4.8.3 Example: Data Collection in FPS


This empirical study given in Section 4.2 makes use of the public domain data set KC1 from
the NASA metrics data program (MDP) (NASA 2004; PROMISE 2007). The NASA data
repository stores the data, which is collected and validated by the MDP (2006). The data
in KC1 is collected from a storage management system for receiving/processing ground

data, which is implemented in the C++ programming language. Fault data for KC1 is
collected since the beginning of the project (storage management system), but the faults
can only be traced back over five years (MDP 2006). This system consists of 145 classes
that comprise 2,107 methods, with 40K LOC. KC1 provides both class-level and method-
level static metrics. At the method level, 21 software product metrics based on product’s
complexity, size, and vocabulary are given. At the class level, values of ten metrics are
computed, including six metrics given by Chidamber and Kemerer (1994). The seven OO
metrics are taken in this study for analyses. In KC1, six files provide association between
class/method and metric/defect data. In particular, there are four files of interest, the first
representing the association between classes and methods, the second representing asso-
ciation between methods and defects, the third representing association between defects
and severity of faults, and the fourth representing association between defects and specific
reason for closure of the error report.
First, defects are associated with each class according to their severities. The value of
severity quantifies the impact of the defect on the overall environment with 1 being most
severe to 5 being least severe as decided in data set KC1. The defect data from KC1 is
collected from information contained in error reports. An error could originate from the
source code, COTS/OS, or design, or it may actually not be a fault. The defects produced from the
source code, COTS/OS, and design are taken into account. The data is further processed
by removing all the faults that had “not a fault” keyword used as the reason for closure of
error report. This reduced the number of faults from 669 to 642. Out of 145 classes, 59 were
faulty classes, that is, classes with at least one fault and the rest were nonfaulty.
In this study, the faults are categorized as high, medium, or low severity. Faults with
severity rating 1 were classified as high-severity faults. Faults with severity rating 2 were
classified as medium-severity faults and faults with severity rating 3, 4, and 5 as low-sever-
ity faults, as at severity rating 4 no class is found to be faulty and at severity rating 5 only
one class is faulty. Faults at severity rating 1 require immediate correction for the system
to continue to operate properly (Zhou and Leung 2006).
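The preprocessing described above can be sketched in a few lines of Python with pandas. The file name and column names below (defects.csv, class_id, severity, closure_reason) are hypothetical placeholders and do not reflect the actual layout of the KC1 files.

import pandas as pd

defects = pd.read_csv("defects.csv")                              # hypothetical defect report extract
defects = defects[defects["closure_reason"] != "not a fault"]     # drop error reports closed as "not a fault"

def severity_band(severity):
    # Severity 1 -> high, 2 -> medium, 3-5 -> low, as described above
    if severity == 1:
        return "high"
    if severity == 2:
        return "medium"
    return "low"

defects["band"] = defects["severity"].apply(severity_band)
print(defects.groupby("band").size())                 # number of faults per severity band
print(defects.groupby("band")["class_id"].nunique())  # number of faulty classes per severity band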
Table 4.9 summarizes the distribution of faults and faulty classes at high-, medium-, and
low-severity levels in the KC1 NASA data set after preprocessing of faults in the data set.
High-severity faults were distributed in 23 classes (15.56%). There were 48 high-severity
faults (7.47%), 449 medium-severity faults (69.93%), and 145 low-severity faults (22.59%). As
shown in Table 4.9, majority of the classes are faulty at severity rating medium (58 out of
59 faulty classes). Figure 4.11a–c shows the distribution of high-severity faults, medium-
severity faults, and low-severity faults. It can be seen from Figure 4.11a that 22.92% of
classes with high-severity faults contain one fault, 29.17% of classes contain two faults, and
so on. In addition, the maximum number of faults (449 out of 642) is covered at medium
severity (see Figure 4.11b).

TABLE 4.9
Distribution of Faults and Faulty Classes at High-, Medium-, and Low-Severity Levels

Level of Severity    Number of Faulty Classes    % of Faulty Classes    Number of Faults    % Distribution of Faults
High                 23                          15.56                  48                  7.47
Medium               58                          40.00                  449                 69.93
Low                  39                          26.90                  145                 22.59

FIGURE 4.11
Distribution of (a) high-, (b) medium-, and (c) low-severity faults (pie charts showing the percentage of faulty classes by the number of faults they contain).

4.9 Selection of Data Analysis Methods


There are various data analysis methods available in the literature (such as statistical,
machine learning) that can be used to analyze different kinds of gathered data. It is very
essential to carefully select the methods to be used while conducting a research. But it is very
difficult to select appropriate data analysis method for a given research. Among various
available data analysis methods, we can select the most appropriate method by comparing
different parameters and properties of all the available methods. Besides this, there are very
few sources available that provide guidance for selection of data analysis methods.
In this section, guidelines that can be used for the appropriate selection of the data anal-
ysis methods are presented. The selection of a data analysis technique can be made based
on the following three criteria: (1) the type of dependent variable, (2) the nature of data set,
or (3) the important aspects of different methods.

4.9.1 Type of Dependent Variable


The data analysis methods can be selected based on the type of the dependent variable
being used. The dependent variable can be either discrete/binary or continuous. A discrete
variable is a variable that can only take a finite number of values, whereas a continuous
variable can take infinite number of values between any two points. If the dependent vari-
able is binary (e.g., fault proneness, change proneness), then among statistical techniques,
the researcher can use the LR and discriminant analysis. The examples of machine learning
classifiers that support binary-dependent variable are DT, ANN, support vector machine,
random forest, and so on. If the dependent variable is continuous, then the selection of
data analysis method depends on whether the variable is a count variable (i.e., used for
counting purpose) or not a count variable. The examples of continuous count variable are
number of faults, lines of source code, and development effort. ANN is one of the machine
learning techniques that can be used in this case. In addition, for noncount continuous-
dependent variable, the traditional ordinary least squares (OLS) regression model can be
used. The diagrammatic representation of the selection of appropriate data analysis meth-
ods based on type of dependent variable is shown in Figure 4.12.
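The mapping in Figure 4.12 can be illustrated with a short, hypothetical sketch, assuming scikit-learn; the metric values, fault labels, and effort figures below are invented for illustration only.

from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[10, 2.5], [45, 7.1], [23, 4.0], [5, 1.2]]     # e.g., CBO and WMC per class (illustrative values)
fault_prone = [0, 1, 1, 0]                          # binary dependent variable
maintenance_effort = [12.0, 80.5, 35.2, 6.4]        # continuous dependent variable (illustrative person-hours)

clf = LogisticRegression().fit(X, fault_prone)        # binary outcome: a classification technique such as LR
reg = LinearRegression().fit(X, maintenance_effort)   # continuous outcome: OLS-style regression
print(clf.predict([[30, 5.0]]), reg.predict([[30, 5.0]]))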

4.9.2 Nature of the Data Set


Other factors to consider when choosing and applying a learning method include the
following:

1. Diversity in data: The variables or attributes of the data set may belong to different
categories such as discrete, continuous, discrete ordered, counts, and so on. If the
attributes are of many different kinds, then some of the algorithms are preferable
over others as they are easy to apply. For example, among machine learning tech-
niques, support vector machine, neural networks, and nearest neighbor methods
require that the input attributes are numerical and scaled to similar ranges (e.g., to
the [–1,1] interval). Among statistical techniques, linear regression and LR require

the input attributes to be numerical (a brief scaling sketch is given after this list).

FIGURE 4.12
Selection of data analysis methods based on the type of dependent variable (binary: logistic regression or discriminant analysis among statistical techniques, and decision tree, support vector machine, or artificial neural network among machine learning techniques; continuous: artificial neural network among machine learning techniques, and linear regression or ordinary least squares among statistical techniques).

The machine learning technique that can han-
dle heterogeneous data is DT. Thus, if our data is heterogeneous, then one may
apply DT instead of other machine learning techniques (such as support vector
machine, neural networks, and nearest neighbor methods).
2. Redundancy in the data: There may be some independent variables that are redun-
dant, that is, they are highly correlated with other independent variables. It is advis-
able to remove such variables to reduce the number of dimensions in the data set.
But still, sometimes it is found that the data contains the redundant information. In
this case, the researcher should make careful selection of the data analysis methods,
as some of the methods will give poorer performance than others. For example, linear
regression, LR, and distance-based methods will give poor performance because of
numerical instabilities. Thus, these methods should be avoided.
3. Type and existence of interactions among variables: If each attribute makes an
independent impact or contribution to the output or dependent variable, then
the techniques based on linear functions (e.g., linear regression, LR, support vec-
tor machines, naïve Bayes) and distance functions (e.g., nearest neighbor meth-
ods, support vector machines with Gaussian kernels) perform well. But, if the
interactions among the attributes are complex and large in number, then DT and
neural networks should be used, as these techniques are specifically designed to
deal with such interactions.
4. Size of the training set: Selection of appropriate method is based on the tradeoff
between bias/variance. The main idea is to simultaneously minimize bias and
variance. Models with high bias will result in underfitting (do not learn relation-
ship between the dependent and independent variables), whereas models with
high variance will result in overfitting (noise in the data). Therefore, a good learn-
ing technique automatically adjusts the bias/variance trade-off based on the size
of training data set. If the training set is small, high bias/low variance classifiers
should be used over low bias/high variance classifiers. For example, naïve Bayes
has a high bias/low variance (naïve Bayes is simple and assumes independence of
variables) and k-nearest neighbor has a low bias/high variance. But as the size of
training set increases, low bias/high variance classifiers show good performance
(they have lower asymptotic error) as compared with high bias/low variance clas-
sifiers. High bias classifiers (linear) are not powerful enough to provide accurate
models.
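As noted in the first item of this list, distance- and margin-based learners work best when numerical attributes are scaled to similar ranges. A minimal sketch, assuming scikit-learn, is given below; the metric values are illustrative.

from sklearn.preprocessing import MinMaxScaler

X = [[10, 500], [45, 12000], [23, 3400], [5, 150]]   # e.g., CBO and SLOC, on very different ranges (illustrative)

# Rescale every attribute to the [-1, 1] interval before using SVM, neural networks,
# or nearest neighbor methods
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)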

4.9.3 Aspects of Data Analysis Methods


There are various machine learning tasks available. To implement each task, there are vari-
ous learning methods that can be used. The various machine learning tasks along with the
data analysis algorithms are listed in Table 4.10.
To implement each task, the appropriate method has to be selected from among the various
learning methods. This selection is based on the following important aspects of these methods:

1. Accuracy: It refers to the predictive power of the technique.


2. Speed: It refers to the time required to train the model and the time required to
test the model.
3. Interpretability: The results produced by the technique are easily interpretable.
4. Simplicity: The technique must be simple in its operation and easy to learn.

TABLE 4.10
Data Analysis Methods Corresponding to Machine Learning Tasks

S. No.    Machine Learning Task      Data Analysis Methods
1         Multivariate querying      Nearest neighbor, farthest neighbor
2         Classification             Logistic regression, decision tree, nearest neighbor classifier, neural network, support vector machine, random forest
3         Regression                 Linear regression, regression tree
4         Dimension reduction        Principal component analysis, nonnegative matrix factorization, independent component analysis
5         Clustering                 k-means, hierarchical clustering

Besides the four above-mentioned important aspects, there are some other considerations
that help in making a decision to select the appropriate method. These considerations are
sensitivity to outliers, ability to handle missing values, ability to handle nonvector data,
ability to handle class imbalance, efficacy in high dimensions, and accuracy of class prob-
ability estimates. They should also be taken into account while choosing the best data
analysis method. The procedure for selection of appropriate learning technique is further
described in Section 7.4.3.
The methods are classified into two categories: parametric and nonparametric. This
classification is made on the basis of the population under study. Parametric methods
are those for which the population is approximately normal, or can be approximated
using a normal distribution. Parametric methods are commonly used in statistics to model
and analyze interval or ratio scale data for which such distributional assumptions hold.
The methods are generally more interpretable and faster, but often less accurate when
their assumptions are violated. Some of
the parametric methods include LR, linear regression, support vector machine, principal
component analysis, k-means, and so on. In contrast, nonparametric methods are those for
which the data has an unknown distribution and is not normal. Nonparametric meth-
ods are commonly used in statistics to model and analyze ordinal or nominal data with
small sample sizes. The data cannot even be approximated to normal if the sample size
is so small that one cannot apply the central limit theorem. Nowadays, the usage of non-
parametric methods is increasing for a number of reasons. The main reason is that the
researcher is not forced to make any assumptions about the population under study as is
done with a parametric method. Thus, many of the nonparametric methods are easy to
use and understand. These methods are generally simpler, less interpretable, and slower
but more accurate. Some of the nonparametric methods are DT, nearest neighbor, neural
network, random forest, and so on.
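The trade-offs discussed above can be explored empirically. The sketch below, assuming scikit-learn, compares a parametric learner (LR) with a nonparametric one (random forest) under cross-validation on synthetic stand-in data; it illustrates the comparison procedure rather than making any claim about which technique is better.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a metrics data set with a binary fault label
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=1)):
    score = cross_val_score(model, X, y, cv=5).mean()   # mean accuracy over 5 folds
    print(type(model).__name__, round(score, 3))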

Exercises
4.1. What are the different steps that should be followed while conducting experi-
mental design?
4.2. What is the difference between null and alternative hypothesis? What is the
importance of stating the null hypothesis?

4.3. Consider the claim that the average number of LOC in a large-sized software is
at most 1,000 SLOC. Identify the null hypothesis and the alternative hypothesis
for this claim.
4.4. Discuss various experiment design types with examples.
4.5. What is the importance of conducting an extensive literature survey?
4.6. How will you decide which studies to include in a literature survey?
4.7. What is the difference between a systematic literature review, and a more general
literature review?
4.8. What is a research problem? What is the necessity of defining a research problem?
4.9. What are independent and dependent variables? Is there any relationship
between them?
4.10. What are the different data-collection strategies? How do they differ from one
another?
4.11. What are the different types of data that can be collected for empirical research?
Why is access to industrial data difficult?
4.12. Based on what criteria can the researcher select the appropriate data analysis
method?

Further Readings
The book provides a thorough and comprehensive overview of the literature review
process:

A. Fink, Conducting Research Literature Reviews: From the Internet to Paper. 2nd edn.
Sage Publications, London, 2005.

The book provides an excellent text on mathematical statistics:

E. L. Lehmann, and J.P. Romano, Testing Statistical Hypothesis, 3rd edn., Springer,
Berlin, Germany, 2008.

A classic paper provides techniques for collecting valid data that can be used for gathering
more information on development process and assess software methodologies:

V. R. Basili, and D. M. Weiss, “A methodology for collecting valid software engineer-
ing data,” IEEE Transactions on Software Engineering, vol. 10, no. 6, pp. 728–737,
1984.

The following book is a classic example of concepts on experimentation in software
engineering:

V. R. Basili, R. W. Selby, and D. H. Hutchens, “Experimentation in software engineer-
ing,” IEEE Transactions on Software Engineering, vol. 12, no. 7, pp. 733–743, 1986.

A taxonomy of data-collection techniques is given by:

T. C. Lethbridge, S. E. Sim, and J. Singer, “Studying software engineers: Data collec-
tion techniques for software field studies,” Empirical Software Engineering, vol. 10,
pp. 311–341, 2005.

The following paper provides an overview of methods in empirical software engineering:

S. Easterbrook, J. Singer, M.-A. Storey, and D. Damian, “Selecting empirical methods
for software engineering research,” In: F. Shull, J. Singer, and D.I. Sjøberg (eds.),
Guide to Advanced Empirical Software Engineering, Springer, London, 2008.
