
Proceedings of the 5th WSEAS/IASME Int. Conf. on SYSTEMS THEORY and SCIENTIFIC COMPUTATION, Malta, September 15-17, 2005 (pp289-293)

Software project cost estimation using AI techniques


Rodríguez Montequín, V.; Villanueva Balsera, J.; Alba González, C.; Martínez Huerta, G.
Project Management Area
University of Oviedo
C/Independencia 4, 33004 Oviedo
SPAIN
http://www.api.uniovi.es

Abstract: - Software cost estimation is an important task within projects: it can determine the success or failure
of a project. In order to improve the estimation, it is very important to identify and study the most relevant
factors and variables. This paper describes a method to perform this estimation based on AI techniques and using
Data Mining methodologies.

Key-Words: - Software Cost Estimation, Artificial Intelligence, Data Mining

1 Introduction
Software cost estimation for Information Systems is the process used by an organization to forecast the cost of developing a software project. The estimation of the resources and time needed is very important for all projects, but especially within Information Systems, where budget and schedule are usually exceeded. All estimation methods have to take a software size metric as reference [1][2][3].
Software project estimation presents special difficulties compared with other sectors. The existing methods are highly dependent on the information available about the project. As the project advances, the estimation becomes more accurate because there is more information and it is more reliable. Estimation should therefore be a continuous process that incorporates the new information.
This work establishes a method for software cost estimation based on Artificial Intelligence (AI) techniques, identifying the factors and variables with the most influence on software cost. The work is based on a historical dataset of projects, taking the data provided by the International Software Benchmarking Standards Group (ISBSG), which gathers information from more than 2000 projects. This dataset contains numerical values as well as categorical data, and a high percentage of the values are missing. For this reason, data mining techniques are used to preprocess the information.

2 Problem Formulation
Usually, the process of effort estimation within Information Systems has been called cost estimation, although cost is just the result derived from the estimation of size, effort and schedule. Size estimation is the measurement of the project size, usually in lines of code or equivalent. Since software is a product without physical presence and the main cost is the design and development of the product, the cost is dominated by the cost of the human resources, and this effort is measured in man-months. Finally, schedule estimation determines the amount of time needed to accomplish the estimated effort, considering the organizational restrictions and the parallelism between project tasks. At the end of the process, an economic value for the project cost can be obtained by multiplying the estimated number of man-months by the unit cost. So project estimation is a forecast of the effort expected to develop a project and of the schedule needed to accomplish it.
Because of the complexity and variety of factors influencing the accuracy of effort estimation, analytical models that take every factor into consideration need to be developed.
The basis of software cost estimation was established by Lawrence H. Putnam and Ann Fitzsimmons [4], although the first approaches were carried out during the sixties. The most important advances were made within the big companies of that period. Frank Freiman, from RCA, developed the concept of parametric estimation with his tool named PRICE. Peter Norden, from IBM, developed a model based on fitted curves [5]. During the seventies, the number of software projects and their size increased greatly, and most projects performed during this period failed, so more people focused on project estimation. Using statistical techniques (mainly correlations), researchers studied the factors influencing project effort. In this way the most emblematic model, COCOMO, was developed by Barry W. Boehm [6]. Most of these models consider the effort (E) as the result of an equation of the form:

E = a · S^b                                    (1)

where E is the effort, S is the project size in lines of code, a reflects the productivity and b is a scale economy factor. The result is adjusted with a set of drivers representing the development environment and the project features (15 drivers for the COCOMO model).
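As a minimal sketch of how such a parametric equation can be applied (the values of a, b and the driver multipliers below are illustrative assumptions, not calibrated COCOMO constants):

def estimate_effort(size_kloc, a=3.0, b=1.12, drivers=None):
    """Return the estimated effort in man-months for a project of the
    given size (in thousands of lines of code)."""
    effort = a * size_kloc ** b          # nominal effort from size alone
    for multiplier in (drivers or []):   # each driver scales the effort
        effort *= multiplier
    return effort

# Example: a 50 KLOC project with two (hypothetical) cost drivers.
print(round(estimate_effort(50, drivers=[1.15, 0.9]), 1))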
The main issue with these models (and even with present models) is that they treat the size as a free variable, when the size is unknown until the end of the project; the size must be estimated before the project starts. During this period, Albrecht and Gaffney [7][8] replaced the lines of code with Function Points (FP) as the unit for measuring project size. Function Points measure the size of the software independently of the technology and the language used to code the programs. This involves a change from size-oriented metrics to functionality-oriented metrics, and development productivity is then counted as FP per man-month. During the seventies Putnam [9] developed another popular model, SLIM, based on the Rayleigh curve and adjusted using data from 50 projects.
The eighties were a transition period in which the best methods (like COCOMO and SLIM) were consolidated. Capers Jones [10] improved the Function Points method to consider complex algorithms, and KPMG developed MARK II [11], another improved method for measuring FP.
In the nineties, Boehm developed a new version of COCOMO, named COCOMO 2.0 [12], adapted to the new circumstances of software (object orientation, transactions, software reuse, etc.). Until the nineties, most of the improvement efforts were devoted to disaggregating the components of the models and adjusting their parameters using regressions. Other approaches were also used: for example, rule systems were used by Mukhopadhyay [13], and decision trees by Porter [14][15]. But the results were not satisfactory and the application of these techniques presented some problems. With the explosion of AI techniques at the beginning of the nineties, new approaches were used: Fuzzy Logic, Genetic Algorithms, Neural Networks and so on. The new modelling techniques allow a more suitable selection of variables and the study of more representative datasets. Additionally, these techniques are useful for combining the knowledge of the domain (the information we have about the problem) with the processing of large amounts of data. But this links with another of the existing problems, the lack of reliable datasets. Using these techniques, we can analyze large quantities of information. For this reason, during recent years there have been several attempts to establish repositories with information about software projects. One of these approaches was carried out by the ISBSG [16], whose dataset is used in this work. This repository contains information about:
• Size metrics
• Efforts
• Data quality
• Type and quality of the product: information relative to the development, the platform, the language, the type of application, the organization, the number of defects, etc.
• CASE tools utilization
• Team size and characteristics
• Schedule information
• Effort ratios

Although the existing methods have significantly improved the way in which estimation is performed, they do not reach the required accuracy. The limitations of the existing models derive from the difficulty of quantifying the factors, as well as from the simplifications made in the models; in addition, the datasets used to adjust the models must be representative. Finally, considering the non-linearity of the process and the dependencies on non-quantified parameters, the problem is suitable to be studied within the framework of AI techniques.


3 Techniques and Methodology
Data Mining techniques can help in data analysis, modelling and optimization. The software estimation process is influenced by many variables, so in order to obtain a successful model a work methodology for Data Mining projects must be used. CRISP-DM [1] is one of the most widely used process models: it divides the life cycle of a data mining project into six phases, which interact with each other iteratively during the development of the project.

Fig. 1: Phases of the modelling process of the CRISP-DM methodology.

The first phase, business understanding, is an analysis of the problem; it includes understanding the objectives and requirements of the project from a managerial perspective, in order to turn them into technical objectives and plans.
The second phase, data understanding, is an analysis of the data that includes the initial compilation of information to establish the first contact with the problem, identifying the quality of the information and establishing the most evident relations, which allows the first hypotheses to be formulated.
Once the analysis of the information has been carried out, the methodology establishes that one proceeds to data preparation, in such a way that the data can be treated by the modelling techniques. The preparation of the information includes the general tasks of selecting the data to which the modelling technique is going to be applied (variables and samples), data cleaning, generation of additional variables, integration of different data sources and format changes.
The data preparation phase is closely related to the modelling phase, since the data needs to be processed in different forms depending on the modelling technique that is going to be used. Therefore the preparation and modelling phases interact with each other.
In the modelling phase, the techniques best suited to the specific data mining project are selected. Before proceeding to data modelling, a design of the evaluation method for the models must be established, which allows the confidence degree of the models to be determined. Once these generic tasks have been carried out, one proceeds to the generation and evaluation of the model. The parameters used in the generation of the model depend on the data characteristics.
In the evaluation phase, the model is evaluated with respect to the degree to which it fulfils the success criteria of the problem. If the generated model is valid according to the success criteria established in the first phase, one proceeds to the deployment of the model. Normally, data mining projects do not end with the model deployment: it is necessary to document and present the results in an understandable way in order to achieve an increase of knowledge. In addition, in the deployment phase it is necessary to ensure the maintenance of the application and the possible diffusion of the results [3].


4 Modeling Method
Following the steps of the methodology, the data acquisition is carried out and the data is organized. To begin the analysis, the historical dataset provided by the ISBSG (International Software Benchmarking Standards Group) is taken.
Next, a data exploration and a quality check are performed, for which basic statistical techniques are used to find the data properties. Given that many of the variables are categorical, histograms with the occurrence frequencies are produced.
At this point the data preparation phase starts. This phase has been very costly because there are many missing values, for which the use of diverse techniques to predict or to delete these gaps in the information has been analyzed.

Fig. 2: Percentage of variables with missing values.
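As an illustration of this exploration step, the per-variable missing percentages behind Fig. 2 could be computed as in the following sketch (it assumes the ISBSG data has been loaded into a pandas DataFrame; the file name is hypothetical):

import pandas as pd

# Load the project dataset (hypothetical file name).
df = pd.read_csv("isbsg_projects.csv")

# Percentage of missing values per variable, sorted from worst to best.
missing_pct = df.isna().mean().sort_values(ascending=False) * 100

# Share of variables whose missing percentage exceeds each threshold,
# mirroring the buckets shown in Fig. 2.
for threshold in range(10, 100, 10):
    share = (missing_pct > threshold).mean() * 100
    print(f"variables with more than {threshold}% missing: {share:.0f}%")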
Studies have been carried out to verify whether these missing values have some type of influence on the effort (person-time) needed to carry out the project, which is the target variable that has been identified as the model goal.
Another of the problems is the large presence of categorical variables, which are difficult to process for some modeling methods.

Different techniques have been evaluated for the preprocessing and transformation of these variables. When the number of classes was small (lower than six), the processing of the categorical information created as many variables as classes. For example, if the categorical variable "development platform" contained the values MR, MF and PC, then three variables were created, coded as (1,0,0) if the value of the variable is MR, (0,1,0) if it is MF and (0,0,1) if it is PC.
When the number of classes of a variable was very high (greater than six), the value of the category was transformed directly into a numerical value, as sketched below.
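The following sketch reproduces this encoding rule with pandas (the threshold of six classes and the three-class platform example follow the text; the helper function itself is ours):

import pandas as pd

def encode_categoricals(df, threshold=6):
    """One binary column per class for small domains; integer codes
    for variables with many classes (missing values become -1)."""
    out = df.copy()
    for col in out.select_dtypes(include=["object", "category"]).columns:
        if out[col].nunique() < threshold:
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            out[col] = out[col].astype("category").cat.codes
    return out

# The three-class "platform" variable becomes three binary columns,
# reproducing the (1,0,0)/(0,1,0)/(0,0,1) scheme described above.
print(encode_categoricals(pd.DataFrame({"platform": ["MR", "MF", "PC"]})))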
For the processing of the missing information, robust techniques that can handle this type of data were chosen, such as SOM networks, MARS [2] and MART [4].
For the evaluation of the results, the information has been randomly cut into three separate sets: one containing 75% of the data, used for the construction of the model; 10% for model testing and selection of the best model; and the remaining 15% on which the results have been tested, as sketched below.
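A minimal sketch of this three-way random split (NumPy-based; the dataset size is taken from the roughly 2000 ISBSG projects mentioned above):

import numpy as np

def three_way_split(n_rows, seed=0):
    """Return index arrays for model construction (75%), model
    selection (10%) and final testing (15%)."""
    idx = np.random.default_rng(seed).permutation(n_rows)
    n_train = int(0.75 * n_rows)
    n_valid = int(0.10 * n_rows)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

train_idx, valid_idx, test_idx = three_way_split(2000)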
Once the model has been generated, it can be observed that the variable that contributes the most information to the effort estimation, in this model, is the maximum team size. The Function Points and the value of the adjustment factor are also important. Other important factors are the development platform used and the type of language used in the programming; it is relevant that the missing values themselves give knowledge to the effort estimation model. The model also considers whether an adaptation of the code has been made and whether planning has been used, as well as other variables related to the metric used and the involvement of the resources.
The relative importance of each variable in the model is analyzed; the variables that contribute the most to the model are selected, and they are added in order of importance as long as they improve the results.

Fig. 3: Relative importance of the model parameters.

The previous figure shows the relative importance of the variables of the best model, as commented previously.

Fig. 4: Real effort versus estimated effort.

For the construction of the MARS model the following parameters have been used: interaction between variables at level 3 and basis functions of second degree. The results are shown in the following table.

Absolute error | % success (old) | % success (train) | % success (test)
          396  |       22%       |        24%        |       15%
          792  |       30%       |        45%        |       38%
         1189  |       33%       |        58%        |       52%
         1585  |       41%       |        67%        |       61%
         1981  |       44%       |        73%        |       69%
         2377  |       48%       |        80%        |       72%
         3960  |       56%       |        83%        |       76%

Table 1: Model results.

This is a significant improvement with respect to the old reference model, which is a model based on analogies.
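As a sketch of this evaluation scheme (Table 1's metric is the share of projects estimated within a given absolute error): since MARS is not part of the standard scientific Python stack, a MART-style gradient boosted model, also cited above, stands in as the regressor, and the data below is synthetic, for illustration only.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-ins for the project features and the effort target.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(2000, 10))
y = 2000 + 500 * X[:, 0] + rng.normal(scale=400, size=2000)

# Fit on the first 75% of the rows and evaluate on the rest.
n_train = int(0.75 * len(X))
model = GradientBoostingRegressor().fit(X[:n_train], y[:n_train])
abs_error = np.abs(model.predict(X[n_train:]) - y[n_train:])

# Share of projects estimated within each absolute-error threshold,
# mirroring the rows of Table 1.
for threshold in (396, 792, 1189, 1585, 1981, 2377, 3960):
    print(f"within {threshold}: {(abs_error <= threshold).mean():.0%}")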
5 Conclusion
All project management methodologies include schedule and cost management for any type of project, including software projects.
The system chosen to make the estimations has to have the confidence of the project management and must allow re-adaptation to the changing necessities of the software. The summary of historical data at the end of each project is essential to update the project database, so that the system can fit its parameters to the changing conditions of software.

References:
[1] Fenton, N. and Pfleeger, S.L., Software Metrics: A Rigorous & Practical Approach. PWS Publishing Company, 1997.
[2] Kitchenham, B., Pfleeger, S.L. and Fenton, N.E., "Towards a framework for software measurement validation," IEEE Transactions on Software Engineering, vol. 21, no. 12, 1995, pp. 929-944.
[3] Minguet Melián, J.M. and Hernández Ballesteros, J.F., La Calidad del Software y su Medida. Centro de Estudios Ramón Areces, S.A., 2003.
[4] Putnam, L.H. and Fitzsimmons, A., "Estimating software cost," Datamation, 1979.
[5] Norden, P.V., "Curve fitting for a model of applied research and development scheduling," IBM Journal of Research and Development, vol. 2, no. 3, 1958.
[6] Boehm, B.W., Software Engineering Economics. Prentice-Hall, Inc., 1981.
[7] Albrecht, A.J., "Measuring Application Development Productivity," Proceedings of the Joint SHARE, GUIDE, and IBM Application Development Symposium, Oct. 14-17, 1979.
[8] Albrecht, A.J. and Gaffney, J.E., "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Transactions on Software Engineering, vol. 9, no. 2, November 1983.
[9] Putnam, L.H. and Fitzsimmons, A., "Estimating software cost," Datamation, 1979.
[10] Jones, Capers, "The SPR Feature Point Method," Software Productivity Research, Inc., 1986.
[11] Symons, Charles, "Software Sizing and Estimating: Mk II FPA," Wiley, 1991.
[12] Boehm, B.W., Abts, C., Clark, B. and Devnani-Chulani, S., COCOMO II Model Definition Manual. University of Southern California, 1997.
[13] Mukhopadhyay, T., Vicinanza, S.S. and Prietula, M.J., "Examining the Feasibility of a Case-Based Reasoning Model for Software Effort Estimation," MIS Quarterly, vol. 16, pp. 155-171, June 1992.
[14] Porter, A. and Selby, R., "Empirically Guided Software Development Using Metric-Based Classification Trees," IEEE Software, no. 7, pp. 46-54, 1990.
[15] Porter, A. and Selby, R., "Evaluating Techniques for Generating Metric-Based Classification Trees," Journal of Systems and Software, vol. 12, pp. 209-218, 1990.
[16] ISBSG: International Software Benchmarking Standards Group. http://www.isbsg.org

