
2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW)

Towards Predicting Source Code Changes Based on Natural Language Processing Models: An Empirical Evaluation

Yuto Kaibe, Hiroyuki Okamura, Tadashi Dohi
Graduate School of Advanced Science and Engineering, Hiroshima University
Higashi-Hiroshima, Japan
[email protected] [email protected]

979-8-3503-1956-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/ISSREW60843.2023.00056

Abstract—In this paper, we investigate the prediction of software code changes using a natural language processing (NLP) model. NLP is one of the most rapidly developing fields in recent years, allowing various tasks related to natural language to be performed using large-scale models. In particular, BERT (bidirectional encoder representations from transformers) is a well-known model for encoding the input sentences of natural language into an appropriate vector space and is used for various classification tasks. In this paper, we use CuBERT (code understanding BERT), which was trained on programming languages as data in the pre-training stage, to perform tasks related to program code. Specifically, we run a regression problem where the output is the number of code changes.

Index Terms—software code changes, prediction, BERT, Code understanding BERT

I. INTRODUCTION

Agile development has gained a lot of traction in software development in recent years; it is the development style in which software is developed in small cycles consisting of build and test phases. In agile development, refactoring is one of the most important activities for developing high-quality software. Refactoring is a technique for restructuring and rewriting software code without changing its interfaces. Therefore, we observe frequent changes in source code in agile development.

On the other hand, the frequency of source code changes is available as a metric to check the health of software development. It is well known that such changes occur frequently in the early phase of software development because user requirements and the related software architectures are not yet stable. It can also be seen that the frequency of source code changes gradually decreases as the software matures. This phenomenon is similar to the growth of software reliability, i.e., the number of detected defects decreases as software testing progresses in traditional waterfall software development. In traditional software reliability engineering, researchers have discussed the trends of defects detected in the testing process with statistical models [1], [2], and have evaluated the quality of software products statistically. This means that even in agile development, we can obtain statistical information about the quality of software products from the empirical data of the number of source code changes.

Popular bug count analyses are fault-prone module prediction and bug prediction. Fault-prone module prediction is the problem of determining the software modules that are expected to contain software bugs, and bug prediction is the problem of estimating the number of software bugs contained in the software. Fault-prone module prediction and bug prediction belong to the discriminant and regression problems in statistics, respectively, whose input variables are given by software metrics. Software metrics are quantitative values that summarize the characteristics of software, such as lines of code and complexity. In practice, however, it is difficult to determine the best software metrics for defect analysis. In addition, there is no good practice on which metrics should be measured to improve prediction accuracy.

Outside of programming languages, the field of natural language processing (NLP) has seen dramatic improvements in machine learning techniques. In particular, BERT (bidirectional encoder representations from transformers) [3] is one of the most successful NLP models in a variety of tasks such as language translation. A key feature of BERT is the technique of embedding, where tokens are numerically embedded in a vector space. This concept allows us to skip the procedure of extracting and selecting good metrics and is expected to be useful for the analysis of programming languages. In fact, Kanade et al. [4] proposed the embedding-based NLP model called CuBERT, which is BERT whose pre-training used program source codes. CuBERT outperformed previous models in five classification tasks and one modification task for source codes. This implies that it is effective to use data collected from the target domain in pre-training. However, the question remains whether this is always effective.

This paper discusses the applicability of NLP models to the task of predicting source code changes, i.e., whether BERT and CuBERT can be applied to predict the number of source code changes directly from information about the source codes. Here we consider the following research questions:

RQ1 How effective is the NLP-based embedding approach for the prediction task?
RQ2 What is the best approach for retraining NLP models?
RQ3 What is the appropriate model for predicting the number of source code changes?

In RQ1 we try to show how effective BERT and CuBERT are at predicting the number of source code changes compared to the conventional metrics-based regression approach. In addition, by comparing BERT and CuBERT, we discuss the effect of pre-training data on the accuracy of the specified task. RQ2 is related to the retraining strategy. In general, there are two phases of training in modern AI models: pre-training and retraining. Pre-training adjusts the model parameters to learn the basic knowledge of the target domain from a large amount of training data. Retraining makes small adjustments to the model parameters so that the pre-trained AI model fits the specified task with a small amount of data. Generally, there are two retraining strategies: fine-tuning and transfer learning. In RQ2, we try to answer which of fine-tuning and transfer learning is better for predicting the number of source code changes.

In RQ3, we discuss the architecture of regression models. Apart from neural-network-based models, the generalized linear model (GLM) is widely used for practical problems. In the context of GLM, models are generally classified based on their link functions and the associated loss functions. In the past literature, the regression task for the number of defects was performed with linear and Poisson regression models as specific models of GLM. Even in the NLP-based models, we can deal with both linear and Poisson regression-type models. RQ3 attempts to reveal the predictive power of these models.

II. RELATED WORKS

The statistical behavior of the number of source code changes is similar to the behavior of the number of defects. Predicting the number of defects is one of the most important tasks in software reliability engineering. As mentioned in the introduction, defect prediction is categorized into fault-prone module detection and defect number estimation, which correspond to discrimination and regression in statistical problems, respectively.

Fault-prone module detection is the process of identifying the software modules that contain software defects in a software project. Once the defective modules are identified, we can reduce the testing effort by focusing software testing on them. Shen et al. [5] discussed module detection in an early paper. Since the 1990s, several researchers [6]–[8] have proposed methods based on software metrics such as the CK metrics [9]. On the other hand, for predicting the number of defects, Khoshgoftaar et al. [10] discussed Poisson regression with software metrics.

III. NLP MODELS

A. BERT

Natural language processing (NLP) has been one of the most attractive AI fields in recent years. In fact, a well-known NLP model called BERT uses deep learning techniques to extract important features from raw text data. The architecture of BERT is based on the transformer [11]. Roughly speaking, BERT is an extension of recurrent neural networks (RNNs), which deal with time series data. An RNN captures the dependencies of time series data in only one direction. BERT represents the bidirectional dependency of time series data with an attention mechanism, which is the most important feature of BERT. Since BERT uses a bidirectional structure, it essentially outputs a fixed-size vector sequence from a fixed-size word sequence, i.e., a sentence.

The BERT pre-training consists of the following two tasks:

• The Masked Language Model (MLM): First, some of the words in an input sentence are replaced by the special token [MASK]. BERT is trained to predict the original words at the masked positions from the context of the sentence. For example, consider the following sentence:
  This morning he had breakfast.
The training data is generated by replacing randomly selected words with [MASK], e.g.,
  This morning he [MASK] breakfast.
BERT is trained to guess the original words only from the words before and after the masked positions. In the MLM task, a selected word is replaced by [MASK] with probability 0.8, replaced by another randomly selected word with probability 0.1, and otherwise kept unchanged with probability 0.1.

• The Next Sentence Prediction (NSP): Consider two sentences separated by the special token [SEP]. BERT predicts whether or not there is a semantic connection between the two sentences from their context. For example, consider the following sentences:
  [CLS] I [MASK] to a bookstore. [SEP] There I bought three books.
There is a semantic connection between the two sentences above. On the other hand, it is clear that there is no semantic connection between the following two sentences:
  [CLS] I [MASK] to a bookstore. [SEP] People are mammals.
BERT learns to predict the semantic connection between them as a discriminant problem. The training data is organized so that 50% are positive examples, i.e., the two sentences have a semantic connection, and the remaining 50% are negative examples, i.e., there is no semantic connection. Here, [CLS] is the special token that marks the beginning of the sentence.

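The 80/10/10 replacement rule of the MLM task can be sketched in a few lines. The following is an illustrative sketch in Python, assuming whitespace tokenization and a small stand-in vocabulary (`VOCAB`) rather than BERT's actual WordPiece tokenizer; the function and variable names are ours, not from the BERT codebase.

```python
import random

# Toy stand-in vocabulary for the 10% random-replacement case;
# BERT's real WordPiece vocabulary has roughly 30,000 entries.
VOCAB = ["this", "morning", "he", "had", "breakfast", "bookstore", "books"]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Generate one MLM training example from a list of tokens.

    Each token is selected for prediction with probability `mask_prob`.
    A selected token is replaced by [MASK] with probability 0.8, by a
    random vocabulary word with probability 0.1, and kept unchanged
    with probability 0.1. Returns (corrupted tokens, labels), where a
    label is the original token at positions the model must recover
    and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

For instance, `mask_tokens("this morning he had breakfast".split())` returns a corrupted token list of the same length together with the labels the model must recover at the selected positions.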
B. CuBERT

CuBERT (code-understanding BERT) [4] is a natural language processing model based on BERT and oriented towards the processing of programming languages. As mentioned above, the pre-training tasks of BERT are the MLM and NSP tasks on a large natural language corpus. CuBERT, in contrast, is pre-trained on a large source code corpus. Although the only difference between BERT and CuBERT is the training data, CuBERT outperformed other machine learning models, including BERT, on five programming language classification tasks and one program modification task [4].

IV. EXPERIMENT I

To provide answers to RQ1 through RQ3, we conduct an experiment on predicting the number of changes for software projects. We collect Java programs from GitHub as retraining data. In the Java language style, one file contains one Java class, and each class has several methods. The collected data includes 74 classes and a total of 200 methods. Since the input of BERT and CuBERT is one method, we divide the 200 methods into 150 methods as training data and 50 methods as validation data. The number of source code changes for each method is also collected from GitHub.

Fig. 1. The architecture of the model.

To accomplish the prediction task, we use the network architecture shown in Figure 1. BERT or CuBERT encodes the source codes into vector sequences. The following part corresponds to the regression model, which predicts the number of source code changes from the vector sequences as inputs. Furthermore, since the number of source code changes may depend on the lines of code of the method itself, the prediction task is to estimate the number of source code changes per token, i.e., the number of source code changes divided by the number of tokens of the method.

We also consider two regression models as the last part of the network. Basically, the regression model consists of two linear (dense) layers, and the output of the last layer is a scalar value. In the case of linear regression, the output directly represents the number of source code changes per token, and the loss function is the squared error between the output and the data. In the case of Poisson regression, on the other hand, the output represents the logarithm of the predicted number of source code changes, and the loss function is the negative log-likelihood of the Poisson distribution. Based on this architecture, fine-tuning adjusts all layers, consisting of BERT or CuBERT and the regression model, to minimize the defined loss function. Transfer learning adjusts only the parameters of the regression model. In both cases, learning (optimization) is performed by Adam with 10 and 20 epochs.

Table I shows the results of the prediction task. In the table, the NLP and Regression columns indicate the models used for NLP and regression, respectively. The Retraining and Epoch columns show the retraining strategies and the number of iterations for updating the parameters with mini-batch data. The MPSE column shows the mean squared prediction error between the predicted number of source code changes per token and the validation data. Table II shows the results of the prediction task for the metrics-based model. In this experiment, we collect the number of tokens, the number of if-statements, and the number of loops as metrics. The regression model, i.e., the last part of the model, is used to obtain the number of source code changes per token from these metrics.

RQ1: How effective is the NLP-based embedding approach for the prediction task?

Comparing Tables I and II, we see that the best (smallest) MPSE of the NLP-based approach is 0.044, while the best MPSE of the metrics-based model is 0.921. Also, in many cases, the MPSEs of the NLP-based approach tend to be smaller than those of the metrics-based model. In particular, CuBERT outperforms BERT in terms of MPSE. The reason why CuBERT is better than the others may be that CuBERT was trained on source codes in the pre-training phase. Since the architectures of BERT and CuBERT are the same, this implies that it is important for the pre-training data to be collected from the domain related to the subsequent tasks.

RQ2: What is the best retraining approach for NLP models?

In this experiment, we applied two retraining strategies: fine-tuning and transfer learning. In the case of BERT, the minimum MPSEs of fine-tuning and transfer learning are 2.063 and 0.749, respectively. Similarly, in the case of CuBERT, the MPSE of fine-tuning is 0.044 and that of transfer learning is 0.907. That is, transfer learning is superior to fine-tuning in the case of BERT, while we get the opposite result in the case of CuBERT. This may be because BERT is not suitable for understanding source codes, since it was pre-trained on a natural language corpus. As mentioned before, fine-tuning adjusts all model parameters, including those of BERT. Since BERT is not fully trained for this domain in the pre-training stage, the parameters of BERT are also significantly changed in the fine-tuning stage; however, the data size for fine-tuning is not sufficient to tune BERT. On the other hand, CuBERT can be fully trained with a source code corpus in the pre-training stage. Then the parameters of CuBERT are hardly changed in the fine-tuning stage, and fine-tuning focuses on updating the parameters of the regression model so that it fits the given specific task.

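The operational difference between the two retraining strategies compared above is simply which parameters receive gradient updates: fine-tuning updates the encoder (BERT or CuBERT) together with the regression head, while transfer learning freezes the encoder and updates only the head. The toy plain-Python sketch below is our illustration of that distinction; real retraining would update tensors inside a deep learning framework with Adam rather than scalar lists with plain gradient descent.

```python
def retrain_step(encoder_params, head_params, enc_grads, head_grads,
                 strategy, lr=0.5):
    """One gradient-descent step under a given retraining strategy.

    "fine-tuning": update encoder AND regression-head parameters.
    "transfer":    update ONLY the head; the pre-trained encoder stays frozen.
    """
    if strategy == "fine-tuning":
        encoder_params = [p - lr * g for p, g in zip(encoder_params, enc_grads)]
    elif strategy != "transfer":
        raise ValueError(f"unknown strategy: {strategy}")
    head_params = [p - lr * g for p, g in zip(head_params, head_grads)]
    return encoder_params, head_params

# Under transfer learning the encoder parameters are left untouched:
enc, head = retrain_step([1.0], [1.0], [0.5], [0.5], "transfer")
print(enc, head)  # -> [1.0] [0.75]
```

With a small retraining set, freezing the encoder limits how far the model can drift from its pre-trained state, which matches the paper's observation that transfer learning helps BERT while fine-tuning helps the already domain-matched CuBERT.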
TABLE I
RESULTS OF PREDICTION TASK WITH NLP MODELS.

NLP     Retraining         Regression  Epoch      MPSE
BERT    Fine-Tuning        Linear      10 epochs  2.063
BERT    Fine-Tuning        Linear      20 epochs  2.175
BERT    Fine-Tuning        Poisson     10 epochs  2.428
BERT    Fine-Tuning        Poisson     20 epochs  2.247
BERT    Transfer Learning  Linear      10 epochs  0.918
BERT    Transfer Learning  Linear      20 epochs  0.749
BERT    Transfer Learning  Poisson     10 epochs  1.643
BERT    Transfer Learning  Poisson     20 epochs  0.873
CuBERT  Fine-Tuning        Linear      10 epochs  2.833
CuBERT  Fine-Tuning        Linear      20 epochs  0.617
CuBERT  Fine-Tuning        Poisson     10 epochs  0.611
CuBERT  Fine-Tuning        Poisson     20 epochs  0.044
CuBERT  Transfer Learning  Linear      10 epochs  0.907
CuBERT  Transfer Learning  Linear      20 epochs  0.888
CuBERT  Transfer Learning  Poisson     10 epochs  0.866
CuBERT  Transfer Learning  Poisson     20 epochs  0.821

TABLE II
RESULTS OF PREDICTION TASK WITH METRICS-BASED MODEL.

Regression  Epoch      MPSE
Linear      10 epochs  0.958
Linear      20 epochs  1.159
Poisson     10 epochs  0.921
Poisson     20 epochs  1.124

RQ3: What is the appropriate model to predict the number of source code changes?

Comparing the MPSEs of the linear and Poisson regression models in Tables I and II, the linear regression model is better in the case of BERT, while in the case of CuBERT the Poisson regression model outperforms the linear model. This is also due to the fact that BERT is not suitable for dealing with source code.

We summarize the answers to the research questions. In the experiment, the best approach is the NLP-based approach with CuBERT, fine-tuning, and the Poisson regression model. In this case, the MPSE takes the lowest value of 0.044.

V. CONCLUSION

In this paper, we have discussed an NLP-based approach to predict the number of source code changes. In particular, we have dealt with two NLP models, BERT and CuBERT, and have compared them in the experiment. As a result, CuBERT outperformed BERT because CuBERT was pre-trained with a corpus of source codes. This implies that pre-training is important and that the pre-training data should be appropriate for the subsequent specific tasks. We have also compared the retraining strategies, fine-tuning and transfer learning. In the context of NLP models, fine-tuning is effective for adjusting the model parameters when the NLP model is fully trained in the pre-training stage. On the other hand, transfer learning works when the NLP model is not fully trained. The same tendency can be found in the difference between the linear and Poisson regression models. As a result, the NLP model with CuBERT is effective under fine-tuning and the Poisson regression model.

In this paper, we have used CuBERT as an NLP model for understanding source code. On the other hand, several NLP models dealing with source code have appeared in recent research [12]. Therefore, we will conduct an experiment using such recent NLP models. Furthermore, in this paper, we treat the number of source code changes as static values, but they change with the elapsed time. In other words, we should consider a model that deals with time series data.

REFERENCES

[1] M. Xie, Software Reliability Modelling. Singapore: World Scientific, 1991.
[2] M. R. Lyu, Ed., Handbook of Software Reliability Engineering. New York: McGraw-Hill, 1996.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
[4] A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi, "Learning and evaluating contextual embedding of source code," in Proceedings of the 37th International Conference on Machine Learning, ser. ICML'20. JMLR.org, 2020.
[5] V. Shen, T.-J. Yu, S. Thebaut, and L. Paulsen, "Identifying error-prone software—an empirical study," IEEE Transactions on Software Engineering, vol. SE-11, no. 4, pp. 317–324, 1985.
[6] V. Basili, L. Briand, and W. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.
[7] L. C. Briand, J. Wüst, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.
[8] Y. Zhou and H. Leung, "Empirical analysis of object-oriented design metrics for predicting high and low severity faults," IEEE Transactions on Software Engineering, vol. 32, no. 10, pp. 771–789, 2006.
[9] S. Chidamber and C. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.
[10] T. Khoshgoftaar, K. Gao, and R. Szabo, "An application of zero-inflated Poisson regression for software fault prediction," in Proceedings 12th International Symposium on Software Reliability Engineering, 2001, pp. 66–73.
[11] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit, "Tensor2Tensor for neural machine translation," in Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track). Boston, MA: Association for Machine Translation in the Americas, Mar. 2018, pp. 193–199.
[12] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. B. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, "GraphCodeBERT: Pre-training code representations with data flow," in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net, 2021.

