
Early studies on bug prediction focused on software complexity measured in lines of code. We aim to predict whether a particular file associated with a change is buggy or not. Traditionally, all techniques follow these steps:

 

1)    Training Data Extraction: For each change, label it as buggy or clean by mining a project's revision history and issue tracking system. A buggy change contains one or more bugs, while a clean change contains none.


2)    Feature Extraction: Extract the values of various features from each change. Many different features have been used in past change classification studies.

3)    Model Learning: Build a model by using a classification algorithm based on the labeled changes and their corresponding features.

4)    Model Application: For a new change, extract the values of various features. Input these values to the learned model to predict whether the change is buggy or clean.

 

 

The 14 basic change measures

Name      Description
NS        The number of modified subsystems
ND        The number of modified directories
NF        The number of modified files
Entropy   Distribution of modified code across each file
LA        Lines of code added
LD        Lines of code deleted
LT        Lines of code in a file before the change
FIX       Whether or not the change is a defect fix
NDEV      The number of developers that changed the modified files
AGE       The average time interval between the last and the current change
NUC       The number of unique changes to the modified files
EXP       Developer experience
REXP      Recent developer experience
SEXP      Developer experience on a subsystem

The framework contains two phases: a model building phase and a prediction phase. In the model building phase, we build a classifier from historical changes with known labels using deep learning techniques. In the prediction phase, we use this classifier to predict whether an incoming change is buggy or not.

 

Our framework extracts the 14 basic features described above from a training set of changes. We perform data preprocessing to normalize and resample the data; because clean changes outnumber buggy changes, we apply random undersampling. A deep learning network, namely a Deep Belief Network (DBN), is then used to generate and integrate more advanced features, which are linear combinations of the initial features. Based on the advanced features, we construct a logistic regression model to predict whether a change is buggy or not. A minimal end-to-end sketch follows.
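As an illustration only, the pipeline can be approximated with off-the-shelf components; scikit-learn's BernoulliRBM stands in here for a single DBN layer, and the layer sizes are arbitrary assumptions rather than the settings used in this work:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

# Sketch of the model building phase: normalize, learn deeper features,
# then classify. X_train is an (n_changes, 14) matrix of change measures
# and y_train holds the labels (1 = buggy, 0 = clean).
model = Pipeline([
    ("scale", MinMaxScaler()),                                # normalize to [0, 1]
    ("rbm1", BernoulliRBM(n_components=20, random_state=0)),  # first hidden layer
    ("rbm2", BernoulliRBM(n_components=12, random_state=0)),  # second hidden layer
    ("clf", LogisticRegression(max_iter=1000)),               # buggy/clean classifier
])

# Prediction phase: model.fit(X_train, y_train); model.predict(X_new)

Note that undersampling is applied to the training set before fitting, as described below.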

 

Data preprocessing:

 

As the magnitudes of the 14 features are not of the same order, we perform data normalization on these features. In this approach, the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range, in contrast to standardization, is that we end up with smaller standard deviations, which can suppress the effect of outliers.

Min-max scaling is typically done via the following equation:

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
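A direct NumPy implementation of this scaling, applied per feature column (the small epsilon is an assumption added here to guard against constant columns):

import numpy as np

# Scale each of the 14 change measures to [0, 1] column by column.
def min_max_normalize(X):
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min + 1e-12)  # epsilon avoids division by zero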

 

Random undersampling:

 

Random undersampling randomly removes observations from the majority class until the dataset is balanced. This step is important for defect prediction because it keeps the learned classifier from being biased toward the majority class, which can improve the classifier's performance.
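A minimal sketch of this step, assuming labels are encoded as 1 = buggy and 0 = clean and that clean changes form the majority class:

import numpy as np

# Keep all buggy changes and a random same-sized subset of clean changes.
def random_undersample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    buggy = np.where(y == 1)[0]
    clean = np.where(y == 0)[0]
    kept_clean = rng.choice(clean, size=len(buggy), replace=False)
    idx = np.concatenate([buggy, kept_clean])
    rng.shuffle(idx)  # avoid blocks of a single class
    return X[idx], y[idx]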

 

RBM:

 

A Boltzmann machine is a stochastic recurrent neural network with stochastic binary units and undirected edges between units. Unfortunately, learning for Boltzmann machines is impractical and has scalability issues. As a result, the Restricted Boltzmann Machine (RBM) was introduced [10]; it has one layer of hidden units and no connections between hidden units, which allows for a more efficient learning algorithm [6]. The structure of an RBM is depicted in the following figure:

Figure 3: The structure of RBM

 

 

Given these configurations, probability distributions over hidden and/or visible units are defined in terms of the energy function:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

P(v, h) = \frac{e^{-E(v, h)}}{\sum_{v', h'} e^{-E(v', h')}}

where v and h are the visible and hidden unit vectors, a and b their biases, and w_{ij} the connection weights.

Then, the maximum likelihood learning algorithm can train the network by simply alternating between updating all the hidden units in parallel and all the visible units in parallel. To speed up learning for an RBM, the contrastive divergence (CD) algorithm is used; the general idea is to update all the hidden units in parallel starting from the visible units, reconstruct the visible units from the hidden units, and finally update the hidden units again.
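A generic CD-1 update in NumPy follows this recipe; the variable names and learning rate are illustrative assumptions, not values from the original work:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One contrastive divergence (CD-1) step for an RBM with weight matrix W
# (n_visible x n_hidden), visible biases a, and hidden biases b.
def cd1_step(W, a, b, v0, lr=0.1):
    ph0 = sigmoid(v0 @ W + b)                         # up: hidden probabilities from data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(h0 @ W.T + a)                       # down: reconstruct visible units
    ph1 = sigmoid(pv1 @ W + b)                        # up again: hidden from reconstruction
    # Gradient: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b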

 

Deep Belief Networks:

 

As its name indicates, a Deep Belief Network (DBN) is a multi-layer belief network. Each layer is a Restricted Boltzmann Machine, and the layers are stacked on each other to construct the DBN. The first step of training a DBN is to learn a layer of features from the visible units using the Contrastive Divergence (CD) algorithm [8]. The next step is to treat the activations of the previously trained features as visible units and learn features of features in a second hidden layer. The whole DBN is trained when the learning for the final hidden layer is complete.

Figure 5: Greedy learning for DBN

This simple greedy learning algorithm works for training a DBN because training each RBM layer with the CD algorithm finds a local optimum, and the next stacked RBM layer takes those trained values and again looks for a local optimum. At the end of this procedure, we are likely to reach the global optimum, as each layer is consistently trained toward an optimal value.
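The greedy layer-by-layer procedure can be sketched with scikit-learn's BernoulliRBM as a stand-in for each layer (the layer sizes and training settings below are assumptions for illustration):

from sklearn.neural_network import BernoulliRBM

# Train each RBM on the hidden activations of the layer below it.
def train_dbn(X, layer_sizes=(20, 12)):
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=20, random_state=0)
        data = rbm.fit_transform(data)  # activations feed the next layer
        rbms.append(rbm)
    return rbms  # the stack of trained layers forms the DBN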

 

Classifier:

 

Logistic regression [18] models the relationship between features and labels as a parametric distribution P(y|x), where y refers to the label of a data point (in our case, a change) and x refers to the data point represented as a set of features. The parameters of this distribution are estimated directly from the training data.
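Concretely, with a weight vector w and bias b (notation assumed here for exposition), the model takes the familiar sigmoid form:

P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}}

and a change is predicted buggy when this probability exceeds a chosen threshold, commonly 0.5.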

 

Experiments and Results:

 

We evaluate Deeper on six datasets from six well-known open source projects: Bugzilla, Columba, Eclipse JDT, Eclipse Platform, Mozilla and PostgreSQL. These datasets are also used by Kamei et al. We use 10 times ten-fold cross validation to evaluate the performance of Deeper. We randomly divide the dataset into 10 folds, of which 9 folds are used as the training set and the remaining fold as the testing set. To further reduce bias due to training set selection, we run ten-fold cross validation 10 times and record the average performance. Cross validation is a standard evaluation setting that is widely used in software engineering studies.
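This evaluation loop can be sketched as follows; model and evaluate are placeholders for the pipeline and a metric function, and stratified folds are assumed here to preserve class ratios:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# 10 repetitions of ten-fold cross validation; X and y are NumPy arrays.
def repeated_cv(model, X, y, evaluate, repeats=10, folds=10):
    scores = []
    for r in range(repeats):
        kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=r)
        for train_idx, test_idx in kf.split(X, y):
            model.fit(X[train_idx], y[train_idx])
            scores.append(evaluate(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)  # average over 10 x 10 = 100 runs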

 

Evaluation Metrics:

 

We use two evaluation metrics to evaluate the performance of our approach: cost effectiveness and the F1-score. 1) Cost Effectiveness: Cost effectiveness is often used to evaluate defect prediction approaches [13], [15], [14]. It is measured by computing the percentage of buggy changes found when reviewing a specific percentage of the lines of code. To compute cost effectiveness, given a number of changes, we first sort them according to their likelihood of being buggy. We then simulate reviewing the changes one by one, from the highest ranked change to the lowest, and record the buggy changes found. Using this process we can obtain the percentage of buggy changes found when reviewing different percentages of lines of code (1% to 100%).
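One point on that curve can be computed as sketched below; the 20% review budget is an arbitrary example value:

import numpy as np

# Fraction of buggy changes found when reviewing the most suspicious
# changes first, up to `budget` (a fraction of total lines of code).
def cost_effectiveness(prob_buggy, is_buggy, loc, budget=0.2):
    order = np.argsort(-prob_buggy)               # most likely buggy first
    loc_frac = np.cumsum(loc[order]) / loc.sum()  # cumulative LOC reviewed
    found = np.cumsum(is_buggy[order])            # cumulative buggy changes found
    within = loc_frac <= budget
    return found[within][-1] / is_buggy.sum() if within.any() else 0.0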

 

TABLE III. CONFUSION MATRIX

              Predicted Buggy    Predicted Clean
True Buggy    TP                 FN
True Clean    FP                 TN

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}
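In terms of the confusion matrix above, the computation reduces to a few lines; the counts in the usage example are made up for illustration:

# Precision, recall and F1 from confusion-matrix counts.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 40 TP, 10 FP, 20 FN -> precision 0.8, recall 0.667, F1 ~ 0.727.
print(f1_score(40, 10, 20))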