A training sample selection method for predicting software defects

Abstract

Software defect prediction (SDP) is an important technique for analyzing software quality and reducing development cost. Data from the software life cycle have been widely used to predict the defect proneness of software modules. Although many machine learning-based SDP models have been proposed, their predictive performance is not always satisfactory. Traditional machine learning-based classifiers usually assume that all samples contribute equally to the training of an SDP model, which is not true: different training samples affect the model's performance differently, and that performance depends heavily on the quality of the training samples. To address this shortcoming, this paper makes the following contributions. (1) Inspired by clustering algorithms, a method for calculating the contribution of each training sample to the SDP model is proposed. Unlike existing methods for calculating sample contribution, it considers not only the relationship between the contributions of the training samples but also the influence of the distance between a sample and the category boundary on the performance of the SDP model. (2) A sample selection (SS) method is proposed to improve the performance of the SDP model. It first calculates the contribution of each training sample from several of the sample's nearest neighbors and their labels, and then performs SS according to the Hoeffding probability inequality and the contribution of each sample. Experimental results are presented to validate the proposed method. Both direct observation and statistical tests of these results show that the SS method effectively improves the predictive performance of the SDP model.
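The two-step idea of the SS method (score each sample from its nearest neighbors and their labels, then filter with a Hoeffding-style bound) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract gives no formulas, so the contribution score (the fraction of the k nearest neighbors sharing the sample's label), the 0.5 decision level, and the names `neighbor_agreement`, `hoeffding_margin`, `select_samples`, `k`, and `delta` are all assumptions made for illustration. The sketch also does not model the distance to the category boundary explicitly, and it treats the neighbor labels as independent draws, which is only a heuristic.

```python
import numpy as np


def neighbor_agreement(X, y, k):
    """Assumed contribution score: share of the k nearest neighbors with the same label."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a sample is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbors
    return (y[nn_idx] == y[:, None]).mean(axis=1)


def hoeffding_margin(k, delta):
    """One-sided Hoeffding deviation for the mean of k values in [0, 1].

    With probability at least 1 - delta, the true mean exceeds the
    empirical mean by no more than this margin.
    """
    return np.sqrt(np.log(1.0 / delta) / (2.0 * k))


def select_samples(X, y, k=10, delta=0.1):
    """Keep a sample unless its label agreement is confidently below 0.5 (assumed rule)."""
    contribution = neighbor_agreement(X, y, k)
    eps = hoeffding_margin(k, delta)
    keep = contribution + eps >= 0.5           # drop only clearly inconsistent samples
    return keep, contribution
```

A classifier would then be trained on `X[keep]` and `y[keep]`. With the default values the margin is about 0.34, so only samples with at most one agreeing neighbor out of ten are discarded; a larger `k` or `delta` makes the filter stricter.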

Keywords

Software defect prediction; Sample contribution; Sample selection; Predictive performance