ST-TLF: Cross-version defect prediction framework based transfer learning

Zhao, Y., Wang, Y., Zhang, Y., Zhang, D., Gong, Y., & Jin, D. (2022). ST-TLF: Cross-version defect prediction framework based transfer learning. Information and Software Technology, 106939.（SCI 2区）

Context:

Cross-version defect prediction (CVDP) is a practical scenario in which defect prediction models are derived from defect data of historical versions to predict potential defects in the current version. Prior research employed defect data of the latest historical version as the training set using the empirical recommended method, ignoring the concept drift between versions, which undermines the accuracy of CVDP.

Objective:

We customized a Selected Training set and Transfer Learning Framework (ST-TLF) with two objectives: a) to obtain the best training set for the version at hand, proposing an approach to select the training set from the historical data; b) to eliminate the concept drift, designing a transfer strategy for CVDP.

Method:

To evaluate the performance of ST-TLF, we investigated three research problems, covering the generalization of ST-TLF for multiple classifiers, the accuracy of our training set matching methods, and the performance of ST-TLF in CVDP compared against state-of-the-art approaches.

Results:

The results reflect that (a) the eight classifiers we examined are all boosted under our ST-TLF, where SVM improves 49.74% considering MCC, as is similar to others; (b) when performing the best training set matching, the accuracy of the method proposed by us is 82.4%, while the experience recommended method is only 41.2%; (c) comparing the 12 control methods, our ST-TLF (with BayesNet), against the best contrast method P15-NB, improves the average MCC by 18.84%.

Conclusions:

Our framework ST-TLF with various classifiers can work well in CVDP. The training set selection method we proposed can effectively match the best training set for the current version, breaking through the limitation of relying on experience recommendation, which has been ignored in other studies. Also, ST-TLF can efficiently elevate the CVDP performance compared with random forest and 12 control methods.