摘要

With available tools and datasets existing on GitHub ecosystem, researchers have the opportunities to study diverse software engineering problems on a large-scale dataset. However, there are many potential threats when researchers try to directly use large-scale datasets, and one important threat is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blog). For researchers who want to study cooperative behavior of developers or development process of projects, their research samples should not contain private projects and non-development projects. To solve this problem, we first analyzed the weaknesses of the base line methods (i.e., selecting top projects) and extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms, Extended_MLMs for short), and proposed two methods called Enhanced_RFM and Fusion_DL_RFM to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and has the best performance among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., have a bad performance on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM on accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM is improved from 0.690 to 0.810 and the precision of Extended_RFM is improved from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.

  • 单位
    武汉大学

全文