CASMS: Combining clustering with attention semantic model for identifying security bug reports

Authors: Ma, Xiaoxue; Keung, Jacky; Yang, Zhen; Yu, Xiao*; Li, Yishu; Zhang, Hao
Source: INFORMATION AND SOFTWARE TECHNOLOGY, 2022, 147: 106906.
DOI: 10.1016/j.infsof.2022.106906

Abstract

Context: Inappropriate public disclosure of security bug reports (SBRs) is likely to attract malicious attackers to invade software systems; hence, being able to detect SBRs has become increasingly important for software maintenance. Existing techniques for identifying SBRs remain less than desirable because of the class imbalance problem (non-security bug reports, NSBRs, outnumber SBRs), insufficient training information, and weak performance robustness.

Objective: This prompted us to overcome the challenges of the most advanced SBR detection methods.

Method: In this work, we propose the CASMS approach to efficiently alleviate the imbalance problem and predict bug reports. CASMS first converts bug reports into weighted word embeddings based on the TF-IDF and word2vec techniques. Unlike previous studies, which select the NSBRs that are most dissimilar to SBRs, CASMS then automatically finds a certain number of diverse NSBRs via the Elbow method and the k-means clustering algorithm. Finally, the selected NSBRs and all SBRs are used to train an effective Attention CNN-BLSTM model that extracts contextual and sequential information.

Results: The experimental results show that CASMS is superior to the three baselines (i.e., FARSEC, SMOTUNED, and LTRWES) in overall performance (g-measure) and in correctly identifying SBRs (recall), with improvements of 4.09%-24.26% and 10.33%-36.24%, respectively. The best results are easily obtained within a limited range of two-class training set ratios (1:1 to 3:1), with around 20 experiments for each project. Evaluating robustness via the standard deviation indicator shows that CASMS is more stable than LTRWES.

Conclusion: Overall, CASMS can alleviate the data imbalance problem and extract more semantic information to improve performance and robustness. Therefore, CASMS is recommended as a practical approach for identifying SBRs.
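The first step of the Method, turning a bug report into a TF-IDF-weighted word embedding, can be illustrated with a minimal sketch. The abstract does not state the exact weighting formula, so the smoothed IDF variant, the normalization by total weight, the function name `tfidf_weighted_embedding`, and the toy 2-dimensional vectors (standing in for trained word2vec output) are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_weighted_embedding(docs, word_vectors):
    """Return one vector per tokenized document: the TF-IDF-weighted
    mean of its word vectors. The weighting scheme is an assumption."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    dim = len(next(iter(word_vectors.values())))
    embeddings = []
    for doc in docs:
        tf = Counter(doc)
        vec = [0.0] * dim
        total_weight = 0.0
        for term, count in tf.items():
            if term not in word_vectors:
                continue  # skip out-of-vocabulary terms
            # Smoothed IDF (assumed variant) keeps every weight positive.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            weight = (count / len(doc)) * idf
            total_weight += weight
            for i, x in enumerate(word_vectors[term]):
                vec[i] += weight * x
        if total_weight > 0:
            vec = [v / total_weight for v in vec]
        embeddings.append(vec)
    return embeddings


# Toy 2-d vectors standing in for trained word2vec output (hypothetical values).
toy_vectors = {
    "buffer": [1.0, 0.0],
    "overflow": [0.0, 1.0],
    "button": [-1.0, 0.0],
    "crash": [0.5, 0.5],
}
reports = [
    ["buffer", "overflow", "crash"],  # hypothetical security-flavoured report
    ["button", "crash"],              # hypothetical non-security report
]
report_vectors = tfidf_weighted_embedding(reports, toy_vectors)
```

In this sketch, terms that occur in fewer reports (such as "buffer") pull the report vector more strongly than terms shared across reports (such as "crash"). Vectors of this kind would then be the input to the clustering step that selects diverse NSBRs.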

  • Affiliation
    Wuhan University of Technology