Computer Science (计算机科学), 2019, Vol. 46, Issue 11: 176-180. doi: 10.11896/jsjkx.180901685
HU Meng-yuan (胡梦园)1, HUANG Hong-yun (黄鸿云)2, DING Zuo-hua (丁佐华)3
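Abstract: The goal of software defect prediction is to identify defective modules effectively. Traditional classifiers predict well on class-balanced data, but when the class distribution is imbalanced they tend to favor the majority class, so minority-class modules are easily misclassified. Data in real software defect prediction, however, is usually class-imbalanced. To handle this class imbalance problem, this paper proposes an ensemble model based on improved adaptive class weighting, soft voting, and threshold moving. Without altering the original dataset, the model addresses class imbalance in both the training phase and the decision phase. First, in the class-weight learning phase, the optimal weights of the different classes are obtained through adaptive class-weight learning. Then, in the training phase, three base classifiers are trained with the optimal weights from the previous step and combined by soft ensemble (soft voting). Finally, in the decision phase, a threshold-moving model makes the decision and yields the final predicted class. To demonstrate the effectiveness of the proposed method, experiments use the NASA and Eclipse software defect benchmark datasets, and the method is compared on the same datasets with several recently proposed software defect prediction methods in terms of recall (Pd), false positive rate (Pf), and F1 score (F-measure). The results show that the proposed method improves Pd by 0.09 on average and F-measure by 0.06 on average. The proposed method for handling class imbalance in software defect prediction therefore outperforms the other methods overall and predicts well.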
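The decision-phase ideas in the abstract (soft voting over three base classifiers, then threshold moving) and the three reported metrics can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the probability values, the equal-weight averaging, and both thresholds are hypothetical and are not taken from the paper. The class-weight learning step is omitted because the abstract does not specify it.

```python
# Illustrative sketch only: soft voting followed by threshold moving.
# All names, data, and threshold values are hypothetical, not from the paper.

def soft_vote(prob_lists):
    """Average the defect probabilities predicted by several base classifiers."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


def threshold_move(probs, threshold):
    """Label a module defective (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def evaluate(y_true, y_pred):
    """Return (Pd, Pf, F-measure), treating the defective class (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0          # recall on defective modules
    pf = fp / (fp + tn) if fp + tn else 0.0          # false positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * pd / (precision + pd) if precision + pd else 0.0
    return pd, pf, f1


# Hypothetical ground truth: 3 defective modules out of 8 (imbalanced).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical defect probabilities from three base classifiers.
clf_a = [0.9, 0.4, 0.3, 0.2, 0.1, 0.4, 0.1, 0.2]
clf_b = [0.8, 0.5, 0.4, 0.3, 0.2, 0.3, 0.1, 0.1]
clf_c = [0.7, 0.3, 0.2, 0.1, 0.3, 0.5, 0.1, 0.3]

ensemble_probs = soft_vote([clf_a, clf_b, clf_c])

# Default decision threshold versus a lowered (moved) threshold.
default_pred = threshold_move(ensemble_probs, 0.5)
moved_pred = threshold_move(ensemble_probs, 0.35)

default_metrics = evaluate(y_true, default_pred)
moved_metrics = evaluate(y_true, moved_pred)

print("threshold 0.50: Pd=%.2f Pf=%.2f F=%.2f" % default_metrics)
print("threshold 0.35: Pd=%.2f Pf=%.2f F=%.2f" % moved_metrics)
```

On this toy data, lowering the threshold recovers one more defective module at the cost of one false alarm, which is the trade-off threshold moving is meant to control. How the paper actually selects its threshold is not stated in the abstract.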