Empirical evaluation of the performance of data sampling and feature selection techniques for software fault prediction

Published: 01 August 2023

Abstract

Context:

Applying Software Fault Prediction (SFP) early in the software development life cycle to identify faulty classes has attracted considerable research interest. However, very little work in the SFP domain addresses the class imbalance and feature redundancy problems jointly to improve the performance and prediction accuracy of SFP models. The literature survey also reveals a dearth of studies that comprehensively compare different sampling and feature selection strategies in combination.

Objective:

This research presents an extensive assessment of distinct combinations of feature selection and sampling approaches to overcome the problems of class overlap, class imbalance, and feature redundancy. The objective is to determine the combination that yields the most accurate and effective SFP model.

Method:

Accordingly, the study applies 8 sampling techniques together with 10 feature selection algorithms to 56 open-source projects. The comparative analysis covers 5346 variants of the input datasets, with 8 classifiers applied to predict the faulty classes. In addition, the paper presents a detailed assessment of the performance of each technique individually on all the input projects. Accuracy and Area Under the Receiver Operating Characteristic Curve (AUC) are used as the performance metrics to compare the models built with the classification algorithms.
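The two metrics named above can be illustrated with a minimal sketch. It assumes binary labels (1 = faulty, 0 = non-faulty) and a classifier that outputs a fault score per class; the function names, the 0.5 decision threshold, and the sample data are hypothetical and not taken from the paper. AUC is computed here through its pairwise-ranking interpretation: the share of (faulty, non-faulty) pairs in which the faulty class receives the higher score, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of classes whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a faulty class outranks a non-faulty one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one faulty and one non-faulty class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # faulty class correctly ranked higher
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8, 0.3, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Unlike accuracy, AUC does not depend on a decision threshold, which is one reason it is commonly preferred when the faulty class is rare.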

Result:

For each project, we evaluated a total of 792 combinations, produced from 10 feature selection methods plus the all-metrics dataset, 8 sampling methods plus the original unsampled dataset, and 8 classifiers. The empirical results indicate that the combination of Synthetic Minority Over Sampling Technique Edited (SMOTEE) with correlation-based feature selection (FS2) achieved the highest AUC value for 21 of 54 projects (38.89 %). Additionally, the combination of SMOTEE, FS2, and RF attained the highest AUC values for 24.07 % of projects.
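The counts quoted in the abstract can be checked with a short enumeration. The option labels below (FS1 to FS10, S1 to S8, C1 to C8) are placeholders, since the abstract names only FS2, SMOTEE, and RF; the figure of 13 projects behind 24.07 % is inferred from the percentage rather than stated in the text.

```python
from itertools import product

# 10 feature selection methods plus the all-metrics dataset.
feature_options = [f"FS{i}" for i in range(1, 11)] + ["ALL_METRICS"]
# 8 sampling methods plus the original, unsampled dataset.
sampling_options = [f"S{i}" for i in range(1, 9)] + ["ORIGINAL"]
# 8 classifiers (placeholder labels).
classifiers = [f"C{i}" for i in range(1, 9)]

dataset_variants = list(product(feature_options, sampling_options))
combinations_per_project = list(
    product(feature_options, sampling_options, classifiers)
)

print(len(dataset_variants))           # 99 dataset variants per project
print(len(combinations_per_project))   # 792 combinations per project
print(len(dataset_variants) * 54)      # 5346 dataset variants over 54 projects
print(round(100 * 21 / 54, 2))         # 38.89
print(round(100 * 13 / 54, 2))         # 24.07 (13 is inferred, see above)
```

Under this reading, 11 x 9 = 99 dataset variants per project and 99 x 8 = 792 model combinations per project. The total of 5346 dataset variants equals 99 x 54, which agrees with the 54 projects cited in the Result.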

Conclusion:

The results of the statistical analysis reveal that 93.42 % of the pairs of sampling and feature selection combinations differ significantly in performance. The empirical results indicate that the performance of the SFP model is adversely affected by class imbalance and irrelevant features. For more than 75 % of projects, the trained models improved after sampling and feature selection were applied, reaching AUC values between 0.805 and 0.99, compared with models built without feature selection and sampling.
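A figure such as "the share of combination pairs that differ significantly" can be obtained by testing every pair of combinations on their per-project AUC values and counting the pairs below a significance level. The abstract does not name the test used, so the sketch below substitutes an exact two-sided sign test purely for brevity; the combination names, the AUC values, and the 0.05 threshold are all hypothetical.

```python
from itertools import combinations
from math import comb


def sign_test_p(a, b):
    """Two-sided exact sign test p-value for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # ties are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(1 for d in diffs if d > 0)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def share_of_significant_pairs(auc_by_combo, alpha=0.05):
    """Fraction of combination pairs whose paired AUCs differ significantly."""
    pairs = list(combinations(sorted(auc_by_combo), 2))
    significant = sum(
        1
        for left, right in pairs
        if sign_test_p(auc_by_combo[left], auc_by_combo[right]) < alpha
    )
    return significant / len(pairs)


# Hypothetical per-project AUC values for three combinations (8 projects each).
auc_by_combo = {
    "combo_A": [0.90, 0.91, 0.88, 0.93, 0.89, 0.92, 0.90, 0.94],
    "combo_B": [0.80, 0.82, 0.79, 0.85, 0.81, 0.83, 0.80, 0.84],
    "combo_C": [0.81, 0.80, 0.80, 0.84, 0.82, 0.82, 0.81, 0.83],
}

print(share_of_significant_pairs(auc_by_combo))
```

In this toy data, combo_A beats the other two on every project, so two of the three pairs are significant, while combo_B and combo_C split evenly and are not. A rank-based paired test would use the magnitude of the differences as well as their sign.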

Published In

Expert Systems with Applications: An International Journal  Volume 223, Issue C
Aug 2023
1341 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Data sampling
  2. Feature selection techniques
  3. Fault prediction models
  4. Machine learning

Qualifiers

  • Research-article
