research-article

Empirical evaluation of the performance of data sampling and feature selection techniques for software fault prediction

Authors:

Sonika Chandrakant Rathi,

Sanjay Misra,

Ricardo Colomo-Palacios,

R. Adarsh,

Lalita Bhanu Murthy Neti,

Lov Kumar

Volume 223, Issue C

https://doi.org/10.1016/j.eswa.2023.119806

Published: 01 August 2023

Abstract

Context:

The application of Software Fault Prediction (SFP) in the software development life cycle to predict faulty classes at an early stage has attracted the interest of many researchers. Within the SFP domain, however, very little work has addressed the class imbalance and feature redundancy problems jointly to improve the performance and prediction accuracy of SFP models. The literature survey also shows a dearth of studies offering a comprehensive comparative analysis of different sampling and feature selection strategies used together.

Objective:

This research presents an extensive assessment of distinct combinations of feature selection and sampling approaches to overcome the problems of class overlap, class imbalance, and feature redundancy. The objective is to determine the combination that yields the most accurate and effective SFP model.
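As a rough illustration of what combining a sampling step with a feature selection step means, the sketch below balances a toy dataset and then drops a redundant metric. It is not the study's method: random oversampling and a pairwise Pearson-correlation filter are simple stand-ins chosen for brevity, and the function names, the 0.95 threshold, and the toy data are all assumptions.

```python
import random
from math import sqrt


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def drop_redundant_features(rows, threshold=0.95):
    """Return indices of kept columns; a column is dropped when its absolute
    correlation with an already kept column exceeds the (assumed) threshold."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    kept = []
    for j in range(n_features):
        if all(abs(pearson(columns[j], columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


# Hypothetical toy data: three metrics per class; column 1 is exactly 2 * column 0.
rows = [[1, 2, 7], [2, 4, 1], [3, 6, 5], [4, 8, 2], [5, 10, 9], [6, 12, 3]]
labels = [0, 0, 0, 0, 1, 1]  # 1 = faulty (minority class)

balanced_rows, balanced_labels = random_oversample(rows, labels)
kept = drop_redundant_features(balanced_rows)
reduced_rows = [[row[j] for j in kept] for row in balanced_rows]
```

On this toy input the two classes end up with four rows each, and the duplicated metric (column 1) is filtered out. A real experiment would run the sampler on training data only, so that duplicated rows do not leak into the test set.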

Method:

Accordingly, the study applies 8 sampling techniques together with 10 feature selection algorithms to 56 open-source projects. The comparative analysis is performed on 5346 variants of the input datasets, with 8 classifiers applied to predict faulty classes. In addition, the paper presents a detailed assessment of the performance of each technique individually across all the input projects. Accuracy and Area Under the Receiver Operating Characteristic Curve (AUC) are used as performance metrics to compare the models built with the classification algorithms.
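The two metrics named above can be sketched in a few lines. The example below is only an illustration with made-up labels and scores: it computes accuracy at an assumed 0.5 decision threshold, and AUC through its pairwise interpretation (the probability that a randomly chosen faulty class receives a higher score than a randomly chosen non-faulty one, with ties counted as one half).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = faulty) and classifier scores.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.3f} auc={auc:.3f}")  # accuracy=0.833 auc=0.875
```

Unlike accuracy, AUC does not depend on the decision threshold, which makes it less sensitive to the class imbalance discussed in this abstract.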

Result:

For each project, we evaluated a total of 792 combinations, produced from 11 feature sets (10 feature selection methods plus the all-metrics dataset), 9 sampling variants (8 sampling methods plus the original, unsampled dataset), and 8 classifiers. The empirical results indicate that the combination of Synthetic Minority Over Sampling Technique Edited (SMOTEE) with correlation-based feature selection (FS2) achieved the highest AUC value for 21 of 54 projects (38.89 % of projects). Additionally, the combination of SMOTEE, FS2, and RF attained the highest AUC value for 24.07 % of projects.
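The counts reported here can be checked with a short enumeration. In the sketch below the technique names are placeholders, since only FS2, SMOTEE, and RF are named in the abstract. The figure of 13 projects behind the 24.07 % share is an inference from 54 projects, not a number stated in the text. Under the same reading, the 5346 dataset variants quoted in the Method paragraph equal 54 projects times 99 dataset variants per project.

```python
from itertools import product

# Placeholder names: 10 feature selection methods plus the all-metrics dataset.
feature_sets = ["ALL"] + [f"FS{i}" for i in range(1, 11)]
# Placeholder names: 8 sampling methods plus the original, unsampled dataset.
sampling_variants = ["ORIGINAL"] + [f"SAMPLER{i}" for i in range(1, 9)]
# Placeholder names for the 8 classifiers.
classifiers = [f"CLF{i}" for i in range(1, 9)]

combos = list(product(feature_sets, sampling_variants, classifiers))
datasets_per_project = len(feature_sets) * len(sampling_variants)

share_best_pair = round(100 * 21 / 54, 2)    # 21 of 54 projects
share_best_triple = round(100 * 13 / 54, 2)  # 13 of 54 is an inferred count

print(len(combos))                # 792
print(datasets_per_project)       # 99
print(54 * datasets_per_project)  # 5346
print(share_best_pair, share_best_triple)  # 38.89 24.07
```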

Conclusion:

The results of the statistical analysis reveal that 93.42 % of the pairs of sampling and feature selection combinations differ significantly in performance. The empirical results indicate that the performance of the SFP model is adversely impacted by class imbalance and irrelevant features. For more than 75 % of projects, the trained models improved after applying sampling and feature selection strategies, reaching AUC values between 0.805 and 0.99, compared with models built without these techniques.

