DOI: 10.1145/3511430.3511444
research-article

Feature Transformation for Improved Software Bug Detection Models

Published: 24 February 2022

Abstract

Testing is considered one of the most crucial phases of the software development life cycle, and fixing software bugs requires a significant amount of time and effort. A rich body of recent research has explored ways to predict bugs in software artifacts using machine-learning-based techniques. For reliable and trustworthy prediction, it is also crucial to consider the explainability of such machine learning models. In this paper, we show how feature transformation techniques can significantly improve prediction accuracy and build confidence in bug prediction models. We propose a novel approach for improved bug prediction that first extracts the features, then uses a genetic algorithm to find a weighted transformation of these features that best separates bugs from non-bugs when plotted in a low-dimensional space, and finally trains the machine learning model on the transformed dataset. In our experiments with real-life bug datasets, the random forest and k-nearest neighbor classifier models that leveraged feature transformation showed a 4.25% improvement in recall, on average across 8 software systems, compared to models built on the original data.
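The three steps named in the abstract (extract features, search for a weighted transformation with a genetic algorithm, train a classifier on the transformed data) can be illustrated with a small sketch. This is not the authors' implementation. The toy data, the `separation` score (computed directly on the weighted features rather than in a low-dimensional embedding), the genetic algorithm settings, and all function names are assumptions made for illustration only.

```python
import random

def make_toy_data(n=60, seed=1):
    """Hypothetical stand-in for extracted features: 1 informative column, 3 noisy ones."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = bug, 0 = non-bug
        X.append([rng.gauss(2.0 * label, 0.5)] + [rng.gauss(0, 2.0) for _ in range(3)])
        y.append(label)
    return X, y

def transform(X, weights):
    """Scale each feature column by its weight."""
    return [[w * v for w, v in zip(weights, row)] for row in X]

def dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def separation(weights, X, y):
    """Assumed fitness: centroid distance between classes / mean within-class spread."""
    Z = transform(X, weights)
    groups = {0: [], 1: []}
    for row, label in zip(Z, y):
        groups[label].append(row)
    cents = {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}
    within = sum(dist(row, cents[label]) for row, label in zip(Z, y)) / len(Z)
    return dist(cents[0], cents[1]) / (within + 1e-9)

def evolve_weights(X, y, pop_size=20, generations=30, seed=0):
    """Tiny genetic algorithm: elitism, tournament selection, uniform crossover, mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    score = lambda w: separation(w, X, y)
    # Include the untransformed data (all weights 1) so the result is never worse than it.
    pop = [[1.0] * n] + [[rng.random() for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best candidates over unchanged
        while len(nxt) < pop_size:
            a, b = (max(rng.sample(pop, 3), key=score) for _ in range(2))
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1))) if rng.random() < 0.2 else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)

def knn_predict(train_X, train_y, x, k=3):
    """Plain k-nearest-neighbor majority vote on binary labels."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return 1 if 2 * sum(train_y[i] for i in nearest) > k else 0

X, y = make_toy_data()
best = evolve_weights(X, y)
Z = transform(X, best)  # the classifier is then trained on Z instead of X
prediction = knn_predict(Z, y, transform([[2.0, 0.0, 0.0, 0.0]], best)[0])
```

Seeding the population with the all-ones weight vector and keeping the best candidates each generation means the evolved weights can never score below the original, untransformed features under the chosen fitness. Whether that translates into better recall on real bug datasets depends on the fitness function and data, which this sketch does not attempt to reproduce.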


Published In

ISEC '22: Proceedings of the 15th Innovations in Software Engineering Conference
February 2022
235 pages
ISBN:9781450396189
DOI:10.1145/3511430

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. genetic algorithm
  2. machine learning
  3. software bug
  4. t-SNE

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ISEC 2022

Acceptance Rates

Overall Acceptance Rate 76 of 315 submissions, 24%
