DOI: 10.1145/3412841.3442020

A comparison of similarity based instance selection methods for cross project defect prediction

Published: 22 April 2021

Abstract

Context: Previous studies have shown that selecting training data instances based on nearest-neighbor (NN) information can improve cross-project defect prediction (CPDP) performance by reducing the heterogeneity of training datasets. However, computing exact neighborhoods is expensive, and approximate methods such as Locality Sensitive Hashing (LSH) can be as effective as exact ones. Aim: We aim to compare instance selection methods for CPDP, namely LSH, NN-Filter, and Genetic Instance Selection (GIS). Method: We conduct experiments with five base learners, optimizing their hyperparameters, on 13 datasets from the PROMISE repository to compare the performance of LSH with the benchmark instance selection methods NN-Filter and GIS. Results: The statistical tests identify six distinct groups for F-measure performance. The top two groups contain only LSH and GIS benchmarks, whereas the bottom two groups contain only NN-Filter variants. LSH and GIS favor recall over precision: for precision, the tests detect only three significantly distinct groups, and the top group comprises NN-Filter variants only. For recall, 16 distinct groups are identified; the top three contain only LSH methods, four of the next six are GIS-only, and the bottom five contain only NN-Filter variants. Finally, NN-Filter benchmarks never outperform their LSH counterparts with the same base learner, tuned or untuned; indeed, the two never even share a rank group, meaning that LSH is always significantly better than NN-Filter under the same learner and settings. Conclusions: The gains in performance, together with the reduced computational overhead and runtime, make LSH a promising approach. However, LSH's performance rests on high recall, and in environments where precision is considered more important, NN-Filter should be considered.
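
The abstract contrasts exact and approximate neighborhood-based instance selection. The sketch below is a minimal, hypothetical Python illustration of the two ideas, not the authors' implementation: nn_filter keeps the union of each test instance's k nearest training instances (the exact NN-Filter idea), while lsh_filter keeps every training instance that shares a random-hyperplane (SimHash-style) hash bucket with at least one test instance. The function names, k=10 default, 16-bit signature, and single hash table are all assumptions for illustration; hyperparameter tuning and the GIS benchmark are omitted.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_filter(train_X, test_X, k=10):
    # Exact NN-Filter: union of each test instance's k nearest
    # training instances (k=10 is an assumed, commonly used setting).
    nn = NearestNeighbors(n_neighbors=k).fit(train_X)
    _, idx = nn.kneighbors(test_X)
    return np.unique(idx.ravel())

def lsh_filter(train_X, test_X, n_bits=16, seed=0):
    # Approximate selection via random-hyperplane LSH: instances that
    # fall on the same side of all n_bits hyperplanes share a bucket.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((train_X.shape[1], n_bits))
    signature = lambda X: [tuple(row) for row in (X @ planes > 0)]
    test_buckets = set(signature(test_X))
    return np.array([i for i, s in enumerate(signature(train_X))
                     if s in test_buckets])

# Usage: filter the training set, then fit any base learner, e.g.
#   keep = lsh_filter(train_X, test_X)
#   model.fit(train_X[keep], train_y[keep])

Hashing is linear in the number of instances, whereas exact neighbor search scales with the product of the training and test set sizes, which is the computational argument for LSH. Note also that the reported metrics are related: with precision P and recall R, the F-measure is the harmonic mean F1 = 2PR / (P + R), so the recall-heavy LSH and GIS can rank highest on F-measure even while NN-Filter leads on precision.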



Published In

SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
March 2021
2075 pages
ISBN: 9781450381048
DOI: 10.1145/3412841

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. approximate near neighbour
  2. cross project defect prediction
  3. instance selection
  4. locality sensitive hashing
  5. search based optimisation

Qualifiers

  • Research-article

Conference

SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
March 22 - 26, 2021
Virtual Event, Republic of Korea

Acceptance Rates

Overall acceptance rate: 1,650 of 6,669 submissions (25%)
