DOI: 10.1145/2884781.2884877
research-article

SourcererCC: scaling code clone detection to big-code

Published: 14 May 2016

Abstract

Despite a decade of active research, there has been a marked lack of clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall, and precision of SourcererCC, and compare it to four publicly available, state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find that SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.
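
The abstract names three ingredients: token-based comparison, an inverted index for candidate lookup, and a filter based on token ordering. The sketch below illustrates how these can fit together in general; it is not the SourcererCC implementation. The similarity rule (token-bag overlap of at least a fraction `theta` of the larger block), the regular-expression tokenizer, the rarest-first ordering, and all function names are assumptions made for illustration. Only a short prefix of each block's ordered tokens is indexed, which is the standard prefix-filtering idea from similarity joins: two blocks that meet the overlap threshold must share at least one token in their prefixes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    """Turn a code block into a bag (multiset) of tokens. Hypothetical tokenizer."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))


def overlap(a, b):
    """Size of the multiset intersection of two token bags."""
    return sum((a & b).values())


def detect_clones(blocks, theta=0.8):
    """Return pairs of block names whose token overlap is at least
    ceil(theta * size of the larger block). Assumed similarity rule."""
    bags = {name: tokenize(src) for name, src in blocks.items()}

    # Global token frequencies define one consistent ordering (rarest first).
    global_freq = Counter()
    for bag in bags.values():
        global_freq.update(bag)

    index = defaultdict(set)  # token -> names of blocks with it in their prefix
    clones = set()
    for name, bag in bags.items():
        size = sum(bag.values())
        ordered = sorted(bag.elements(), key=lambda t: (global_freq[t], t))
        # Only this prefix is indexed and queried; the rest is never stored.
        prefix = ordered[: size - math.ceil(theta * size) + 1]

        # Query the index for candidates that share a prefix token.
        candidates = set()
        for tok in prefix:
            candidates |= index[tok]

        # Verify each candidate with the full overlap computation.
        for other in candidates:
            other_size = sum(bags[other].values())
            needed = math.ceil(theta * max(size, other_size))
            if overlap(bag, bags[other]) >= needed:
                clones.add(tuple(sorted((name, other))))

        for tok in prefix:
            index[tok].add(name)
    return clones


blocks = {
    "add_v1": "int sum(int a, int b) { return a + b; }",
    "add_v2": "int sum(int x, int b) { return x + b; }",
    "logger": "void log(String msg) { System.out.println(msg); }",
}
print(detect_clones(blocks, theta=0.8))
```

In this toy input, `add_v1` and `add_v2` differ only in one renamed parameter, so they are reported as a near-miss pair at `theta=0.8`, while `logger` never becomes a candidate because its prefix shares no token with the others. Raising `theta` to 1.0 restricts the result to blocks with identical token bags.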

Published In

ICSE '16: Proceedings of the 38th International Conference on Software Engineering
May 2016
1235 pages
ISBN:9781450339001
DOI:10.1145/2884781
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%