DOI: 10.1145/2950290.2950321
Research article (Public Access)

Code relatives: detecting similarly behaving software

Published: 01 November 2016

Abstract

Detecting “similar code” is useful for many software engineering tasks. Current tools can help detect code with statically similar syntactic and/or semantic features (code clones) and with dynamically similar functional input/output (simions). Unfortunately, some code fragments that behave similarly at the finer granularity of their execution traces may be ignored. In this paper, we propose the term “code relatives” to refer to code with similar execution behavior. We define code relatives and then present DyCLINK, our approach to detecting code relatives within and across codebases. DyCLINK records instruction-level traces from sample executions, organizes the traces into instruction-level dynamic dependence graphs, and employs our specialized subgraph matching algorithm to efficiently compare the executions of candidate code relatives. In our experiments, DyCLINK analyzed 422+ million prospective subgraph matches in only 43 minutes. We compared DyCLINK to a static code clone detector from the community and to our own implementation of a dynamic simion detector. The results show that DyCLINK effectively detects code relatives with reasonable analysis time.
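The pipeline the abstract describes — record an instruction-level trace, organize it into a dynamic dependence graph, then compare graphs — can be illustrated with a minimal, hypothetical sketch. This is not the authors' code: the trace format, the opcode-labeled edges, and the Jaccard score below are all simplifying assumptions standing in for DyCLINK's specialized subgraph matching.

```python
def build_dep_graph(trace):
    """Build a crude dynamic dependence graph from an execution trace.

    trace: list of (opcode, dest_var, src_vars) tuples, one per executed
    instruction. Returns a set of opcode-labeled dependence edges
    (producer_opcode, consumer_opcode) — instance identity is deliberately
    abstracted away to keep the sketch small.
    """
    last_writer = {}   # variable name -> index of instruction that last wrote it
    ops = []
    edges = set()
    for i, (op, dest, srcs) in enumerate(trace):
        ops.append(op)
        for s in srcs:
            if s in last_writer:  # data dependence: this instruction reads s
                edges.add((ops[last_writer[s]], op))
        if dest is not None:
            last_writer[dest] = i
    return edges

def similarity(g1, g2):
    """Jaccard similarity over labeled dependence edges — a stand-in for
    real (sub)graph matching."""
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)

# Two accumulation loops written differently produce overlapping graphs.
trace_a = [("load", "x", []), ("add", "acc", ["acc", "x"]),
           ("load", "x", []), ("add", "acc", ["acc", "x"])]
trace_b = [("load", "v", []), ("add", "sum", ["sum", "v"])]
print(similarity(build_dep_graph(trace_a), build_dep_graph(trace_b)))  # → 0.5
```

The key point the sketch preserves from the paper's framing is that similarity is judged on how executed instructions depend on one another, not on source text or input/output pairs alone.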


Published In

FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2016, 1156 pages
ISBN: 978-1-4503-4218-6
DOI: 10.1145/2950290

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Code relatives
  2. code clones
  3. link analysis
  4. runtime behavior
  5. subgraph matching

Acceptance Rates

Overall acceptance rate: 17 of 128 submissions (13%)
