DOI: 10.1145/2950290.2950321
Research article (Public Access)

Code relatives: detecting similarly behaving software

Published: 01 November 2016

Abstract

Detecting “similar code” is useful for many software engineering tasks. Current tools can help detect code with statically similar syntactic and/or semantic features (code clones) and with dynamically similar functional input/output (simions). Unfortunately, some code fragments that behave similarly at the finer granularity of their execution traces may be ignored. In this paper, we propose the term “code relatives” to refer to code with similar execution behavior. We define code relatives and then present DyCLINK, our approach to detecting code relatives within and across codebases. DyCLINK records instruction-level traces from sample executions, organizes the traces into instruction-level dynamic dependence graphs, and employs our specialized subgraph matching algorithm to efficiently compare the executions of candidate code relatives. In our experiments, DyCLINK analyzed 422+ million prospective subgraph matches in only 43 minutes. We compared DyCLINK to a static code clone detector from the community and to our own implementation of a dynamic simion detector. The results show that DyCLINK effectively detects code relatives with reasonable analysis time.
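The pipeline the abstract describes — record an instruction-level trace, organize it into a dynamic dependence graph, then compare graphs — can be illustrated with a minimal, hypothetical sketch. This is not the authors' code: the trace format, the opcode-labeled edges, and the Jaccard score below are all simplifying assumptions standing in for DyCLINK's specialized subgraph matching.

```python
def build_dep_graph(trace):
    """Build a crude dynamic dependence graph from an execution trace.

    trace: list of (opcode, dest_var, src_vars) tuples, one per executed
    instruction. Returns a set of opcode-labeled dependence edges
    (producer_opcode, consumer_opcode) — instance identity is deliberately
    abstracted away to keep the sketch small.
    """
    last_writer = {}   # variable name -> index of instruction that last wrote it
    ops = []
    edges = set()
    for i, (op, dest, srcs) in enumerate(trace):
        ops.append(op)
        for s in srcs:
            if s in last_writer:  # data dependence: this instruction reads s
                edges.add((ops[last_writer[s]], op))
        if dest is not None:
            last_writer[dest] = i
    return edges

def similarity(g1, g2):
    """Jaccard similarity over labeled dependence edges — a stand-in for
    real (sub)graph matching."""
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)

# Two accumulation loops written differently produce overlapping graphs.
trace_a = [("load", "x", []), ("add", "acc", ["acc", "x"]),
           ("load", "x", []), ("add", "acc", ["acc", "x"])]
trace_b = [("load", "v", []), ("add", "sum", ["sum", "v"])]
print(similarity(build_dep_graph(trace_a), build_dep_graph(trace_b)))  # → 0.5
```

The key point the sketch preserves from the paper's framing is that similarity is judged on how executed instructions depend on one another, not on source text or input/output pairs alone.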


Published In

FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2016, 1156 pages
ISBN: 978-1-4503-4218-6
DOI: 10.1145/2950290

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Code relatives
  2. code clones
  3. link analysis
  4. runtime behavior
  5. subgraph matching

Acceptance Rates

Overall acceptance rate: 17 of 128 submissions (13%)
