DOI:10.1145/3428029.3428065

Preprocessing for Source Code Similarity Detection in Introductory Programming

Published: 22 November 2020

Abstract

It is well documented that some students either work together on programming assessments when required to work individually (collusion) or make unauthorised use of existing code from external sources (plagiarism). One approach used in the detection of these violations of academic integrity is source code similarity detection, the automatic checking of student programs for unduly high levels of similarity. Preprocessing of source code files has the potential to increase the effectiveness, the efficiency, or both, of the source code comparison process. There are many possible steps in the preprocessing, and examination of the literature suggests that these steps are selected and implemented without any empirical evidence as to their value. This paper lists 19 preprocessing steps that have been used in code similarity detection, and assesses the effectiveness and the efficiency of 16 of these steps on data sets of student programs from introductory programming courses. The results should help researchers to decide what preprocessing steps to include when designing source code similarity detection techniques or software. According to the study, identifier removal increases both effectiveness and efficiency. Token renaming and syntax tree linearisation increase effectiveness at a cost of efficiency. Other preprocessing steps are dependent upon characteristics of the data set and should ideally be empirically tested before being applied. The paper should also help alert programming educators to the sorts of disguise that students can apply to copied programs.
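As a rough illustration of two of the steps named in the abstract, the sketch below applies token renaming and identifier removal to Python source using the standard `tokenize` module. It is not the paper's implementation: the function name `preprocess`, the `mode` parameter, the generic `ID` token, and the decision to drop comments and layout tokens are all assumptions made for this example, and the paper's exact definitions of each step may differ.

```python
import io
import keyword
import tokenize

# Token types treated as layout or commentary and dropped (an assumption
# made for this sketch, not a rule taken from the paper).
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def preprocess(source, mode="rename"):
    """Return a simplified token sequence for Python source text.

    mode="rename": every non-keyword name becomes the generic token "ID".
    mode="remove": every non-keyword name is dropped entirely.
    """
    if mode not in ("rename", "remove"):
        raise ValueError("mode must be 'rename' or 'remove'")
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Identifier, including built-in names such as print.
            if mode == "rename":
                tokens.append("ID")
            continue
        tokens.append(tok.string)
    return tokens


original = (
    "def total(xs):\n"
    "    s = 0  # running sum\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
disguised = (
    "def add_all(items):\n"
    "    acc = 0\n"
    "    for it in items:\n"
    "        acc += it\n"
    "    return acc\n"
)

# Renaming identifiers and adding or deleting comments no longer
# distinguishes the two programs after either step.
same_after_rename = preprocess(original) == preprocess(disguised)
same_after_remove = (
    preprocess(original, "remove") == preprocess(disguised, "remove")
)
```

Both steps make the two token sequences identical here, so a comparison that runs after preprocessing would treat the renamed copy as a full match. Removal also yields a shorter sequence than renaming, which gives the comparison less to process.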



Published In

Koli Calling '20: Proceedings of the 20th Koli Calling International Conference on Computing Education Research
November 2020
295 pages
ISBN:9781450389211
DOI:10.1145/3428029
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. collusion
  2. computing education
  3. plagiarism
  4. programming
  5. source code similarity detection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

Koli Calling '20

Acceptance Rates

Overall Acceptance Rate 80 of 182 submissions, 44%
