research-article

Docode 5

Published: 01 September 2017

Abstract

Plagiarism refers to the appropriation of someone else's ideas and expression. Its ubiquity makes countering it necessary and has prompted the development of commercial detection systems. In this paper we introduce Docode 5, a plagiarism detection system that can analyze documents against the World Wide Web and against user-defined collections, and that can be used as a decision support system. Our contribution is to present the system across its full range of components, from the algorithms it uses to its user interfaces, together with the algorithmic and architectural issues of deploying it at commercial scale. Performance tests of the plagiarism detection algorithm show acceptable performance from both an academic and a commercial point of view, and load tests on the deployed system show that it benefits from a distributed deployment. We conclude that algorithms designed for small-scale plagiarism detection can be adapted to a commercial-scale system.

Published In

Engineering Applications of Artificial Intelligence, Volume 64, Issue C
September 2017
294 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. MapReduce
  2. Plagiarism detection
  3. Software engineering

Qualifiers

  • Research-article
