Authors:
Victor U. Thompson; Christo Panchev and Michael Oakes
Affiliation:
University of Sunderland, United Kingdom
Keyword(s):
Candidate Selection, Information Retrieval, Similarity Measures, Textual Similarity.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems
Abstract:
Many Information Retrieval (IR) and Natural Language Processing (NLP) systems require textual similarity measurement in order to function, and do so with the help of similarity measures. Similarity measures behave differently: some that work well on highly similar texts do not always do as well on highly dissimilar texts. In this paper, we evaluated the performance of eight popular similarity measures on four levels (degrees) of textual similarity using a corpus of plagiarised texts. The evaluation was carried out in the context of candidate selection for plagiarism detection. Performance was measured in terms of recall, and the best-performing similarity measure(s) for each degree of textual similarity was identified. Results from our experiments show that the performances of most of the measures were equal on highly similar texts, with the exception of Euclidean distance and Jensen-Shannon divergence, which performed more poorly. Cosine similarity and the Bhattacharyya coefficient performed best on lightly reviewed texts, while on heavily reviewed texts, cosine similarity and Pearson correlation performed best and next best respectively. Pearson correlation had the best performance on highly dissimilar texts. The results also show the term weighting methods and n-gram document representations that best optimise the performance of each similarity measure on a particular level of intertextual similarity.
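Three of the measures compared above can be sketched over term-frequency vectors as follows. This is a minimal illustrative sketch, not the authors' implementation: the paper's experiments vary the term weighting scheme and n-gram representation, whereas here plain term-frequency vectors over a shared vocabulary are assumed.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two term-frequency vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson_correlation(u, v):
    # Pearson correlation: cosine similarity of the mean-centred vectors.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def bhattacharyya_coefficient(u, v):
    # Bhattacharyya coefficient of the two probability distributions
    # obtained by normalising each vector to sum to 1.
    tu, tv = sum(u), sum(v)
    return sum(math.sqrt((a / tu) * (b / tv)) for a, b in zip(u, v))
```

All three return values in [0, 1] for non-negative inputs (Pearson can be negative in general), so identical documents score 1 under each measure.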