Efficient similarity joins for near duplicate detection

Authors: Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu

WWW '08: Proceedings of the 17th international conference on World Wide Web

Pages 131 - 140

https://doi.org/10.1145/1367497.1367516

Published: 21 April 2008

Abstract

With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. Experimental results show that our proposed algorithms can achieve up to 2.6x - 5x speed-up over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
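The prefix filtering principle mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm and does not include the new filtering techniques the paper proposes; it only shows the baseline idea under stated assumptions: records are token sets, similarity is Jaccard, and tokens are ordered globally with rare tokens first. The names `jaccard` and `prefix_filter_join` are hypothetical, chosen for illustration. With threshold t, two records can only reach similarity t if the first |x| - ceil(t * |x|) + 1 tokens of each (in the global order) share at least one token, so only those pairs are verified.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Self-join sketch: return sorted (i, j, sim) with i < j and sim >= threshold.

    Assumes 0 < threshold <= 1 and that tokens are mutually comparable.
    """
    sets = [set(rec) for rec in records]

    # Global token order (an assumption of this sketch): rare tokens first,
    # ties broken by the token value so the order is deterministic.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for i, toks in enumerate(ordered):
        if not toks:
            continue
        # Small epsilon guards against float error such as 0.7 * 10 > 7.
        min_overlap = ceil(threshold * len(toks) - 1e-9)
        prefix_len = len(toks) - min_overlap + 1

        # Candidates are earlier records sharing at least one prefix token.
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)

        # Verify each candidate with the exact similarity.
        for j in sorted(candidates):
            sim = jaccard(sets[i], sets[j])
            if sim >= threshold:
                results.append((j, i, sim))
    return sorted(results)


if __name__ == "__main__":
    docs = [
        "efficient similarity joins for near duplicate detection".split(),
        "efficient similarity join for near duplicate detection".split(),
        "scaling up all pairs similarity search".split(),
    ]
    print(prefix_filter_join(docs, 0.7))
```

In this sketch the third record is never compared with the first two, because its prefix shares no token with theirs; the only exact similarity computed is for the first pair. Ordering rare tokens first is a common heuristic for keeping the inverted lists short, but any fixed global order preserves correctness.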

Index Terms

Efficient similarity joins for near duplicate detection
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

WWW '08: Proceedings of the 17th international conference on World Wide Web

April 2008

1326 pages

ISBN:9781605580852

DOI:10.1145/1367497

General Chairs: Jinpeng Huai (Beihang University, China), Robin Chen (AT&T Labs, USA), Hsiao-Wuen Hon (Microsoft Research Asia, China), Yunhao Liu (HK University of Science and Technology, Hong Kong)

Program Chairs: Wei-Ying Ma (Microsoft Research Asia, China), Andrew Tomkins (Yahoo! Research, USA), Xiaodong Zhang (The Ohio State University, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

ACM: Association for Computing Machinery

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '08

Sponsor:

ACM

WWW '08: The 17th International World Wide Web Conference

April 21 - 25, 2008

Beijing, China

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

Total Citations: 269
Total Downloads: 1,294
Downloads (last 12 months): 0
Downloads (last 6 weeks): 0

Reflects downloads up to 06 Nov 2024
Cited By

Jia L, Tang J, Li M, Li R, Ding J, Chen Y (2023). A Trie Based Set Similarity Query Algorithm. Mathematics 11:1 (229). Online publication date: 2-Jan-2023. https://doi.org/10.3390/math11010229
Yang C, Chen L, Wang H, Shang S, Mao R, Zhang X (2023). Dynamic Set Similarity Join: An Update Log Based Approach. IEEE Transactions on Knowledge and Data Engineering 35:4 (3727-3741). Online publication date: 1-Apr-2023. https://doi.org/10.1109/TKDE.2021.3126631
Xiong X (2023). GrassJoin: Distributed Set Similarity Join Based on Graph Partitioning Model. 2023 2nd International Conference on Sensing, Measurement, Communication and Internet of Things Technologies (SMC-IoT) (67-72). Online publication date: 29-Dec-2023. https://doi.org/10.1109/SMC-IoT62253.2023.00020
Widmoser M, Kocher D, Augsten N, Mann W (2023). MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1045-1058). Online publication date: Apr-2023. https://doi.org/10.1109/ICDE55515.2023.00085
Hu B, Zhang J, Zhang L (2022). A plagiarism detection approach for Chinese documents based on semantic textual similarity. Journal of Shenzhen University Science and Engineering 37:Z1 (107-111). Online publication date: 14-Oct-2022. https://doi.org/10.3724/SP.J.1249.2020.99107
Wang Z, Zuo C, Deng D, Ives Z, Bonifati A, El Abbadi A (2022). TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection. Proceedings of the 2022 International Conference on Management of Data (1146-1159). Online publication date: 10-Jun-2022. https://dl.acm.org/doi/10.1145/3514221.3526178
Li L, Jiang X, Gao G (2022). Secure KNN Set Similarity Search in Outsourced Cloud Environments. 2022 IEEE/ACM 7th Symposium on Edge Computing (SEC) (474-479). Online publication date: Dec-2022. https://doi.org/10.1109/SEC54971.2022.00072
Xiao G, Wang J, Lin C, Zaniolo C (2022). Highly Efficient String Similarity Search and Join over Compressed Indexes. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (232-244). Online publication date: May-2022. https://doi.org/10.1109/ICDE53745.2022.00022
Wang Z, Wang S, Li J, Yuan C, Gu R, Huang Y (2022). VSIM. Journal of Parallel and Distributed Computing 158:C (29-46). Online publication date: 22-Apr-2022. https://dl.acm.org/doi/10.1016/j.jpdc.2021.07.009
Padmanabhan M (2022). Rapid medical guideline systems for COVID-19 using database-centric modeling and validation of cyber-physical systems. Cyber-Physical Systems (161-170). Online publication date: 2022. https://doi.org/10.1016/B978-0-12-824557-6.00012-1
