Efficient parallel set-similarity joins using MapReduce

Authors:

Michael J. Carey,

Chen Li

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 495 - 506

https://doi.org/10.1145/1807167.1807222

Published: 06 June 2010

Abstract

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
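
To make the notion of a set-similarity self-join concrete, here is a minimal single-machine sketch in Python. It is not the 3-stage algorithm of the paper, whose details the abstract does not give. It assumes Jaccard similarity as the set-similarity condition and uses prefix filtering as a stand-in for routing records to groups, so the "map" and "reduce" steps are only imitated with a dictionary. The function names, the record layout, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_self_join(records, threshold):
    """Return {(rid1, rid2): sim} for all pairs with Jaccard >= threshold.

    `records` maps a record id to a set of tokens; `threshold` is in (0, 1].
    Illustrative sketch only: names and layout are assumptions.
    """
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in toks)

    def rank(tok):
        return (freq[tok], tok)

    # "Map" step: send each record to one group per token in its prefix.
    # Two sets with Jaccard >= threshold must share a prefix token.
    groups = defaultdict(list)
    for rid, toks in records.items():
        ranked = sorted(toks, key=rank)
        # Small epsilon guards against floating-point error in the product.
        prefix_len = len(ranked) - ceil(threshold * len(ranked) - 1e-9) + 1
        for tok in ranked[:prefix_len]:
            groups[tok].append(rid)

    # "Reduce" step: verify the candidate pairs inside each group.
    results = {}
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            if (r1, r2) not in results:
                sim = jaccard(records[r1], records[r2])
                if sim >= threshold:
                    results[(r1, r2)] = sim
    return results


if __name__ == "__main__":
    sample = {
        "r1": {"efficient", "parallel", "set", "similarity", "joins"},
        "r2": {"efficient", "set", "similarity", "joins"},
        "r3": {"spatial", "joins", "mapreduce"},
    }
    print(similarity_self_join(sample, 0.6))  # {('r1', 'r2'): 0.8}
```

Grouping by prefix token means a record is replicated once per prefix token rather than once per token, which is one common way to limit replication; whether the paper partitions data this way is not stated in the abstract.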

Index Terms

Efficient parallel set-similarity joins using MapReduce
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN:9781450300322

DOI:10.1145/1807167

General Chair: Ahmed Elmagarmid, Purdue University, USA

Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Results Reproduced / v1.1

Qualifiers

Research-article

Conference

SIGMOD/PODS '10

Sponsor:

SIGMOD

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 345
Total Downloads: 2,522
Downloads (Last 12 months): 34
Downloads (Last 6 weeks): 1

Reflects downloads up to 14 Sep 2024

Cited By

Elmougy Y, Hayashi A, Sarkar V (2024). Asynchronous Distributed Actor-Based Approach to Jaccard Similarity for Genome Comparisons. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), pages 1-11. Online publication date: May-2024
https://doi.org/10.23919/ISC.2024.10528922
Sevim A, Eldawy A, Carman E, Carey M, Tsotras V (2024). FUDJ: Flexible User-Defined Distributed Joins. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 4194-4207. Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00320
Neuhof F, Fisichella M, Papadakis G, Nikoletos K, Augsten N, Nejdl W, Koubarakis M (2024). Open benchmark for filtering techniques in entity resolution. The VLDB Journal, 33:5, pages 1671-1696. Online publication date: 9-Jul-2024
https://doi.org/10.1007/s00778-024-00868-7
Rozinek O, Borkovcova M, Mares J (2024). Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data. Good Practices and New Perspectives in Information Systems and Technologies, pages 181-191. Online publication date: 16-May-2024
https://doi.org/10.1007/978-3-031-60328-0_18
Jia L, Tang J, Li M, Li R, Ding J, Chen Y (2023). A Trie Based Set Similarity Query Algorithm. Mathematics, 11:1, article 229. Online publication date: 2-Jan-2023
https://doi.org/10.3390/math11010229
Siachamis G, Psarakis K, Fragkoulis M, Papapetrou O, van Deursen A, Katsifodimos A, Kemme B, Riviere E, Schiavoni V, Pasin M (2023). Adaptive Distributed Streaming Similarity Joins. Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems, pages 25-36. Online publication date: 27-Jun-2023
https://dl.acm.org/doi/10.1145/3583678.3596891
Yang C, Chen L, Wang H, Shang S, Mao R, Zhang X (2023). Dynamic Set Similarity Join: An Update Log Based Approach. IEEE Transactions on Knowledge and Data Engineering, 35:4, pages 3727-3741. Online publication date: 1-Apr-2023
https://doi.org/10.1109/TKDE.2021.3126631
Xiong X (2023). GrassJoin: Distributed Set Similarity Join Based on Graph Partitioning Model. 2023 2nd International Conference on Sensing, Measurement, Communication and Internet of Things Technologies (SMC-IoT), pages 67-72. Online publication date: 29-Dec-2023
https://doi.org/10.1109/SMC-IoT62253.2023.00020
Widmoser M, Kocher D, Augsten N, Mann W (2023). MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 1045-1058. Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00085
Uhartegaray R, D'Orazio L, Damigos M, Kalogeros E (2023). Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data. 2023 IEEE International Conference on Fuzzy Systems (FUZZ), pages 01-06. Online publication date: 13-Aug-2023
https://doi.org/10.1109/FUZZ52849.2023.10309759
