research-article

A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Authors:

Venkata Vamsikrishna Meduri,

Prithviraj Sen,

Mohamed Sarwat

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1133 - 1147

https://doi.org/10.1145/3318464.3380597

Published: 31 May 2020

Abstract

Entity Matching (EM) is a core data cleaning task that aims to identify different mentions of the same real-world entity. Active learning is one way to address the scarcity of labeled data in practice: it dynamically collects the examples to be labeled by an Oracle and refines the learned model (classifier) on them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to provide concrete guidelines for practitioners on which active learning combinations work well for EM. To this end, we perform comprehensive experiments on publicly available EM datasets from the product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, number of labels, and example selection latency. Our most surprising result is that active learning with fewer labels can learn a classifier of comparable quality to supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model.
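
The loop described in the abstract (select examples, ask the Oracle for labels, retrain the classifier) can be sketched as follows. This is a minimal illustration and not the paper's framework: the logistic-regression learner, the uncertainty-based example selector, the synthetic two-feature similarity vectors, the rule used as the Oracle, and all function names are assumptions chosen for brevity.

```python
import math
import random


def predict_proba(w, x):
    """Probability that pair x is a match under weights w (bias is w[-1])."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression with batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def f1_score(y_true, y_pred):
    """F1 of the positive (match) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def active_learning(pool, oracle, seed_idx, budget):
    """Pool-based active learning: query the Oracle for one label per round."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    w = train_logreg([pool[i] for i in labeled], list(labeled.values()))
    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Example selection (assumed strategy): the pair the model is least sure about.
        pick = min(unlabeled, key=lambda i: abs(predict_proba(w, pool[i]) - 0.5))
        labeled[pick] = oracle(pool[pick])
        # Refine the classifier on everything labeled so far.
        w = train_logreg([pool[i] for i in labeled], list(labeled.values()))
    return w, labeled


def demo(pool_size=200, budget=20, seed=0):
    """Run the loop on synthetic similarity features (hypothetical data)."""
    rng = random.Random(seed)
    pool = [(rng.random(), rng.random()) for _ in range(pool_size)]
    oracle = lambda x: 1 if x[0] + x[1] > 1.2 else 0  # stand-in for a human labeler
    # Seed with one non-match and one match so the first model sees both classes.
    seeds = [next(i for i, x in enumerate(pool) if oracle(x) == c) for c in (0, 1)]
    w, labeled = active_learning(pool, oracle, seeds, budget)
    preds = [1 if predict_proba(w, x) >= 0.5 else 0 for x in pool]
    return f1_score([oracle(x) for x in pool], preds), labeled, pool, oracle
```

Calling `demo()` returns the F1-score over the pool together with the labeled indices, which covers two of the metrics the abstract names (EM quality and number of labels). Combining a different learner with a different selector amounts to swapping `train_logreg` and the `min(...)` selection line.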

Supplementary Material

MP4 File (3318464.3380597.mp4)

Presentation Video

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN: 9781450367356

DOI: 10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '20: International Conference on Management of Data

Sponsor: SIGMOD

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 42
Total Downloads: 730
Downloads (Last 12 months): 71
Downloads (Last 6 weeks): 5

Reflects downloads up to 04 Feb 2025

Cited By

Zha D, Bhat Z, Lai K, Yang F, Jiang Z, Zhong S, Hu X. (2025). Data-centric Artificial Intelligence: A Survey. ACM Computing Surveys 57:5 (1-42). DOI: 10.1145/3711118. Online publication date: 24-Jan-2025.
https://dl.acm.org/doi/10.1145/3711118
Nie T, Mao H, Liu X, Yu S. (2024). Fine-Grained Tasks for Crowdsourced Entity Resolution. Applied Sciences 15:1 (4). DOI: 10.3390/app15010004. Online publication date: 24-Dec-2024.
https://doi.org/10.3390/app15010004
Shah V, Parashos T, Kumar A. (2024). How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses. Proceedings of the VLDB Endowment 17:6 (1391-1404). DOI: 10.14778/3648160.3648178. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.14778/3648160.3648178
Zhang W, Wang Y, You Z, Li Y, Cao G, Yang Z, Cui B. (2024). NC-ALG: Graph-Based Active Learning Under Noisy Crowd. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2681-2694). DOI: 10.1109/ICDE60146.2024.00210. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00210
Jabrane M, Tabbaa H, Hadri A, Hafidi I. (2024). Enhancing Entity Resolution with a hybrid Active Machine Learning framework. Information Systems 125:C. DOI: 10.1016/j.is.2024.102410. Online publication date: 1-Nov-2024.
https://dl.acm.org/doi/10.1016/j.is.2024.102410
Basile A, Crupi R, Grasso M, Mercanti A, Regoli D, Scarsi S, Yang S, Cosentini A. (2024). Disambiguation of company names via deep recurrent networks. Expert Systems with Applications: An International Journal 238:PC. DOI: 10.1016/j.eswa.2023.122035. Online publication date: 27-Feb-2024.
https://dl.acm.org/doi/10.1016/j.eswa.2023.122035
Mourad J, Hiba T, Yassir R, Imad H. (2024). ERABQS: entity resolution based on active machine learning and balancing query strategy. Journal of Intelligent Information Systems 62:5 (1347-1373). DOI: 10.1007/s10844-024-00853-0. Online publication date: 1-Oct-2024.
https://dl.acm.org/doi/10.1007/s10844-024-00853-0
Côté P, Nikanjam A, Ahmed N, Humeniuk D, Khomh F. (2024). Data cleaning and machine learning: a systematic literature review. Automated Software Engineering 31:2. DOI: 10.1007/s10515-024-00453-w. Online publication date: 11-Jun-2024.
https://doi.org/10.1007/s10515-024-00453-w
Meduri V, Quamar A, Lei C, Qin X, Reinwald B. (2024). Alfa: active learning for graph neural network-based semantic schema alignment. The VLDB Journal — The International Journal on Very Large Data Bases 33:4 (981-1011). DOI: 10.1007/s00778-023-00822-z. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/s00778-023-00822-z
O’Reilly-Morgan D, Tragos E, Duriakova E, Du H, Hurley N, Lawlor A. (2024). Entity Matching with Large Language Models as Weak and Strong Labellers. New Trends in Database and Information Systems (58-67). DOI: 10.1007/978-3-031-70421-5_6. Online publication date: 14-Nov-2024.
https://doi.org/10.1007/978-3-031-70421-5_6
