
A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Published: 31 May 2020

Abstract

Entity Matching (EM) is a core data cleaning task that aims to identify different mentions of the same real-world entity. Active learning is one way to address the scarcity of labeled data in practice: it dynamically collects the examples that need to be labeled by an Oracle and refines the learned model (classifier) on them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to provide concrete guidelines for practitioners on which active learning combinations work well for EM. To this end, we perform comprehensive experiments on publicly available EM datasets from the product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, number of labels, and example selection latency. Our most surprising finding is that active learning with fewer labels can learn a classifier comparable in quality to one learned with supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10× without affecting the quality of the model.
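The loop the abstract describes (select an example, ask the Oracle for its label, refit the classifier) can be illustrated with a minimal sketch. This is not the paper's framework or its API: `ThresholdLearner`, `margin_selector`, `active_learn`, and `f1_score` are hypothetical names, each candidate pair is reduced to a single similarity score, and the Oracle is assumed to be perfect. The selector is kept separate from the learner only to mirror the idea of combining learning algorithms with example selection algorithms.

```python
from typing import Callable, List, Sequence


class ThresholdLearner:
    """Toy learner (assumption): a pair matches when its similarity reaches a threshold."""

    def __init__(self) -> None:
        self.threshold = 0.5

    def fit(self, xs: Sequence[float], ys: Sequence[bool]) -> None:
        matches = [x for x, y in zip(xs, ys) if y]
        non_matches = [x for x, y in zip(xs, ys) if not y]
        if matches and non_matches:
            # Place the boundary halfway between the two labeled classes.
            self.threshold = (min(matches) + max(non_matches)) / 2

    def predict(self, x: float) -> bool:
        return x >= self.threshold


Oracle = Callable[[float], bool]
Selector = Callable[[ThresholdLearner, Sequence[float]], int]


def margin_selector(learner: ThresholdLearner, pool: Sequence[float]) -> int:
    """Return the index of the unlabeled example closest to the decision boundary."""
    return min(range(len(pool)), key=lambda i: abs(pool[i] - learner.threshold))


def active_learn(learner: ThresholdLearner, selector: Selector, oracle: Oracle,
                 pool: Sequence[float], seeds: Sequence[float], budget: int) -> int:
    """Run the select / label / refit loop and return the number of labels used."""
    unlabeled: List[float] = [x for x in pool if x not in seeds]
    xs: List[float] = list(seeds)
    ys: List[bool] = [oracle(x) for x in xs]
    learner.fit(xs, ys)
    for _ in range(budget):
        if not unlabeled:
            break
        x = unlabeled.pop(selector(learner, unlabeled))  # example selection
        xs.append(x)
        ys.append(oracle(x))                             # Oracle provides the label
        learner.fit(xs, ys)                              # refine the classifier
    return len(xs)


def f1_score(learner: ThresholdLearner, xs: Sequence[float], oracle: Oracle) -> float:
    """F1 of the learner's predictions against the Oracle's labels."""
    tp = sum(1 for x in xs if learner.predict(x) and oracle(x))
    fp = sum(1 for x in xs if learner.predict(x) and not oracle(x))
    fn = sum(1 for x in xs if not learner.predict(x) and oracle(x))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def perfect_oracle(x: float) -> bool:
    # Hypothetical ground truth: pairs with similarity of at least 0.623 are matches.
    return x >= 0.623


pool = [i / 200 for i in range(201)]  # synthetic similarity scores, not real EM data
learner = ThresholdLearner()
labels_used = active_learn(learner, margin_selector, perfect_oracle,
                           pool, seeds=[0.0, 1.0], budget=12)
quality = f1_score(learner, pool, perfect_oracle)
print(labels_used, learner.threshold, quality)
```

In this toy setting, margin-based selection behaves like a binary search along the similarity axis, so the boundary is located after labeling only a small fraction of the 201 pairs. Real EM data is multi-dimensional and noisy, so this only illustrates the mechanism, not the results reported in the paper.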



Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. SVM
  2. blocking dimensions
  3. ensembles
  4. entity matching
  5. example selectors
  6. learner-agnostic selectors
  7. learner-aware selectors
  8. margin
  9. neural networks
  10. perfect and noisy oracles
  11. query by committee
  12. random forests
  13. rule-based models
  14. unified active learning

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '20
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
