chapter

Decision models for record linkage

Authors:

Rohan Baxter

Data Mining: Theory, Methodology, Techniques, and Applications

January 2006

Pages 146 - 160

Published: 01 January 2006

Abstract

Record linkage, the process of identifying record pairs that represent the same real-world entity across multiple databases, is an important step in many data mining applications. In this paper, we address one sub-task of record linkage: assigning an appropriate matching status to each record pair. Techniques for solving this problem are referred to as decision models. Most existing decision models rely on good training data, which is, however, not commonly available in real-world applications. Decision models based on unsupervised machine learning techniques have recently been proposed. We review several existing decision models and then propose an enhancement to cluster-based decision models. Experimental results show that the proposed decision model achieves the same accuracy as existing models while significantly reducing the number of record pairs that require manual (clerical) review. The proposed model also provides a mechanism for trading off accuracy against the number of record pairs sent for clerical review.
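To make the idea of a cluster-based decision model with a tunable review band concrete, here is a minimal sketch. It is not the chapter's algorithm: it assumes each record pair has already been reduced to a single similarity score in [0, 1], uses a plain two-cluster 1-D k-means, and introduces a hypothetical `margin` parameter that controls how many pairs near the cluster boundary are labelled "possible match" and sent for clerical review. The function names and the sample scores are illustrative only.

```python
from typing import List, Tuple


def two_means(scores: List[float], iters: int = 50) -> Tuple[float, float]:
    """Plain 1-D k-means with k=2, seeded at the minimum and maximum score.

    Returns the (low, high) centroids, interpreted here as the centres of
    the non-match and match clusters (an assumption of this sketch).
    """
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        # Assign each score to its nearest centroid (ties go to the low cluster).
        low_group = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high_group = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low_group) / len(low_group)
        new_hi = sum(high_group) / len(high_group) if high_group else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments are stable
        lo, hi = new_lo, new_hi
    return lo, hi


def classify(score: float, lo: float, hi: float, margin: float) -> str:
    """Assign a matching status to one record pair.

    `margin` is a hypothetical knob in [0, 1]: 0 gives a hard two-way split,
    larger values widen the band of pairs sent for clerical review.
    """
    midpoint = (lo + hi) / 2.0
    half_width = (hi - lo) / 2.0
    if abs(score - midpoint) <= margin * half_width:
        return "possible match"
    return "match" if score > midpoint else "non-match"


if __name__ == "__main__":
    # Made-up similarity scores for seven candidate record pairs.
    scores = [0.1, 0.2, 0.15, 0.9, 0.95, 0.85, 0.5]
    lo, hi = two_means(scores)
    for margin in (0.0, 0.3):
        labels = [classify(s, lo, hi, margin) for s in scores]
        print(margin, labels.count("possible match"), "pairs for review")
```

Raising `margin` can only move pairs into the review band, never out of it, which is one simple way to trade automatic decisions against clerical effort.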



Index Terms

Decision models for record linkage
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis


Published In

Data Mining: Theory, Methodology, Techniques, and Applications

January 2006

329 pages

ISBN:3540325476

Editors:
Graham J. Williams (The Australian Taxation Office),
Simeon J. Simoff (School of Computing and Mathematics, University of Western Sydney, Sydney, NSW, Australia)

Publisher

Springer-Verlag

Berlin, Heidelberg
