DOI: 10.5555/2124128.2124142

Decision models for record linkage

Published: 01 January 2006

Abstract

The process of identifying record pairs that represent the same real-world entity across multiple databases, commonly known as record linkage, is an important step in many data mining applications. In this paper, we address one sub-task of record linkage: assigning an appropriate matching status to each record pair. Techniques for solving this problem are referred to as decision models. Most existing decision models rely on good training data, which is, however, not commonly available in real-world applications. Decision models based on unsupervised machine learning techniques have recently been proposed. We review several existing decision models and then propose an enhancement to cluster-based decision models. Experimental results show that the proposed decision model achieves the same accuracy as existing models while significantly reducing the number of record pairs that require manual review. The proposed model also provides a mechanism to trade off accuracy against the number of record pairs that require clerical review.
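
To make the idea of a matching status concrete, the following is a minimal sketch of a generic two-threshold decision rule, not the cluster-based model proposed in the chapter. It assumes each record pair has already been reduced to a single similarity score in [0, 1]; the status labels, function names, thresholds, and scores are all hypothetical and chosen only for illustration. Pairs that fall between the two thresholds are left for clerical review, so narrowing that band reduces the review workload, possibly at the cost of more automatic misclassifications.

```python
from typing import Dict, List

# Hypothetical status labels; the chapter's own terminology may differ.
MATCH, REVIEW, NON_MATCH = "match", "possible match", "non-match"


def assign_status(score: float, lower: float, upper: float) -> str:
    """Map a similarity score in [0, 1] to one of three matching statuses."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return MATCH
    if score <= lower:
        return NON_MATCH
    return REVIEW  # left for clerical (manual) review


def classify_pairs(scores: List[float], lower: float, upper: float) -> Dict[str, int]:
    """Count how many record pairs receive each matching status."""
    counts = {MATCH: 0, REVIEW: 0, NON_MATCH: 0}
    for score in scores:
        counts[assign_status(score, lower, upper)] += 1
    return counts


# Made-up similarity scores for six record pairs (illustration only).
example_scores = [0.95, 0.82, 0.55, 0.48, 0.20, 0.05]

# A wide band between the thresholds sends more pairs to clerical review.
wide_band = classify_pairs(example_scores, lower=0.3, upper=0.9)

# A narrow band decides more pairs automatically.
narrow_band = classify_pairs(example_scores, lower=0.5, upper=0.8)
```

With these made-up scores, the wide band leaves three of the six pairs for review and the narrow band leaves one. Whether the extra automatic decisions are correct depends on the data, which is the trade-off the abstract refers to.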

Published In

Data Mining: Theory, Methodology, Techniques, and Applications
January 2006
329 pages
ISBN: 3540325476

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. classification
  2. clustering
  3. data linking
  4. decision model
  5. probabilistic linking
  6. record linkage

Qualifiers

  • Chapter
