research-article

Development and user experiences of an open source data cleaning, deduplication and record linkage system

Published: 16 November 2009

Abstract

Record linkage, also known as database matching or entity resolution, is now recognised as a core step in the KDD process. Data mining projects increasingly require that information from several sources be combined before the actual mining can be conducted. Also of increasing interest is the deduplication of a single database. The objectives of record linkage and deduplication are to identify, match and merge all records that relate to the same real-world entities. Because real-world data is commonly 'dirty', data cleaning is an important first step in many deduplication, record linkage, and data mining projects.
In this paper, an overview of the Febrl (Freely Extensible Biomedical Record Linkage) system is provided, and the results of a recent survey of Febrl users are discussed. Febrl includes a variety of functionalities required for data cleaning, deduplication and record linkage, and it provides a graphical user interface that facilitates its application for users who do not have programming experience.
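
The clean-then-match idea described above can be illustrated with a minimal sketch. It is not Febrl code and does not use Febrl's API: the function names (`clean`, `similarity`, `find_duplicates`), the 0.85 threshold, and the sample records are all assumptions made for illustration, and the string comparison relies on Python's standard `difflib` rather than any technique named in the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def clean(value):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower())
    return " ".join(kept.split())


def similarity(a, b):
    """Approximate string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(records, threshold=0.85):
    """Return pairs of record ids whose cleaned name and address look alike.

    The threshold is an illustrative assumption, not a recommended value.
    """
    cleaned = {rid: (clean(name), clean(addr)) for rid, name, addr in records}
    matches = []
    # Compare every pair of records; real systems reduce this with blocking.
    for id_a, id_b in combinations(sorted(cleaned), 2):
        name_a, addr_a = cleaned[id_a]
        name_b, addr_b = cleaned[id_b]
        score = (similarity(name_a, name_b) + similarity(addr_a, addr_b)) / 2
        if score >= threshold:
            matches.append((id_a, id_b))
    return matches


# Hypothetical sample data: records 1 and 2 differ only in case and punctuation.
records = [
    (1, "Peter  Christen", "12 Main St."),
    (2, "peter christen", "12 main st"),
    (3, "Mary Smith", "4 High Street"),
]
duplicates = find_duplicates(records)
print(duplicates)
```

In this sketch, cleaning removes the superficial differences between the first two records so that they compare as identical, which is why the cleaning step comes before matching. Comparing all pairs is quadratic in the number of records, so the approach only suits small inputs.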

    Published In

    ACM SIGKDD Explorations Newsletter  Volume 11, Issue 1
    June 2009
    56 pages
    ISSN:1931-0145
    EISSN:1931-0153
    DOI:10.1145/1656274

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GUI
    2. Python
    3. data linkage
    4. data standardisation
    5. database matching
    6. open source software

    Qualifiers

    • Research-article
