research-article

Collective entity resolution in multi-relational familial networks

Published: 01 December 2019

Abstract

Entity resolution in settings with rich relational structure often introduces complex dependencies between co-references. Exploiting these dependencies is challenging—it requires seamlessly combining statistical, relational, and logical dependencies. One task of particular interest is entity resolution in familial networks. In this setting, multiple partial representations of a family tree are provided, from the perspective of different family members, and the challenge is to reconstruct a family tree from these multiple, noisy, partial views. This reconstruction is crucial for applications such as understanding genetic inheritance, tracking disease contagion, and performing census surveys. Here, we design a model that incorporates statistical signals (such as name similarity), relational information (such as sibling overlap), logical constraints (such as transitivity and bijective matching), and predictions from other algorithms (such as logistic regression and support vector machines), in a collective model. We show how to integrate these features using probabilistic soft logic, a scalable probabilistic programming framework. In experiments on real-world data, our model significantly outperforms state-of-the-art classifiers that use relational features but are incapable of collective reasoning.
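To make the role of a logical constraint such as transitivity concrete, the following is a minimal sketch of how probabilistic soft logic relaxes a rule into a hinge-loss penalty using Łukasiewicz logic. It is an illustration only, not the authors' model and not the PSL library API: the rule `Same(a, b) & Same(b, c) -> Same(a, c)`, the function names, the weight, and the soft co-reference scores are all hypothetical.

```python
from itertools import permutations


def distance_to_satisfaction(body, head):
    """Lukasiewicz relaxation of the rule body_1 & ... & body_n -> head.

    All truth values lie in [0, 1]. The rule is fully satisfied when the
    returned distance is 0; otherwise the distance grows linearly (a hinge).
    """
    return max(0.0, sum(body) - (len(body) - 1) - head)


def transitivity_penalty(same, refs, weight=1.0):
    """Sum the weighted hinge losses of the (hypothetical) transitivity rule
    Same(a, b) & Same(b, c) -> Same(a, c) over all ordered triples."""
    total = 0.0
    for a, b, c in permutations(refs, 3):
        total += weight * distance_to_satisfaction(
            [same[a, b], same[b, c]], same[a, c]
        )
    return total


# Hypothetical soft co-reference scores between three references.
refs = ["r1", "r2", "r3"]
scores = {("r1", "r2"): 0.9, ("r2", "r3"): 0.8, ("r1", "r3"): 0.2}

# Make the scores symmetric so that Same(a, b) == Same(b, a).
same = {}
for (a, b), value in scores.items():
    same[a, b] = value
    same[b, a] = value

penalty = transitivity_penalty(same, refs)
print(f"total transitivity penalty: {penalty:.3f}")
```

In this toy input, r1 matches r2 and r2 matches r3 with high confidence while r1 and r3 are scored as unlikely matches, so the transitivity rule is violated and contributes a positive penalty. A collective model would adjust all three scores jointly to reduce such penalties, which is the kind of reasoning a pairwise classifier cannot perform.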


Published In

Knowledge and Information Systems, Volume 61, Issue 3
Dec 2019
579 pages

Publisher

Springer-Verlag
Berlin, Heidelberg

Author Tags

1. Entity resolution
2. Data integration
3. Familial networks
4. Multi-relational networks
5. Collective classification
6. Family reconstruction
7. Probabilistic soft logic
