DOI: 10.1145/2187836.2187900
research-article

ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking

Published: 16 April 2012

Abstract

We tackle the problem of entity linking for large collections of online pages. Our system, ZenCrowd, identifies entities in natural language text using state-of-the-art techniques and automatically connects them to the Linked Open Data cloud. We show how one can take advantage of human intelligence to improve the quality of the links by dynamically generating micro-tasks on an online crowdsourcing platform. We develop a probabilistic framework to make sensible decisions about candidate links and to identify unreliable human workers. We evaluate ZenCrowd in a real deployment and show how a combination of probabilistic reasoning and crowdsourcing techniques can significantly improve the quality of the links while limiting the amount of work performed by the crowd.
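
The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea of combining an automatic matcher's confidence with crowd votes weighted by worker reliability. It is not the ZenCrowd framework itself. The function names, the independence assumption between workers, and the smoothed-agreement reliability estimate are all assumptions made for illustration.

```python
from math import prod


def link_posterior(prior, votes, reliability):
    """Posterior probability that a candidate link is correct.

    prior:       confidence from an automatic matcher (hypothetical input).
    votes:       dict worker_id -> bool (True = worker says the link is correct).
    reliability: dict worker_id -> probability the worker answers correctly,
                 assumed to lie strictly between 0 and 1.
    Assumption: workers make errors independently of one another.
    """
    # Likelihood of the observed votes if the link is really correct.
    like_true = prod(
        reliability[w] if v else 1 - reliability[w] for w, v in votes.items()
    )
    # Likelihood of the observed votes if the link is really wrong.
    like_false = prod(
        1 - reliability[w] if v else reliability[w] for w, v in votes.items()
    )
    numerator = prior * like_true
    return numerator / (numerator + (1 - prior) * like_false)


def update_reliability(decisions, answers, smoothing=1.0):
    """Re-estimate worker reliability as smoothed agreement with final decisions.

    decisions: dict link_id -> bool (the decision taken for each link).
    answers:   dict worker_id -> dict link_id -> bool (each worker's votes).
    """
    estimates = {}
    for worker, given in answers.items():
        agree = sum(1 for link, vote in given.items() if decisions[link] == vote)
        # Laplace-style smoothing keeps estimates away from exactly 0 or 1.
        estimates[worker] = (agree + smoothing) / (len(given) + 2 * smoothing)
    return estimates
```

With a neutral prior of 0.5, two approving workers of reliability 0.9 and 0.8, and one dissenting worker of reliability 0.6, `link_posterior` gives roughly 0.96. A worker who repeatedly disagrees with the final decisions ends up with a lower reliability estimate, so their later votes carry less weight.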

        Published In

        WWW '12: Proceedings of the 21st international conference on World Wide Web
        April 2012
        1078 pages
        ISBN:9781450312295
        DOI:10.1145/2187836

        Sponsors

        • Univ. de Lyon: Universite de Lyon

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. crowdsourcing
        2. entity linking
        3. linked data
        4. probabilistic reasoning

        Qualifiers

        • Research-article

        Conference

        WWW 2012
        Sponsor:
        • Univ. de Lyon
        WWW 2012: 21st World Wide Web Conference 2012
        April 16 - 20, 2012
        Lyon, France

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
