DOI: 10.1145/1935826.1935869

Scalable knowledge harvesting with high precision and high recall

Published: 09 February 2011

Abstract

Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data.
This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of n-gram itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypothesis space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.
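
The abstract does not spell out how n-gram itemsets or pattern-occurrence statistics are computed, so the following Python sketch is only an illustration of the general idea, not PROSPERA's actual method. It assumes that a pattern is the set of word bigrams found in the text between two entities, and that a simple confidence score is the share of an n-gram's occurrences that coincide with known-good seed pairs rather than known-bad counter-seed pairs. The function names, the entity pairs, and the scoring formula are all hypothetical.

```python
from collections import defaultdict


def ngram_itemset(text, n=2):
    """Return the set of word n-grams in `text` (a simplified n-gram itemset)."""
    tokens = text.lower().split()
    return frozenset(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def pattern_stats(occurrences, seeds, counter_seeds):
    """Count, per n-gram, co-occurrences with seed and counter-seed pairs.

    `occurrences` is a list of (entity_pair, connecting_text) tuples.
    Pairs that are in neither set are ignored in this sketch.
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for pair, text in occurrences:
        for gram in ngram_itemset(text):
            if pair in seeds:
                pos[gram] += 1
            elif pair in counter_seeds:
                neg[gram] += 1
    return pos, neg


def confidence(gram, pos, neg):
    """Assumed score: fraction of labeled occurrences that match seeds."""
    total = pos.get(gram, 0) + neg.get(gram, 0)
    return pos.get(gram, 0) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical entity pairs and connecting phrases, for illustration only.
    occurrences = [
        (("Alice", "Bob"), "was advised by"),
        (("Carol", "Dan"), "was advised by"),
        (("Eve", "Frank"), "was advised against by"),
    ]
    seeds = {("Alice", "Bob"), ("Carol", "Dan")}
    counter_seeds = {("Eve", "Frank")}

    pos, neg = pattern_stats(occurrences, seeds, counter_seeds)
    for gram in sorted(set(pos) | set(neg)):
        print(f"{gram!r}: confidence {confidence(gram, pos, neg):.2f}")
```

In a pipeline of the kind the abstract describes, scores like these could be used to discard low-confidence n-grams before reasoning and to weight the clauses handed to a MaxSat solver. How PROSPERA actually derives its weights is described in the paper itself, not here.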

Supplementary Material

JPG File (wsdm2011_nakashole_skh_01.jpg)
MP4 File (wsdm2011_nakashole_skh_01.mp4)

Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN: 9781450304931
DOI: 10.1145/1935826
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information extraction
  2. knowledge harvesting
  3. scalability

Qualifiers

  • Research-article

Conference

Acceptance Rates

WSDM '11 paper acceptance rate: 83 of 372 submissions (22%)
Overall acceptance rate: 498 of 2,863 submissions (17%)
