research-article

Automatic rule refinement for information extraction

Published: 01 September 2010

Abstract

Rule-based information extraction from text is increasingly used to populate databases and to support structured queries over unstructured text. Specifying suitable extraction rules requires considerable skill, and standard practice is to refine rules iteratively, at substantial effort. In this paper, we show that techniques developed in the context of data provenance, which determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules and a set of extracted results labeled as correct or incorrect, our technique suggests a ranked list of rule modifications for an expert rule specifier to consider. We implemented the technique in the SystemT information extraction system developed at IBM Research - Almaden and demonstrate its effectiveness experimentally.
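To make the idea in the abstract concrete, the following is a minimal sketch, not the paper's algorithm and not SystemT code. It assumes each extracted tuple carries a lineage (the set of rules that contributed to it) and a human-supplied correct/incorrect label. It then ranks rules by how many incorrect results would disappear, net of correct results lost, if the rule were tightened or disabled. The names `Extraction`, `rank_refinements`, and the rule ids are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    text: str            # the extracted span
    lineage: frozenset   # ids of the rules that contributed to this tuple (assumed given)
    correct: bool        # label supplied by a human reviewer


def rank_refinements(extractions):
    """Rank rules by the net effect of removing every tuple they contribute to.

    Returns a list of (rule_id, incorrect_removed, correct_removed), best first.
    This is a crude stand-in for a real refinement search: it considers only
    one kind of modification (dropping a rule's output entirely).
    """
    rules = set().union(*(e.lineage for e in extractions))
    scored = []
    for rule in rules:
        bad = sum(1 for e in extractions if rule in e.lineage and not e.correct)
        good = sum(1 for e in extractions if rule in e.lineage and e.correct)
        scored.append((rule, bad, good))
    # Highest net gain first; ties broken by rule id for a deterministic order.
    scored.sort(key=lambda s: (-(s[1] - s[2]), s[0]))
    return scored


# Hypothetical labeled output of a toy person-name extractor.
EXAMPLE = [
    Extraction("Anna Smith", frozenset({"dict_first", "first_last"}), True),
    Extraction("Bob Jones", frozenset({"dict_first", "first_last"}), True),
    Extraction("Morgan Stanley", frozenset({"caps_pair"}), False),
    Extraction("New York", frozenset({"caps_pair"}), False),
    Extraction("Mary Poppins", frozenset({"caps_pair"}), True),
]

if __name__ == "__main__":
    for rule, bad, good in rank_refinements(EXAMPLE):
        print(f"{rule}: removes {bad} incorrect, loses {good} correct")
```

In this toy data the hypothetical `caps_pair` rule ranks first, because changing it would remove two incorrect results at the cost of one correct one. A realistic system would consider finer-grained modifications (for example, adding a filter predicate to a rule) rather than only dropping a rule's output.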


Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment
