DOI: 10.1145/375663.375682

Automatic segmentation of text into structured records

Published: 01 May 2001 Publication History

Abstract

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific rule-based systems.
We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
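To make the HMM idea concrete, the sketch below labels the tokens of an address with element names using plain Viterbi decoding over a flat HMM. It is an illustration only, not DATAMOLD's model: the abstract's length distributions and external dictionary are left out, the probabilities are hand-set rather than learned from training examples, and every name in it (`STATES`, `START`, `TRANS`, `VOCAB`, `emission`, `segment`) is a hypothetical choice made for this example.

```python
import math

# Hypothetical element labels for a toy address schema (not from the paper).
STATES = ["HouseNo", "Street", "City"]

# Hand-set probabilities, assumed for illustration; a real system would
# estimate them from labelled training examples.
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.10, "Street": 0.80, "City": 0.10},
    "Street": {"HouseNo": 0.05, "Street": 0.50, "City": 0.45},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.90},
}
VOCAB = {
    "Street": {"main": 0.3, "street": 0.4, "road": 0.3},
    "City": {"springfield": 0.6, "boston": 0.4},
}


def emission(state, token):
    """Probability that `state` emits `token` (crude, with a smoothing floor)."""
    if state == "HouseNo":
        return 0.9 if token.isdigit() else 0.01
    if token.isdigit():
        return 0.01
    # Unknown words get a small constant mass so no path has zero probability.
    return VOCAB[state].get(token.lower(), 0.05)


def segment(tokens):
    """Return (token, label) pairs for the most probable state sequence."""
    if not tokens:
        return []
    # scores[t][s]: log-probability of the best path ending in state s at t.
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
               for s in STATES}]
    back = []  # back[t - 1][s]: best predecessor of state s at position t
    for tok in tokens[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, tok)))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))


if __name__ == "__main__":
    for token, label in segment("221 Main Street Springfield".split()):
        print(f"{token}\t{label}")
```

With these toy parameters, the input `221 Main Street Springfield` is labelled HouseNo, Street, Street, City. Consecutive tokens that share a label form one field of the structured record. Working in log space avoids numeric underflow on longer inputs.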



Published In

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001
630 pages
ISBN:1581133324
DOI:10.1145/375663

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Conference

SIGMOD/PODS01

Acceptance Rates

SIGMOD '01 Paper Acceptance Rate 44 of 293 submissions, 15%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

