DOI: 10.1145/872757.872799
Article

Extracting structured data from Web pages

Published: 09 June 2003

Abstract

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
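
The input/output contract described in the abstract (a set of template-generated pages in, the encoded values out) can be illustrated with a deliberately naive sketch. This is not the algorithm from the paper: it assumes, purely for illustration, that a token belongs to the template if it occurs exactly once in every page, and that everything between template tokens is data. The function names `tokenize`, `deduce_template`, and `extract_values`, as well as the two sample pages, are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(page: str) -> List[str]:
    """Split a page into HTML tags and whitespace-delimited words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def deduce_template(pages: List[List[str]]) -> List[str]:
    """Guess the template: tokens occurring exactly once in every page.

    Assumption for this sketch only; real templates repeat tokens and
    contain optional or iterated parts that this heuristic ignores.
    """
    counts = [Counter(p) for p in pages]
    common = {t for t in counts[0] if all(c[t] == 1 for c in counts)}
    template = [t for t in pages[0] if t in common]
    # The guess is only usable if all pages agree on the token order.
    for p in pages[1:]:
        if [t for t in p if t in common] != template:
            raise ValueError("pages disagree on template token order")
    return template


def extract_values(page: List[str], template: List[str]) -> List[str]:
    """Return the non-empty token runs found between template tokens."""
    anchors = set(template)
    values: List[str] = []
    current: List[str] = []
    for tok in page:
        if tok in anchors:
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Made-up pages sharing one layout (placeholder titles and authors).
raw_pages = [
    "<html><h1>Title: Alpha Beta</h1><p>Author: Ann Lee</p></html>",
    "<html><h1>Title: Gamma</h1><p>Author: Bob Ray</p></html>",
]
token_pages = [tokenize(p) for p in raw_pages]
template = deduce_template(token_pages)
rows = [extract_values(p, template) for p in token_pages]
print(template)
print(rows)
```

Because empty gaps are skipped, a page with a missing field would yield a shorter row; handling optional fields and repeated template tokens is exactly where a real method needs more machinery than this sketch has.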



Published In

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN:158113634X
DOI:10.1145/872757

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS03

Acceptance Rates

SIGMOD '03 Paper Acceptance Rate 53 of 342 submissions, 15%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
