skip to main content
10.1145/775152.775179acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Data extraction and label assignment for web databases

Published: 20 May 2003 Publication History

Abstract

Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called, DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute for wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).

References

[1]
B. Adelberg. "NoDoSE - A tool for semi-automatically extracting structured and semistructured data from text documents," Proc. ACM SIGMOD Conf., 1998, 283--294.
[2]
R. Baumgartner, S. Flesca and G. Gottlob. "Visual web information extraction with Lixto," Proc.27th VLDB Conf., 2001, 119--128.
[3]
BrightPlanet Corp. "The Deep Web: Surfacing hidden value." http://www.completeplanet.com/Tutorials/DeepWeb/
[4]
D. Buttler, L. Liu and C. Pu. "A fully automated object extraction system for the World Wide Web," Proc. Intl. Conf. on Distributed Computing Systems, 2001, 361--370.
[5]
C.H. Chang and S.C. Lui. "IEPAD: information extraction based on pattern discovery," Proc. 10th Intl. Conf. on World Wide Web, 2001, 681--688.
[6]
S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman and J. Widom. "The TRIMMIS project: integration of heterogeneous information sources," Proc. IPSJ Conference, 1994, 7--18.
[7]
V. Crescenzi, G. Mecca and P. Merialdo. "ROADRUNNER: towards automatic data extraction from large web sites," Proc. 27th VLDB Conf., 2001, 109--118.
[8]
D. Embley, Y. Jiang and Y. K. Ng. "Record-boundary discovery in web documents," Proc. ACM SIGMOD Conf., 1999, 467--478.
[9]
D. Florescu, A. Y. Levy and A. O. Mendelzon. "Database techniques for the world-wide web: a survey," SIGMOD Record 27(3), 1998, 59--74.
[10]
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge, 1997.
[11]
C. Hus and M. Dung. "Generating finite-state transducers for semi-structured data extraction from the web," Information Systems 23(8), 1998, 521--538.
[12]
T. Kirk, A. Levy, Y. Sagiv and D. Srivastava. "The Information Manifold," Proc. the AAAI Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments, 1995, 85--91.
[13]
L. Liu, C. Pu and W. Han. "XWRAP: An XML-enabled wrapper construction system for web information sources," Proc. 16th Intl. Conf. on Data Engineering (ICDE), 2000, 611--621.
[14]
I. Muslea, S. Minton and C. Knoblock. "A hierarchical approach to wrapper induction," Proc. 3rd Intl. Conf. on Autonomous Agents, 1999, 190--197.
[15]
S. Raghavan and H. Garcia-Molina. "Crawling the hidden web," Proc. 27th VLDB Conf., 2001, 129--138.
[16]
S. Raghavan and H. Garcia-Molina. "Integrating diverse information management systems: a brief survey," IEEE Data Engineering Bulletin 24(4), 2001, 44--52.
[17]
B. Ribeiro-Neto, A. Laender and A.S. da Silva. "Extracting semi-structured data through examples," Proc. Intl. Conf. on Information and Knowledge Management, 1999, 94--101.
[18]
A. Sahuguet and F. Azavant. "WysiWyg web wrapper factory (W4F)," Proc. 8th World Wide Web, 1999.
[19]
J. Wang and F. Lochovsky. "Data-rich section extraction from HTML pages," Proc. 3rd Conf. on Web Information Systems Engineering, 2002, 313--322.
[20]
J. Wang and F. Lochovsky. "Wrapper Induction based on nested pattern discovery.", Technical Report HKUST-CS-27-02, Dept. of Computer Science, Hong Kong U. of Science and Technology, 2002 (submitted for publication). http://www.cs.ust.hk/~cswangjy/paper/tr-27-02.pdf
[21]
World Wide Web Consortium. Document Object Model Level 3 Core Specification, 2001.
[22]
World Wide Web Consortium. HTML 4.01 Specification, 1999.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN:1581136803
DOI:10.1145/775152
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 May 2003

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. HTML forms
  2. automatic wrapper induction
  3. data annotation
  4. hidden web
  5. information integration
  6. web information extraction

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)28
  • Downloads (Last 6 weeks)2
Reflects downloads up to 02 Feb 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media