DOI: 10.1145/2187980.2188044
demonstration

Automatically learning gazetteers from the deep web

Published: 16 April 2012

Abstract

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach, AMBER identifies records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in four iterations: from a small seed sample, we achieve 94.4% accuracy in recognizing UK locations in the fourth iteration.
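
The abstract does not spell out the learning procedure, so the following is only a minimal sketch of the general idea of bootstrapping a gazetteer from repeated record structure. It is not AMBER's actual algorithm. It assumes records have already been extracted into field slots, and the names `learn_gazetteer`, `support`, and the slot labels are hypothetical. A slot is trusted when enough of its values on one page are already in the gazetteer, and its remaining values are then added as new terms.

```python
from typing import Dict, List, Set

# Hypothetical representation: one record maps a field slot to its text.
Record = Dict[str, str]


def learn_gazetteer(pages: List[List[Record]], seed: Set[str],
                    support: float = 0.5, iterations: int = 4) -> Set[str]:
    """Grow a gazetteer from slots whose values repeat across a page's records."""
    gazetteer = {term.lower() for term in seed}
    for _ in range(iterations):
        learned: Set[str] = set()
        for records in pages:
            if not records:
                continue
            slots = {slot for record in records for slot in record}
            for slot in slots:
                values = [r[slot].strip().lower() for r in records if slot in r]
                hits = sum(v in gazetteer for v in values)
                # Trust a slot only if enough of the page's records already match.
                if hits / len(records) >= support:
                    learned.update(v for v in values if v and v not in gazetteer)
        if not learned:
            break  # nothing new was learned, so further iterations change nothing
        gazetteer |= learned
    return gazetteer


# Toy data (invented for illustration): two result pages from different sites.
pages = [
    [{"location": "Oxford", "price": "£250,000"},
     {"location": "Cambridge", "price": "£310,000"},
     {"location": "Banbury", "price": "£180,000"}],
    [{"area": "Banbury", "beds": "3"},
     {"area": "Witney", "beds": "2"}],
]
result = learn_gazetteer(pages, seed={"Oxford", "Cambridge"})
```

In this toy run, "Banbury" is learned from the first page in the first iteration, which in turn lets the second page contribute "Witney" in the second iteration. The price and bedroom slots never reach the support threshold, so they add nothing. A real system would need a far more careful acceptance test, which is why the abstract stresses that this only works with a highly accurate extractor.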

Published In

WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
April 2012
1250 pages
ISBN: 9781450312301
DOI: 10.1145/2187980
Sponsors

• Univ. de Lyon: Université de Lyon

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. example generation
2. gazetteer learning
3. vertical search
4. web data extraction

Qualifiers

• Demonstration

Conference

WWW 2012: 21st World Wide Web Conference 2012
April 16 - 20, 2012
Lyon, France
Sponsor: Univ. de Lyon

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions (23%)
