
Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

Published: 01 February 2007

Abstract

We develop a novel framework that automatically adapts previously learned information extraction knowledge from a source Web site to a new, unseen target site in the same domain. Two kinds of features related to the text fragments in Web documents are investigated. The first kind, called site-invariant features, are likely to remain unchanged across Web pages from different sites in the same domain. The second kind, called site-dependent features, differ between Web pages collected from different sites but are similar among pages originating from the same site. In our framework, we derive the site-invariant features from the previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. The derived site-invariant features are then exploited to automatically seek a new set of training examples in the unseen target site. Both the site-dependent and the site-invariant features of these automatically discovered training examples are considered when learning new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, given training examples from only one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
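
The adaptation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes, purely for illustration, that the text content of a fragment serves as the site-invariant feature (compared by string similarity against items collected from the source site) and that a layout label such as `div.title` serves as the site-dependent feature. All function names, the threshold, and the sample data are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def seek_training_examples(fragments, source_items, threshold=0.8):
    """Keep target-site fragments whose text closely matches a source-site item.

    Assumption: fragment text stands in for a site-invariant feature.
    """
    return [
        (text, layout)
        for text, layout in fragments
        if any(similarity(text, item) >= threshold for item in source_items)
    ]


def learn_layout(examples):
    """Return the most common layout label among the discovered examples.

    Assumption: the layout label stands in for a site-dependent feature.
    Raises IndexError if no examples were discovered.
    """
    counts = Counter(layout for _, layout in examples)
    return counts.most_common(1)[0][0]


def extract(fragments, layout):
    """Extract every target fragment that shares the learned layout label."""
    return [text for text, frag_layout in fragments if frag_layout == layout]


# Hypothetical items previously extracted from the source site.
source_titles = ["Introduction to Algorithms", "The Art of Computer Programming"]

# Hypothetical (text, layout label) fragments from the unseen target site.
target_fragments = [
    ("Introduction to Algorithms", "div.title"),
    ("Add to cart", "button.buy"),
    ("Pattern Recognition and Machine Learning", "div.title"),
    ("The Art of Computer Programming", "div.title"),
    ("$59.99", "span.price"),
]

examples = seek_training_examples(target_fragments, source_titles)
layout = learn_layout(examples)
titles = extract(target_fragments, layout)
print(layout, titles)
```

In this toy setting, two fragments are discovered as training examples because they match known source items, and the layout learned from them also recovers a title that never appeared on the source site. A real system would need richer features and a proper learner in place of the majority vote used here.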

Published In

ACM Transactions on Internet Technology  Volume 7, Issue 1
February 2007
184 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/1189740

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Web mining
  2. Wrapper adaptation
  3. machine learning
  4. text mining

Qualifiers

  • Article
