
Wrapper verification

Published: 15 March 2000

Abstract

Many Internet information-management applications (e.g., information integration systems) require a library of wrappers: specialized information extraction procedures that translate a source's native format into a structured representation suitable for further application-specific processing. Maintaining wrappers is tedious and error-prone, because the formatting regularities on which wrappers rely change frequently on the decentralized and dynamic Internet. The wrapper verification problem is to determine whether a wrapper is operating correctly. Standard regression testing approaches are inappropriate, because both the formatting regularities on which wrappers rely and the source's underlying content may change. We introduce RAPTURE, a fully implemented, domain-independent wrapper verification algorithm. RAPTURE computes a probabilistic similarity measure between a wrapper's expected and observed output, where similarity is defined in terms of simple numeric features (e.g., the length, or the fraction of punctuation characters) of the extracted strings. Experiments with numerous actual Internet sources demonstrate that RAPTURE performs substantially better than standard regression testing.
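
As an illustration only, the idea of comparing a wrapper's expected and observed output through simple numeric features can be sketched in Python. This is not RAPTURE itself: the function names (features, fit, similarity), the per-feature normal model, the independence assumption used to combine features, and the sample strings are all assumptions made for this example. Only the two features, string length and fraction of punctuation characters, come from the abstract.

```python
import math
import string
from statistics import mean, stdev


def features(s):
    """Two simple numeric features of one extracted string."""
    n = len(s)
    punct = sum(ch in string.punctuation for ch in s)
    return {"length": float(n), "punct_fraction": punct / n if n else 0.0}


def fit(known_good_strings):
    """Estimate (mean, standard deviation) per feature from output that is
    known to be correct. Needs at least two strings."""
    rows = [features(s) for s in known_good_strings]
    model = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        # Assumption: floor the deviation so a constant feature does not
        # cause a division by zero later.
        model[name] = (mean(values), max(stdev(values), 1e-3))
    return model


def similarity(model, observed_strings):
    """Score in [0, 1]; values near 0 suggest the wrapper may be broken.

    Assumption: each feature's sample mean is treated as normally
    distributed, and features are treated as independent, so the
    per-feature tail probabilities are simply multiplied.
    """
    rows = [features(s) for s in observed_strings]
    score = 1.0
    for name, (mu, sigma) in model.items():
        observed_mean = mean(r[name] for r in rows)
        std_error = sigma / math.sqrt(len(rows))
        z = abs(observed_mean - mu) / std_error
        # Two-sided tail probability of a standard normal variable.
        score *= math.erfc(z / math.sqrt(2))
    return score


# Hypothetical data: a wrapper that extracts prices.
known_good = ["$12.99", "$7.50", "$104.00", "$3.25"]
healthy = ["$15.49", "$8.00", "$99.95"]  # new content, same format
broken = ['<td class="price">', "<b>Add to cart</b>", '<img src="x.gif">']

model = fit(known_good)
print("healthy:", similarity(model, healthy))
print("broken: ", similarity(model, broken))
```

In this sketch the healthy output scores far higher than the broken output even though none of its strings appear in the training data. That is the property an exact-match regression test lacks when the source's content changes.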

Published In

World Wide Web, Volume 3, Issue 2
Oct 2000
75 pages

Publisher

Kluwer Academic Publishers

United States
