DOI: 10.1145/1281192.1281288
Article

Webpage understanding: an integrated approach

Published: 12 August 2007

Abstract

Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text contents inside HTML elements is still an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are no longer directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text contents on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Thus, combining page structure understanding and text content understanding leads to an integrated solution to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.

Supplementary Material

JPG File (p903-zhu-200.jpg)
JPG File (p903-zhu-768.jpg)
Low Resolution (p903-zhu-200.mov)
High Resolution (p903-zhu-768.mov)

Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. conditional random fields
  2. text processing
  3. webpage understanding

Qualifiers

  • Article

Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
