DOI:10.1145/2767109.2767112

FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

Published: 31 May 2015

Abstract

Content-intensive websites, such as blogs or news sites, present pages containing Web articles that are automatically generated by content management systems. Identifying and extracting the main content of these pages is critical in many applications, such as indexing or classification. We present a novel unsupervised approach for extracting Web articles from dynamically generated Web pages. Our system, called Forest, combines structural and information-based features to target the main content generated by a Web source and published in its associated Web pages. We extensively evaluate Forest against various baselines and datasets, and report improved results over state-of-the-art techniques in content extraction.

        Published In

        WebDB'15: Proceedings of the 18th International Workshop on Web and Databases
        May 2015
        75 pages
        ISBN:9781450336277
        DOI:10.1145/2767109
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        SIGMOD/PODS'15: International Conference on Management of Data
        May 31 - June 4, 2015
        Melbourne, VIC, Australia

        Acceptance Rates

        WebDB'15 Paper Acceptance Rate 9 of 31 submissions, 29%;
        Overall Acceptance Rate 30 of 100 submissions, 30%
