research-article

FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

Authors:

Pierre Senellart

WebDB'15: Proceedings of the 18th International Workshop on Web and Databases

Pages 55 - 61

https://doi.org/10.1145/2767109.2767112

Published: 31 May 2015

Abstract

Content-intensive websites, such as blogs or news sites, present pages that contain Web articles automatically generated by content management systems. Identifying and extracting the main content of these pages is critical in many applications, such as indexing or classification. We present a novel unsupervised approach for extracting Web articles from dynamically generated Web pages. Our system, called Forest, combines structural and information-based features to target the main content generated by a Web source and published in its associated Web pages. We extensively evaluate Forest against various baselines and datasets, and report improved results over state-of-the-art content extraction techniques.
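The abstract does not spell out the algorithm, so the following is only a toy illustration of the general idea of scoring DOM tag paths across several pages of one source. It is not the Forest method. All names (`TagPathCollector`, `text_by_tag_path`, `rank_tag_paths`) are hypothetical, and the score used here (words that are not repeated verbatim on the other pages) is an assumed stand-in for the paper's structural and information-based features.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.texts = {}   # tag path -> list of text chunks found under it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.setdefault("/".join(self.stack), []).append(text)


def text_by_tag_path(html):
    """Map each tag path in one page to the text found under it."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return {path: " ".join(chunks) for path, chunks in collector.texts.items()}


def rank_tag_paths(pages):
    """Assumed toy score: count words under a path whose text is unique to one page.

    Text repeated verbatim across pages is treated as template content
    and contributes nothing.
    """
    per_page = [text_by_tag_path(page) for page in pages]
    all_paths = {path for page in per_page for path in page}
    scores = {}
    for path in all_paths:
        texts = [page.get(path, "") for page in per_page]
        scores[path] = sum(len(t.split()) for t in texts if texts.count(t) == 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Given two pages that share a navigation list but carry different article paragraphs, the paragraph's tag path ranks first and the navigation path scores zero. A real system would need a more robust notion of significance than verbatim repetition.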

Cited By

Q. Lobbé (2023). Continuity and discontinuity in web archives: a multi-level reconstruction of the firsttuesday community through persistences, continuity spaces and web cernes. Internet Histories, 7:4 (354-385). DOI: 10.1080/24701475.2023.2254050. Online publication date: 8-Sep-2023.
https://doi.org/10.1080/24701475.2023.2254050

Index Terms

1. Information systems

Published In

WebDB'15: Proceedings of the 18th International Workshop on Web and Databases

May 2015

75 pages

ISBN:9781450336277

DOI:10.1145/2767109

Editors: Julia Stoyanovich (Drexel University), Fabian M. Suchanek (Télécom ParisTech)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, VIC, Australia

Acceptance Rates

WebDB'15 Paper Acceptance Rate: 9 of 31 submissions, 29%

Overall Acceptance Rate: 30 of 100 submissions, 30%

Bibliometrics

Total citations: 1
Total downloads: 52
Downloads (last 12 months): 0
Downloads (last 6 weeks): 0

Reflects downloads up to 06 Nov 2024
