Extracting news text from web pages: an application for the visually impaired

Authors:

Panagiotis Papapetrou,

Lars Asker

PETRA '15: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments

Article No.: 68, Pages 1 - 4

https://doi.org/10.1145/2769493.2769573

Published: 01 July 2015

Abstract

Apart from the actual content, web pages contain several other components (referred to as boilerplate text) that describe how, and in what context, the content should be displayed. We show how content-bearing text can be efficiently separated from boilerplate text using a random forest classifier. We compare its performance with a state-of-the-art method for boilerplate detection that uses a decision tree classifier and shallow features extracted from the text. The random forest classifier yields a general improvement on both classification problems analyzed, and a significant one on the more complex problem. We also show that a small extension of the feature set can improve accuracy further. We conclude that random forest classification can achieve significantly higher accuracy than at least one of the current state-of-the-art methods for content extraction. The results can improve content extraction methods for a variety of applications, including search engine optimization and making the web more accessible for blind or visually impaired users.
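As an illustration of what "shallow features extracted from the text" can look like, the following minimal sketch computes a few per-block statistics that could be passed to a tree-based classifier. The abstract does not list the features actually used, so the four features here (word count, link density, average word length, capitalized-word ratio), the name `shallow_features`, and the regex-based tag handling are all assumptions made for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical feature set; the paper's exact features are not given in the abstract.
@dataclass
class BlockFeatures:
    num_words: int            # number of word tokens in the block
    link_density: float       # share of words that sit inside <a>...</a>
    avg_word_length: float    # mean characters per word
    capitalized_ratio: float  # share of words starting with an uppercase letter

ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"\w+")

def shallow_features(block_html: str) -> BlockFeatures:
    """Compute simple text statistics for one block of HTML."""
    # Text that appears inside anchor elements, with nested tags stripped.
    anchor_text = " ".join(TAG_RE.sub(" ", m) for m in ANCHOR_RE.findall(block_html))
    # All visible text of the block, with tags stripped.
    text = TAG_RE.sub(" ", block_html)
    words = WORD_RE.findall(text)
    link_words = WORD_RE.findall(anchor_text)
    n = len(words)
    if n == 0:
        return BlockFeatures(0, 0.0, 0.0, 0.0)
    return BlockFeatures(
        num_words=n,
        link_density=len(link_words) / n,
        avg_word_length=sum(len(w) for w in words) / n,
        capitalized_ratio=sum(w[0].isupper() for w in words) / n,
    )
```

A navigation bar made only of links gets a link density of 1.0, while a paragraph of running news text gets a higher word count and a link density near zero. Feature vectors of this kind, one per block with a content or boilerplate label, are the sort of input on which a decision tree and a random forest could be trained and compared.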

Index Terms

1. Information systems
  1. World Wide Web
    1. Web interfaces
      1. Browsers

Published In

PETRA '15: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments

July 2015

526 pages

ISBN:9781450334525

DOI:10.1145/2769493

Conference Chair: Fillia Makedon (University of Texas Arlington)

Program Chairs: Gian-Luca Mariottini, Oliver Korn, Illias Maglogiannis, Vangelis Metsis

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Sponsors

NSF: National Science Foundation
University of Texas at Austin: University of Texas at Austin
Univ. of Piraeus: University of Piraeus
NCRS: Demokritos National Center for Scientific Research
Ionian: Ionian University, GREECE

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

PETRA '15

PETRA '15: 8th PErvasive Technologies Related to Assistive Environments

July 1 - 3, 2015

Corfu, Greece
