DOI: 10.1145/2769493.2769573
short-paper

Extracting news text from web pages: an application for the visually impaired

Published: 01 July 2015

Abstract

Apart from the actual content, web pages contain several other components (referred to as boilerplate text) that describe how, and in what context, the content should be displayed. We show how content-bearing text can be efficiently separated from boilerplate text using a random forest classifier. We compare its performance with that of another state-of-the-art method for boilerplate detection, which uses a decision tree classifier and shallow features extracted from the text. The random forest classifier improves results on both classification problems analyzed, significantly so on the more complex one. We also show that a small increase in the range of the feature set can improve accuracy even further. The conclusion is that random forest classification can achieve significantly higher accuracy than at least one of the current state-of-the-art methods for content extraction. The results can improve content extraction methods for a variety of applications, including search engine optimization and making the web more accessible for the blind or visually impaired.
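To make the approach concrete, the following is a minimal sketch of random forest classification over shallow text features. It is not the authors' implementation: the three features (word count, link density, share of capitalized words), the toy training blocks, and all function names are assumptions made for illustration, and the forest is a deliberately small pure-Python version that uses bootstrap samples and one randomly chosen feature per split.

```python
import random
from collections import Counter

def shallow_features(text, linked_words):
    """Hypothetical shallow features: word count, link density, capitalized-word ratio."""
    words = text.split()
    n = len(words)
    if n == 0:
        return [0.0, 0.0, 0.0]
    capitalized = sum(w[0].isupper() for w in words)
    return [float(n), linked_words / n, capitalized / n]

def gini(labels):
    """Gini impurity of a list of class labels."""
    return 1.0 - sum((c / len(labels)) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(X, y, rng, max_depth=3):
    """Grow a small tree; each split looks at a random subset of the features."""
    if max_depth == 0 or len(set(y)) == 1:
        return majority(y)  # leaf node: a class label
    k = max(1, int(len(X[0]) ** 0.5))  # features tried per split
    best = None
    for f in rng.sample(range(len(X[0])), k):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint threshold between neighbouring values
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # the sampled features were constant in this node
        return majority(y)
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], rng, max_depth - 1),
            build_tree([X[i] for i in right], [y[i] for i in right], rng, max_depth - 1))

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are (feature, threshold, left, right)
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def build_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_tree(tree, row) for tree in forest])

# Invented toy blocks: (text, number of words inside links, label).
BLOCKS = [
    ("Home News Sport Weather", 4, "boilerplate"),
    ("Log In Register", 3, "boilerplate"),
    ("Share Tweet Email Print", 4, "boilerplate"),
    ("About Us Contact Privacy Terms", 5, "boilerplate"),
    ("The city council voted on Tuesday to approve a new budget for public "
     "transport and libraries.", 0, "content"),
    ("Researchers said the results could help screen readers skip navigation "
     "menus and read only the article text.", 1, "content"),
    ("Officials expect the construction work to finish by the end of next year, "
     "although delays remain possible.", 0, "content"),
    ("The forecast calls for heavy rain across the region for the rest of the "
     "week.", 0, "content"),
]
X = [shallow_features(text, links) for text, links, _ in BLOCKS]
y = [label for _, _, label in BLOCKS]
forest = build_forest(X, y)

print(predict_forest(forest, shallow_features("Sign Up", 2)))
```

On this toy data each feature separates the two classes on its own, so the short, fully linked block "Sign Up" is voted boilerplate. Real pages are far noisier, which is where averaging many decorrelated trees would be expected to help over a single decision tree.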


    Published In

    PETRA '15: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments
    July 2015
    526 pages
    ISBN:9781450334525
    DOI:10.1145/2769493

    Sponsors

    • NSF: National Science Foundation
    • University of Texas at Austin: University of Texas at Austin
    • Univ. of Piraeus: University of Piraeus
    • NCRS: Demokritos National Center for Scientific Research
    • Ionian: Ionian University, GREECE

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. classification
    2. text extraction

    Qualifiers

    • Short-paper
