DOI: 10.1145/2009916.2010083
poster

Statistical feature extraction for cross-language web content quality assessment

Published: 24 July 2011

Abstract

Web content quality assessment is a typical static ranking problem. Statistical systems built on heuristic content features and TF-IDF features have proven effective for it, but those features are all language dependent and therefore unsuitable for cross-language ranking. In this paper, we fuse a series of language-independent features, including hostname features, domain registration features, two-layer hyperlink analysis features, and third-party Web service features, to assess Web content quality. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets show that the assessment is effective.
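The abstract does not list the individual hostname features, so the following is only a minimal sketch of what language-independent hostname features might look like. The function name `hostname_features` and every feature it computes (length, label count, digit count, hyphen count, top-level domain) are assumptions chosen for illustration, not the authors' feature set.

```python
from urllib.parse import urlsplit


def hostname_features(url: str) -> dict:
    """Compute illustrative (hypothetical) features from a URL's hostname.

    None of these features inspect page text, so they do not depend on
    the language the page is written in.
    """
    # urlsplit().hostname is None when the URL has no network location.
    host = (urlsplit(url).hostname or "").lower()
    labels = [part for part in host.split(".") if part]
    return {
        "length": len(host),                          # characters in the hostname
        "num_labels": len(labels),                    # dot-separated parts
        "num_digits": sum(ch.isdigit() for ch in host),
        "num_hyphens": host.count("-"),
        "tld": labels[-1] if labels else "",          # last label, e.g. "org"
    }


if __name__ == "__main__":
    print(hostname_features("http://www.example-site123.org/page"))
```

In a fused system such as the one the abstract describes, a vector like this would be concatenated with the other feature groups (domain registration, hyperlink analysis, third-party Web services) before being passed to a learner; how that fusion is done is not stated in the abstract.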

Published In

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
July 2011
1374 pages
ISBN:9781450307574
DOI:10.1145/2009916

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. feature extraction
  2. machine learning
  3. quality assessment

Qualifiers

  • Poster

Conference

SIGIR '11
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
