DOI: 10.1145/3323503.3360628
short-paper

Quality assessment of Wikipedia content using topic models

Published: 29 October 2019

Abstract

The web has become a major knowledge provider for society, allowing people not only to consume information but also to produce it. Collaborative documents bring significant advantages and decentralization, but they also raise questions about their quality. In this work, we explore quality assessment of collaborative documents using the documents' topics. The proposed approach improved the accuracy of quality assessment of Wikipedia content by 3.2%. The main contribution of this paper is thus an analysis of how topic modelling can be used to improve quality prediction performance.


Published In

WebMedia '19: Proceedings of the 25th Brazilian Symposium on Multimedia and the Web
October 2019
537 pages
ISBN:9781450367639
DOI:10.1145/3323503

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automatic quality assessment
  2. information quality
  3. latent dirichlet allocation
  4. machine learning
  5. topic prediction

Qualifiers

  • Short-paper

Conference

WebMedia '19
WebMedia '19: Brazilian Symposium on Multimedia and the Web
October 29 - November 1, 2019
Rio de Janeiro, Brazil

Acceptance Rates

Overall Acceptance Rate 270 of 873 submissions, 31%
