DOI: 10.1145/2983323.2983740

Vandalism Detection in Wikidata

Published: 24 October 2016

Abstract

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and in various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems that use it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on four classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015, on which it achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.991. It significantly outperforms the state of the art, represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
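
To make the recipe above concrete, the following is a minimal, self-contained sketch rather than the authors' implementation: it trains a random forest classifier (one plausible choice, given the paper's references to random forests [8] and scikit-learn [23]) on a handful of toy content and context features and reports ROC-AUC. The synthetic revisions stand in for the WDVC-2015 corpus [14], and the four features below are illustrative placeholders, not the paper's 47 features.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n = 5000

    # Synthetic per-revision features (columns): comment length, share of
    # upper-case characters in the added text, registered-user flag, and the
    # editor's prior edit count -- simple stand-ins for content/context signals.
    X = np.column_stack([
        rng.integers(0, 200, n),   # content: comment length
        rng.random(n),             # content: upper-case ratio
        rng.integers(0, 2, n),     # context: registered user (1) vs. anonymous (0)
        rng.poisson(50, n),        # context: editor's prior edit count
    ])

    # Synthetic labels: in this toy data, anonymous editors writing mostly
    # upper-case text are more likely to be vandals.
    p_vandal = 0.02 + 0.3 * (X[:, 2] == 0) * (X[:, 1] > 0.7)
    y = rng.random(n) < p_vandal

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    scores = clf.predict_proba(X_test)[:, 1]  # vandalism probability per revision
    print("ROC-AUC:", roc_auc_score(y_test, scores))

On the real task, such features would be computed from a revision's content (e.g., its comment and added text) and its context (e.g., the editing user and their history), and labels would come from rolled-back revisions, as in the construction of the WDVC-2015 corpus.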

References

[1]
B. Adler, L. de Alfaro, and I. Pye. Detecting Wikipedia Vandalism Using WikiTrust. CLEF Notebooks 2010.
[2]
B. Adler, L. de Alfaro, S. M. Mola-Velasco, P. Rosso, and A. G. West. Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features. CICLing 2011.
[3]
E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding High-Quality Content in Social Media. WSDM 2008.
[4]
J. Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201: 81--105, Aug. 2013.
[5]
K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. SIGMOD 2008.
[6]
K. Boyd, V. Santos Costa, J. Davis, and C. D. Page. Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation. ICML 2012.
[7]
L. Breiman. Bagging Predictors. Machine learning, 24 (2): 123--140, 1996.
[8]
L. Breiman. Random Forests. Machine Learning, 45 (1): 5--32, Oct. 2001.
[9]
S. Burgstaller-Muehlbacher, A. Waagmeester, E. Mitraka, J. Turner, T. E. Putman, J. Leong, P. Pavlidis, L. Schriml, B. M. Good, and A. I. Su. Wikidata as a Semantic Framework for the Gene Wiki Initiative. bioRxiv 032144, 2015.
[10]
J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. ICML 2006.
[11]
T. Fawcett. An introduction to ROC analysis. Pattern recognition letters, 27 (8): 861--874, 2006.
[12]
T. Gärtner, P. Flach, A. Kowalczyk, and A. Smola. Multi-Instance Kernels. ICML 2002.
[13]
H. He and E. Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21 (9): 1263--1284, Sept. 2009.
[14]
S. Heindorf, M. Potthast, B. Stein, and G. Engels. Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis. SIGIR 2015.
[15]
IPligence. http://www.ipligence.com, 2014.
[16]
K. Y. Itakura and C. L. A. Clarke. Using Dynamic Markov Compression to Detect Vandalism in the Wikipedia. SIGIR 2009.
[17]
A. Ladsgroup and A. Halfaker. Wikidata features. https://github.com/wiki-ai/wb-vandalism/blob/31d74f8a50a8c43dd446d41cafee89ada5a051f8/wb_vandalism/feature_lists/wikidata.py.
[18]
B. Li, T. Jin, M. R. Lyu, I. King, and B. Mak. Analyzing and predicting question quality in community question answering services. WWW 2012.
[19]
C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20]
E. Mitraka, A. Waagmeester, S. Burgstaller-Muehlbacher, L. M. Schriml, A. I. Su, and B. M. Good. Wikidata: A platform for data integration and dissemination for the life sciences and beyond. bioRxiv 031971, 2015.
[21]
S. M. Mola-Velasco. Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals: Lab Report for PAN at CLEF 2010. CLEF Notebooks 2010.
[22]
P. Neis, M. Goetz, and A. Zipf. Towards Automatic Vandalism Detection in OpenStreetMap. ISPRS International Journal of Geo-Information, 2012.
[23]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825--2830, 2011.
[24]
M. Potthast, B. Stein, and R. Gerling. Automatic Vandalism Detection in Wikipedia. ECIR 2008.
[25]
L. Ramaswamy, R. Tummalapenta, K. Li, and C. Pu. A Content-Context-Centric Approach for Detecting Vandalism in Wikipedia. Collaboratecom 2013.
[26]
C. Shah and J. Pomerantz. Evaluating and Predicting Answer Quality in Community QA. SIGIR 2010.
[27]
C. H. Tan, E. Agichtein, P. Ipeirotis, and E. Gabrilovich. Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation. WSDM 2014.
[28]
T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The Great Migration. WWW 2016.
[29]
K.-N. Tran and P. Christen. Cross Language Prediction of Vandalism on Wikipedia Using Article Views and Revisions. PAKDD 2013.
[30]
K.-N. Tran and P. Christen. Cross-Language Learning from Bots and Users to Detect Vandalism on Wikipedia. IEEE Transactions on Knowledge and Data Engineering, 27 (3): 673--685, Mar. 2015.
[31]
K.-N. Tran, P. Christen, S. Sanner, and L. Xie. Context-Aware Detection of Sneaky Vandalism on Wikipedia Across Multiple Languages. PAKDD 2015.
[32]
L. von Ahn. Offensive/Profane Word List. http://www.cs.cmu.edu/~biglou/resources/, 2009.
[33]
D. Vrandečić and M. Krötzsch. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM, 2014.
[34]
W. Y. Wang and K. R. McKeown. "Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-semantic Modeling. COLING 2010.
[35]
A. West and I. Lee. Multilingual Vandalism Detection Using Language-Independent & Ex Post Facto Evidence. CLEF Notebooks 2011.
[36]
A. G. West, S. Kannan, and I. Lee. Detecting Wikipedia Vandalism via Spatio-temporal Analysis of Revision Metadata. EUROSEC 2010.
[37]
Q. Wu, D. Irani, C. Pu, and L. Ramaswamy. Elusive Vandalism Detection in Wikipedia: A Text Stability-based Approach. CIKM 2010.
[38]
Wikimedia Foundation. Wikidata Abuse Filter. https://www.wikidata.org/wiki/Special:AbuseFilter, 2015.
[39]
Wikimedia Foundation. Objective Revision Evaluation Service. https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service, 2016.
[40]
Wikimedia Foundation. Wikidata:Rollbackers. https://www.wikidata.org/wiki/Wikidata:Rollbackers, 2016.

Published In

CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN:9781450340731
DOI:10.1145/2983323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data quality
  2. knowledge base
  3. trust
  4. vandalism

Qualifiers

  • Research-article

Conference

CIKM'16: ACM Conference on Information and Knowledge Management
October 24 - 28, 2016
Indianapolis, Indiana, USA

Acceptance Rates

CIKM '16 Paper Acceptance Rate: 160 of 701 submissions, 23%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
