DOI: 10.1145/3442442.3452350
research-article

Inferring Sociodemographic Attributes of Wikipedia Editors: State-of-the-art and Implications for Editor Privacy

Published: 03 June 2021 Publication History

Abstract

In this paper, we investigate the state of the art in machine learning models for inferring sociodemographic attributes of Wikipedia editors from their public profile pages, and the corresponding implications for editor privacy. To build these models, ground truth labels are obtained via different strategies, using information that editors publicly disclose on their profile pages. Different embedding techniques are used to derive features from editors’ profile texts. In comparative evaluations of different machine learning models, we show that the highest prediction accuracy is obtained for the attribute gender, with precision values of 82% and 91% for women and men, respectively, and an averaged F1-score of 0.78. For other attributes such as age group, education, and religion, the classifiers achieve F1-scores between 0.32 and 0.74, depending on the model class. Using only publicly disclosed information about Wikipedia editors, we highlight issues surrounding editor privacy on Wikipedia and discuss ways to mitigate this problem. We believe our work can help start a conversation about carefully weighing the potential benefits and harms that come with the existence of information-rich, pre-labeled profile pages of Wikipedia editors.
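The abstract reports per-class precision and an averaged F1-score. The following is a minimal sketch of how such figures can be computed from predicted and true labels. It assumes the average is a macro average (an unweighted mean over classes), which the abstract does not state. The function names `per_class_scores` and `macro_f1` are hypothetical, and the toy labels are invented for illustration; they are not data from the paper.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (assumed averaging scheme)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels, invented purely for illustration (not data from the paper).
y_true = ["a", "a", "a", "b", "b", "b", "b", "b"]
y_pred = ["a", "a", "b", "b", "b", "b", "b", "a"]

scores = per_class_scores(y_true, y_pred)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(scores):.2f}")
```

Per-class precision and an averaged F1-score measure different things, so a model can show high precision for each class while its averaged F1 is lower, for example when recall for one class is weak.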

Published In

WWW '21: Companion Proceedings of the Web Conference 2021
April 2021
726 pages
ISBN: 9781450383134
DOI: 10.1145/3442442
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. BERT
  2. Wikipedia
  3. editor attribute prediction

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '21
WWW '21: The Web Conference 2021
April 19 - 23, 2021
Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
