Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing

Johnson, Isaac; Kaffee, Lucie-Aimée; Redi, Miriam

arXiv:2410.08918 (cs)

[Submitted on 11 Oct 2024]

Abstract: Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated for use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.

Comments:	Accepted to NLP for Wikipedia Workshop at EMNLP '24
Subjects:	Computers and Society (cs.CY)
Cite as:	arXiv:2410.08918 [cs.CY]
	(or arXiv:2410.08918v1 [cs.CY] for this version)
	https://doi.org/10.48550/arXiv.2410.08918

Submission history

From: Isaac Johnson
[v1] Fri, 11 Oct 2024 15:46:09 UTC (40 KB)
