Mukhyansh: A Headline Generation Dataset for Indic Languages

Madasu, Lokesh; Kanumolu, Gopichand; Surange, Nirmal; Shrivastava, Manish

arXiv:2311.17743 (cs)

[Submitted on 29 Nov 2023]

Abstract: The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, there persist numerous challenges when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we proudly present Mukhyansh, an extensive multilingual dataset, tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.
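
For readers unfamiliar with the metric quoted in the abstract, ROUGE-L scores a generated headline by the longest common subsequence (LCS) it shares with the reference headline. The sketch below is only an illustration under stated assumptions: whitespace tokenization, an unweighted F1 combination of LCS precision and recall, and the hypothetical function names `lcs_length` and `rouge_l_f1`. The abstract does not specify the tokenizer or ROUGE implementation used in the paper, and Indic-language evaluation generally needs language-aware tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

This function returns a value between 0 and 1. A figure such as 31.43 is presumably the same kind of score multiplied by 100 and averaged over test examples and languages, though the abstract does not state the exact aggregation.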

Comments:	Accepted at PACLIC 2023
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2311.17743 [cs.CL]
	(or arXiv:2311.17743v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.17743

Submission history

From: Lokesh Madasu
[v1] Wed, 29 Nov 2023 15:49:24 UTC (198 KB)
