AlcLaM: Arabic Dialectal Language Model

Ahmed, Murtadha; Alfasly, Saghir; Wen, Bo; Qasem, Jamaal; Ahmed, Mohammed; Liu, Yunfeng

arXiv:2407.13097 (cs)

[Submitted on 18 Jul 2024]

Abstract: Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often suffer from high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We use this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13 GB of text, which amounts to just 7.8%, 10.2%, and 21.3% of the data used by existing models such as CAMeL, MARBERT, and ArBERT, respectively. Despite the limited training data, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks. AlcLaM is available on GitHub and HuggingFace.
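As a rough sanity check on the data-size comparison in the abstract, the sketch below back-computes the corpus size each percentage would imply. It assumes the percentages give the share of each baseline's training data that AlcLaM's 13 GB represents. The resulting sizes are derived estimates, not figures stated in the abstract, and the helper name `implied_size_gb` is purely illustrative.

```python
# Back-of-the-envelope check of the ratios quoted in the abstract.
# Assumption: each percentage is (AlcLaM data size) / (baseline data size).

ALCLAM_GB = 13.0  # training text size stated in the abstract

# Percentages quoted in the abstract, keyed by baseline model name.
ratio_percent = {"CAMeL": 7.8, "MARBERT": 10.2, "ArBERT": 21.3}


def implied_size_gb(alclam_gb: float, percent: float) -> float:
    """Return the baseline corpus size implied if alclam_gb is `percent` of it."""
    return alclam_gb / (percent / 100.0)


# Derived estimates only; the abstract does not state these sizes directly.
implied = {
    name: implied_size_gb(ALCLAM_GB, pct) for name, pct in ratio_percent.items()
}

for name, size in implied.items():
    print(f"{name}: roughly {size:.0f} GB implied")
```

Under this reading, the smaller the percentage, the larger the implied baseline corpus, so CAMeL's would be the largest of the three and ArBERT's the smallest.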

Comments:	Accepted by ArabicNLP 2024, presented at ACL 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2407.13097 [cs.CL]
	(or arXiv:2407.13097v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.13097

Submission history

From: Murtadha Ahmed
[v1] Thu, 18 Jul 2024 02:13:50 UTC (23 KB)
