SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

Nguyen, Luan Thanh; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy

arXiv:2209.10482 (cs)

[Submitted on 21 Sep 2022]

Abstract: Text classification is a common task in natural language processing and computational linguistics with many practical applications. As the number of social media users grows, the rapid accumulation of data is driving new research on Social Media Text Classification (SMTC), also called social media text mining, over these valuable resources. In contrast to English, Vietnamese, a low-resource language, has not yet received focused attention or been thoroughly exploited. Inspired by the success of GLUE, we introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, a collection of datasets and models across a diverse set of SMTC tasks. Using the proposed benchmark, we implement and analyze the effectiveness of several multilingual BERT-based models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models (PhoBERT, viBERT, vELECTRA, and viBERT4news) on the SMTCE tasks. Monolingual models outperform multilingual models and achieve state-of-the-art results on all text classification tasks. The benchmark provides an objective assessment of multilingual and monolingual BERT-based models, which will benefit future studies of BERTology for the Vietnamese language.

Comments:	Accepted at the 36th Annual Meeting of the Pacific Asia Conference on Language, Information and Computation (PACLIC 36)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2209.10482 [cs.CL]
	(or arXiv:2209.10482v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2209.10482

Submission history

From: Luan Thanh Nguyen
[v1] Wed, 21 Sep 2022 16:33:46 UTC (195 KB)
