Abusive Span Detection for Vietnamese Narrative Texts

Authors:

Nhu-Thanh Nguyen,

Khoa Thi-Kim Phan,

Ngan Luu-Thuy Nguyen

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

Pages 471 - 478

https://doi.org/10.1145/3628797.3628921

Published: 07 December 2023

Abstract

Abuse in its various forms, including physical, psychological, verbal, sexual, financial, and cultural abuse, harms mental health. However, few studies have applied natural language processing (NLP) to this problem in Vietnam. We therefore contribute a human-annotated dataset for detecting abusive content in Vietnamese narrative texts. We sourced the texts from VnExpress, a popular Vietnamese online newspaper where readers often share stories that contain abusive content. Identifying and categorizing abusive spans in these texts posed significant challenges during dataset creation, which also motivated our research. To assess the difficulty of the dataset, we experimented with lightweight baseline models that freeze PhoBERT and XLM-RoBERTa and feed their hidden states into a BiLSTM. In our experiments, the PhoBERT-based model outperforms the other models on both the labeled and the unlabeled abusive span detection tasks. These results indicate that PhoBERT is a promising basis for future improvements.
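The difference between the labeled and unlabeled span detection tasks can be illustrated with a small scoring sketch. The abstract does not describe the tagging scheme or the evaluation protocol, so everything below is an assumption made for illustration: a BIO tagging scheme, exact-match span scoring, and the label names `VERBAL` and `PHYSICAL` (borrowed from the abuse types listed above). The helper names `bio_to_spans` and `span_f1` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label). Illustrative only.
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence (assumed scheme) into a list of spans."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # "B-" always opens a span; a stray "I-" opens one leniently.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[Span], pred: List[Span], labeled: bool = True) -> float:
    """Exact-match span F1. With labeled=False only boundaries must match."""
    def key(span: Span):
        return span if labeled else span[:2]

    gold_set = {key(s) for s in gold}
    pred_set = {key(s) for s in pred}
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example: the second span has the right boundaries but the wrong type.
gold = bio_to_spans(["O", "B-VERBAL", "I-VERBAL", "O", "B-PHYSICAL"])
pred = bio_to_spans(["O", "B-VERBAL", "I-VERBAL", "O", "B-VERBAL"])

print(span_f1(gold, pred, labeled=True))   # 0.5
print(span_f1(gold, pred, labeled=False))  # 1.0
```

Under these assumptions, the unlabeled score rewards finding where an abusive span is, while the labeled score additionally requires the correct abuse category, so the labeled score can never exceed the unlabeled one for the same predictions.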

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

December 2023

1058 pages

ISBN:9798400708916

DOI:10.1145/3628797

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SOICT 2023

SOICT 2023: The 12th International Symposium on Information and Communication Technology

December 7 - 8, 2023

Ho Chi Minh, Vietnam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
