research-article
Just Accepted

THAR- Targeted Hate Speech Against Religion: A high-quality Hindi-English code-mixed Dataset with the Application of Deep Learning Models for Automatic Detection

Online AM: 18 March 2024

Abstract

Over the last decade, social media has become a widely used medium for individuals to express their views on various topics. However, some individuals exploit social media platforms to spread hatred through their comments and posts, some of which target individuals, communities, or religions. Given the deep emotional connections people have to their religious beliefs, this form of hate speech can be divisive and harmful, and may lead to mental health issues as well as social disorder. Therefore, there is a need for algorithmic approaches to automatically detect instances of hate speech. Most existing studies in this area focus on social media content in English, and as a result several low-resource languages lack computational resources for the task. This study attempts to address this research gap by providing a high-quality annotated dataset designed specifically for identifying hate speech against religions in the Hindi-English code-mixed language. The dataset, “Targeted Hate Speech Against Religion” (THAR), consists of 11,549 comments and has been annotated by five independent annotators. It comprises two subtasks: (i) Subtask-1 (binary classification) and (ii) Subtask-2 (multi-class classification). To assess the quality of annotation, the Fleiss' Kappa measure of inter-annotator agreement has been employed. The suitability of the dataset is then further explored by applying different standard deep learning and transformer-based models. The transformer-based model Multilingual Representations for Indian Languages (MuRIL) is found to outperform the other implemented models in both subtasks, achieving macro-average and weighted-average F1 scores of 0.78 and 0.78 for Subtask-1, and 0.65 and 0.72 for Subtask-2, respectively. The experimental results not only confirm the suitability of the dataset but also advance research on the automatic detection of hate speech, particularly in the low-resource Hindi-English code-mixed language.
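The abstract relies on two kinds of measures: Fleiss' Kappa for agreement among the five annotators, and macro-averaged versus weighted-averaged F1 for model performance. The following is a minimal sketch of how such measures are commonly computed; it is not the authors' code. The function names `fleiss_kappa` and `f1_averages` and the toy labels in the usage note are hypothetical and do not come from the THAR dataset.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items, each rated by the same number of annotators.

    `ratings` is a list of lists; each inner list holds one label per annotator.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of annotator pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Only one category was ever used, so Kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


def f1_averages(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true instances per class
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0

    # Macro: every class counts equally. Weighted: classes count by support.
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted
```

As a usage example, `fleiss_kappa([["hate"] * 5, ["non-hate"] * 5])` describes two comments on which all five annotators agree, and `f1_averages(["a", "a", "a", "b"], ["a", "a", "b", "b"])` returns different macro and weighted values because class `a` has three times the support of class `b`. A gap of this kind between the two averages, as reported for Subtask-2, typically arises when classes are imbalanced and per-class F1 scores differ.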

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Just Accepted
EISSN: 2375-4702

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. Hate Speech
2. Targeted Hate Speech
3. Religious Hate Speech
4. Hate Speech Dataset
5. Deep Learning
