research-article

OLID-BR: offensive language identification dataset for Brazilian Portuguese

Published: 03 May 2023

Abstract

Social media has transformed how society is interconnected. While this connectivity offers many benefits, it also carries significant drawbacks, particularly the proliferation of fake news and the wide dissemination of hate speech. Identifying offensive comments is critical for ensuring user safety, which is why industry and academia have been working on solutions to this problem. Prior research on hate speech detection has focused predominantly on English, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine-grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual and cross-lingual models. The five NLP tasks available in OLID-BR cover the detection of offensive comments; the classification of offense types such as racism, LGBTQphobia, sexism, and xenophobia; the identification of the type and the target of offensive comments; and the extraction of toxic spans from offensive comments. All of these tasks can enhance content moderation systems by providing deep contextual analysis or by highlighting the spans that make a text toxic. We further evaluate the dataset using state-of-the-art BERT-based and NER models, demonstrating the usefulness of OLID-BR for developing toxicity detection systems for Portuguese texts.
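
To make the toxic-span extraction task mentioned in the abstract concrete, the following is a minimal sketch of how a labeled comment might be represented and how toxic spans could be recovered from it. The record layout, the field names (`text`, `is_offensive`, `toxic_spans`), and the use of character offsets to mark spans are all assumptions made for illustration; they are not taken from the OLID-BR release, and the example comment is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    """Hypothetical record; field names are illustrative, not the dataset's schema."""
    text: str
    is_offensive: bool  # offensive vs. not offensive
    # Assumption: toxic spans are stored as a flat list of character offsets.
    toxic_spans: List[int] = field(default_factory=list)


def extract_spans(text: str, offsets: List[int]) -> List[str]:
    """Group consecutive character offsets and return the substrings they cover."""
    spans: List[str] = []
    start: Optional[int] = None
    prev: Optional[int] = None
    for i in sorted(set(offsets)):
        if start is None:
            start = prev = i          # open the first run
        elif i == prev + 1:
            prev = i                  # extend the current run
        else:
            spans.append(text[start:prev + 1])
            start = prev = i          # gap found: close the run, open a new one
    if start is not None:
        spans.append(text[start:prev + 1])
    return spans


# Invented example, not drawn from the dataset.
example = Comment(
    text="you are a stupid person",
    is_offensive=True,
    toxic_spans=[10, 11, 12, 13, 14, 15],
)
print(extract_spans(example.text, example.toxic_spans))
```

A moderation interface could use the returned substrings to highlight the part of a comment that makes it toxic, which is the use case the abstract describes.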

Published In

Language Resources and Evaluation, Volume 58, Issue 4
Dec 2024
422 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 03 May 2023
Accepted: 27 March 2023

Author Tags

  1. Hate speech
  2. Offensive comments
  3. OLID-BR
  4. Dataset
  5. NLP
  6. Natural language processing
  7. Offensive language detection
  8. Content moderation systems
  9. Toxicity detection systems
  10. Toxic spans detection
  11. NER
  12. BERT
