Handling Bias in Toxic Speech Detection: A Survey

Published: 13 July 2023

Abstract

Detecting online toxicity has always been challenging due to its inherent subjectivity. Factors such as context, geography, socio-political climate, and the backgrounds of the producers and consumers of a post play a crucial role in determining whether its content can be flagged as toxic. Deploying automated toxicity detection models in production can therefore sideline the very groups they aim to help in the first place. This has piqued researchers’ interest in examining unintended biases and their mitigation. Because this body of work is nascent and multi-faceted, the literature is inconsistent in its terminologies, techniques, and findings. In this article, we present a systematic study of the limitations and challenges of existing methods for mitigating bias in toxicity detection.
We look closely at proposed methods for evaluating and mitigating bias in toxic speech detection. To expose the limitations of existing methods, we also conduct a case study that introduces the concept of bias shift due to knowledge-based bias mitigation. The survey concludes with an overview of critical challenges, research gaps, and future directions. While reducing toxicity on online platforms remains an active area of research, a systematic study of the various biases and their mitigation strategies will help the research community produce robust and fair models.
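One family of evaluation methods that recurs in this literature is the per-identity AUC metrics of Borkan et al. (2019): Subgroup AUC, BPSN (Background Positive, Subgroup Negative) AUC, and BNSP (Background Negative, Subgroup Positive) AUC. As a minimal sketch of how such a bias evaluation works, and not code from the survey itself, the snippet below computes the three metrics for one identity group; the arrays `labels`, `scores`, and `in_group` are hypothetical stand-ins for gold toxicity labels, model scores, and identity-group membership.

```python
# Minimal sketch (illustrative, not from the article) of the per-identity
# bias metrics of Borkan et al. (2019). Inputs are hypothetical:
#   labels   - gold labels, 1 = toxic
#   scores   - model toxicity scores in [0, 1]
#   in_group - True where the post mentions the identity group
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(labels, scores, in_group):
    """AUC restricted to posts that mention the identity group."""
    return roc_auc_score(labels[in_group], scores[in_group])

def bpsn_auc(labels, scores, in_group):
    """Background Positive, Subgroup Negative: non-toxic in-group posts vs.
    toxic background posts. Low values flag false positives driven by
    identity terms."""
    mask = (in_group & (labels == 0)) | (~in_group & (labels == 1))
    return roc_auc_score(labels[mask], scores[mask])

def bnsp_auc(labels, scores, in_group):
    """Background Negative, Subgroup Positive: toxic in-group posts vs.
    non-toxic background posts. Low values flag false negatives."""
    mask = (in_group & (labels == 1)) | (~in_group & (labels == 0))
    return roc_auc_score(labels[mask], scores[mask])

# Toy usage on synthetic data in which identity mentions inflate the score,
# mimicking the unintended bias these metrics are designed to expose.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
in_group = rng.random(1000) < 0.2
scores = np.clip(0.6 * labels + 0.2 * in_group + 0.4 * rng.random(1000), 0, 1)
for name, fn in [("subgroup", subgroup_auc), ("bpsn", bpsn_auc), ("bnsp", bnsp_auc)]:
    print(f"{name}_auc = {fn(labels, scores, in_group):.3f}")
```

A model with no identity-driven bias would score roughly the same on all three metrics; a markedly lower BPSN AUC, as in this toy example, indicates that mere mention of the identity group pushes scores toward “toxic.”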


Published In

ACM Computing Surveys, Volume 55, Issue 13s
December 2023
1367 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3606252

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 July 2023
Online AM (Accepted Manuscript): 20 January 2023
Accepted: 09 January 2023
Revised: 26 December 2022
Received: 26 January 2022
Published in CSUR Volume 55, Issue 13s


Author Tags

  1. Toxic speech
  2. hate speech
  3. social networks
  4. unintended bias
  5. bias mitigation
  6. bias shift

Qualifiers

  • Survey

Funding Sources

  • Prime Minister Doctoral Fellowship (SERB India)
  • Ramanujan Fellowship (SERB, India)
  • Wipro Research Grant
