DOI: 10.1145/3319535.3363271

Poster: Adversarial Examples for Hate Speech Classifiers

Published: 06 November 2019

Abstract

With the advent of the Internet, social media platforms have become an increasingly popular medium of communication. Platforms like Twitter and Quora allow people to express their opinions on a large scale. These platforms are, however, plagued by hate speech and toxic content, which is generally sexist, homophobic, or racist. Automatic text classification can filter out toxic content to some extent. In this paper, we discuss adversarial attacks on hate speech classifiers. We demonstrate that by changing the text only slightly, a classifier can be fooled into misclassifying a toxic comment as acceptable. We attack hate speech classifiers with known attacks and also introduce four new ones. We find that our attacks can degrade the performance of a Random Forest classifier by 20%. We hope that our work sheds light on the vulnerabilities of text classifiers and opens the door to further research on this topic.
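The abstract describes attacks that make small, meaning-preserving edits to a comment until the classifier's label flips. As a rough illustration, the Python sketch below mounts a greedy character-level (typo) attack on a toy TF-IDF + Random Forest toxicity classifier. The pipeline, the perturbation strategy, and the toy data are all assumptions for illustration; they are not the authors' actual attacks or dataset.

# Illustrative sketch only: a greedy character-level attack on a toy
# toxicity classifier. The pipeline, perturbation, and data below are
# assumptions for illustration, not the poster's actual attacks.
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for a labelled hate-speech dataset.
texts = [
    "you are wonderful", "have a great day", "thanks for your help",
    "you are a stupid idiot", "I hate people like you", "get lost you moron",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = acceptable, 1 = toxic

clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
clf.fit(texts, labels)

def perturb(text, rng):
    # One small edit: swap two adjacent characters in a random word,
    # mimicking a typo that pushes the word out of the vocabulary.
    words = text.split()
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = rng.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def attack(text, budget=50, seed=0):
    # Greedy search: keep perturbing until the predicted label flips
    # from toxic (1) to acceptable (0), or the edit budget runs out.
    rng = random.Random(seed)
    candidate = text
    for _ in range(budget):
        candidate = perturb(candidate, rng)
        if clf.predict([candidate])[0] == 0:
            return candidate
    return None

print(attack("you are a stupid idiot"))  # a typo'd variant, or None if no flip

Because TF-IDF treats a misspelled word as out-of-vocabulary, a single transposition can erase the feature the classifier leans on most, which is one plausible reason such small edits can flip a prediction without changing the comment's meaning for a human reader.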




Published In

CCS '19: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security
November 2019
2755 pages
ISBN: 9781450367479
DOI: 10.1145/3319535
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGSAC
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2019


Author Tags

  • adversarial machine learning
  • hate speech

Qualifiers

  • Poster

Conference

CCS '19
Sponsor: SIGSAC

Acceptance Rates

CCS '19 paper acceptance rate: 149 of 934 submissions (16%)
Overall acceptance rate: 1,261 of 6,999 submissions (18%)


