DOI: 10.1145/3319535.3363271

Poster: Adversarial Examples for Hate Speech Classifiers

Published: 06 November 2019

Abstract

With the advent of the Internet, social media platforms have become an increasingly popular medium of communication. Platforms like Twitter and Quora allow people to express their opinions on a large scale. These platforms are, however, plagued by hate speech and toxic content, which is generally sexist, homophobic, or racist. Automatic text classification can filter out toxic content to some extent. In this paper, we discuss adversarial attacks on hate speech classifiers. We demonstrate that by changing the text only slightly, a classifier can be fooled into misclassifying a toxic comment as acceptable. We attack hate speech classifiers with known attacks and also introduce four new ones. We find that our attacks can degrade the performance of a Random Forest classifier by 20%. We hope that our work sheds light on the vulnerabilities of text classifiers and opens the door to further research on this topic.
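The abstract describes attacks that make small, meaning-preserving edits to a comment until the classifier's label flips. As a rough illustration, the Python sketch below mounts a greedy character-level (typo) attack on a toy TF-IDF + Random Forest toxicity classifier. The pipeline, the perturbation strategy, and the toy data are all assumptions for illustration; they are not the authors' actual attacks or dataset.

# Illustrative sketch only: a greedy character-level attack on a toy
# toxicity classifier. The pipeline, perturbation, and data below are
# assumptions for illustration, not the poster's actual attacks.
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for a labelled hate-speech dataset.
texts = [
    "you are wonderful", "have a great day", "thanks for your help",
    "you are a stupid idiot", "I hate people like you", "get lost you moron",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = acceptable, 1 = toxic

clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
clf.fit(texts, labels)

def perturb(text, rng):
    # One small edit: swap two adjacent characters in a random word,
    # mimicking a typo that pushes the word out of the vocabulary.
    words = text.split()
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = rng.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def attack(text, budget=50, seed=0):
    # Greedy search: keep perturbing until the predicted label flips
    # from toxic (1) to acceptable (0), or the edit budget runs out.
    rng = random.Random(seed)
    candidate = text
    for _ in range(budget):
        candidate = perturb(candidate, rng)
        if clf.predict([candidate])[0] == 0:
            return candidate
    return None

print(attack("you are a stupid idiot"))  # a typo'd variant, or None if no flip

Because TF-IDF treats a misspelled word as out-of-vocabulary, a single transposition can erase the feature the classifier leans on most, which is one plausible reason such small edits can flip a prediction without changing the comment's meaning for a human reader.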




Published In

CCS '19: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security
November 2019
2755 pages
ISBN: 9781450367479
DOI: 10.1145/3319535
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGSAC
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2019


Author Tags

  • adversarial machine learning
  • hate speech

Qualifiers

  • Poster

Conference

CCS '19
Sponsor: SIGSAC

Acceptance Rates

CCS '19 paper acceptance rate: 149 of 934 submissions (16%)
Overall acceptance rate: 1,261 of 6,999 submissions (18%)


