Similarizing the Influence of Words with Contrastive Learning to Defend Word-level Adversarial Text Attack

Pengwei Zhan, Jing Yang, He Wang, Chao Zheng, Xiao Huang, Liming Wang


Abstract
Neural language models are vulnerable to word-level adversarial text attacks, which generate adversarial examples by directly substituting discrete input words. Previous search methods for word-level attacks assume that the information in important words is more influential on the prediction than the information in unimportant words. Motivated by this assumption, we propose a self-supervised regularization method for Similarizing the Influence of Words with Contrastive Learning (SIWCon) that encourages the model to learn sentence representations in which words of varying importance have a more uniform influence on prediction. Experiments show that SIWCon is compatible with various training methods and effectively improves model robustness against various unforeseen adversarial attacks. The effectiveness of SIWCon is also shown intuitively through qualitative analysis and visualization of the loss landscape, sentence representations, and changes in model confidence.
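The abstract rests on two ideas: attacks rank words by their influence on the prediction, and a contrastive objective can reshape sentence representations. The sketch below illustrates both with generic, textbook ingredients; it is NOT the authors' SIWCon implementation. It assumes a deletion-based (leave-one-out) importance score, of the kind used by search methods such as TextFooler, and a standard InfoNCE loss with cosine similarity. The toy embedding table `EMB`, the classifier weights `W`, and every function name are hypothetical stand-ins for a real encoder and classifier. How SIWCon actually builds its positive and negative pairs is specified in the paper, not here.

```python
import math

# Hypothetical 2-d word embeddings standing in for a real encoder (assumption).
EMB = {
    "the": (0.0, 0.1),
    "movie": (0.1, 0.0),
    "was": (0.0, 0.0),
    "terrible": (-2.0, 0.5),
    "great": (2.0, 0.5),
    "plot": (0.2, -0.1),
}
W = (1.0, 0.0)  # hypothetical linear weights for the "positive" class


def encode(words):
    """Mean-pool word embeddings into a sentence representation."""
    if not words:
        return (0.0, 0.0)
    vecs = [EMB[w] for w in words]
    return tuple(sum(v[d] for v in vecs) / len(vecs) for d in range(2))


def prob_positive(words):
    """Sigmoid of a linear score on the pooled representation."""
    h = encode(words)
    z = sum(wi * hi for wi, hi in zip(W, h))
    return 1.0 / (1.0 + math.exp(-z))


def deletion_importance(words, label=1):
    """Leave-one-out importance: the drop in true-class probability
    when word i is deleted (higher means more influential)."""
    def p_true(ws):
        p = prob_positive(ws)
        return p if label == 1 else 1.0 - p

    base = p_true(words)
    return [base - p_true(words[:i] + words[i + 1:]) for i in range(len(words))]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a)) or 1e-12
    nb = math.sqrt(sum(x * x for x in b)) or 1e-12
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE: -log softmax score of the positive pair,
    using cosine similarity scaled by temperature tau."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


if __name__ == "__main__":
    sent = ["the", "movie", "was", "great"]
    scores = deletion_importance(sent)
    top = scores.index(max(scores))
    masked = sent[:top] + sent[top + 1:]
    neg = encode(["the", "plot", "was", "terrible"])
    print(sent[top], info_nce(encode(sent), encode(masked), [neg]))
```

In this toy, deleting "great" lowers the positive-class probability the most, so a search-based attacker would try to substitute it first. A regularizer in the spirit of the abstract could add a term such as `info_nce(encode(s), encode(s_without_top_word), negatives)` to the training loss. Minimizing it keeps the representation close to the original when an important word is removed, which makes words of different importance matter more evenly. This particular pairing is an illustrative assumption, not the paper's construction.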
Anthology ID:
2023.findings-acl.500
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7891–7906
URL:
https://aclanthology.org/2023.findings-acl.500/
DOI:
10.18653/v1/2023.findings-acl.500
Cite (ACL):
Pengwei Zhan, Jing Yang, He Wang, Chao Zheng, Xiao Huang, and Liming Wang. 2023. Similarizing the Influence of Words with Contrastive Learning to Defend Word-level Adversarial Text Attack. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7891–7906, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Similarizing the Influence of Words with Contrastive Learning to Defend Word-level Adversarial Text Attack (Zhan et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.500.pdf