SemRoDe: Macro Adversarial Training to Learn Representations that are Robust to Word-Level Attacks

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/21314

Title:

SemRoDe: Macro Adversarial Training to Learn Representations that are Robust to Word-Level Attacks

Journal Title:

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

DOI:

10.18653/v1/2024.naacl-long.443

Publication URL:

http://dx.doi.org/10.18653/v1/2024.naacl-long.443

Authors:

Brian Formento, Wenjie Feng, Chuan-Sheng Foo, Anh Tuan Luu, See-Kiong Ng

Keywords:

adversarial attacks, Natural Language Processing

Publication Date:

27 July 2024

Citation:

Formento, B., Feng, W., Foo, C.-S., Luu, A. T., & Ng, S.-K. (2024). SemRoDe: Macro Adversarial Training to Learn Representations that are Robust to Word-Level Attacks. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 8005–8028. https://doi.org/10.18653/v1/2024.naacl-long.443

Abstract:

Language models (LMs) are indispensable tools for natural language processing tasks, but their vulnerability to adversarial attacks remains a concern. While current research has explored adversarial training techniques, their improvements to defend against word-level attacks have been limited. In this work, we propose a novel approach called Semantic Robust Defence (SemRoDe), a Macro Adversarial Training strategy to enhance the robustness of LMs. Drawing inspiration from recent studies in the image domain, we investigate and later confirm that in a discrete data setting such as language, adversarial samples generated via word substitutions do indeed belong to an adversarial domain exhibiting a high Wasserstein distance from the base domain. Our method learns a robust representation that bridges these two domains. We hypothesize that if samples were not projected into an adversarial domain, but instead to a domain with minimal shift, it would improve attack robustness. We align the domains by incorporating a new distance-based objective. With this, our model is able to learn more generalized representations by aligning the model’s high-level output features and therefore better handling unseen adversarial samples. This method can be generalized across word embeddings, even when they share minimal overlap at both vocabulary and word-substitution levels. To evaluate the effectiveness of our approach, we conduct experiments on BERT and RoBERTa models on three datasets. The results demonstrate promising state-of-the-art robustness.
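
The idea of adding a distance-based term that pulls adversarial features toward base features can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify which distance the objective uses, so the per-dimension 1-D Wasserstein-1 estimate below is an assumption chosen for simplicity, and the names `wasserstein_1d`, `feature_shift`, `total_loss`, and the weight `lam` are all hypothetical.

```python
from typing import List, Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and the same size")
    # For equal-size 1-D samples, the optimal transport plan pairs sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def feature_shift(base: List[List[float]], adv: List[List[float]]) -> float:
    """Average per-dimension Wasserstein-1 distance between two feature batches.

    Assumption: treating each feature dimension independently is a
    simplification for illustration, not the distance used in the paper.
    """
    dims = len(base[0])
    per_dim = [
        wasserstein_1d([row[d] for row in base], [row[d] for row in adv])
        for d in range(dims)
    ]
    return sum(per_dim) / dims


def total_loss(task_loss: float,
               base: List[List[float]],
               adv: List[List[float]],
               lam: float = 1.0) -> float:
    """Task loss plus a weighted alignment penalty (hypothetical weighting)."""
    return task_loss + lam * feature_shift(base, adv)


# Toy feature batches: the adversarial batch is shifted by 0.5 in dimension 0.
base_features = [[0.0, 1.0], [1.0, 0.0]]
adv_features = [[0.5, 1.0], [1.5, 0.0]]
shift = feature_shift(base_features, adv_features)
loss = total_loss(0.3, base_features, adv_features, lam=2.0)
```

In this toy case the penalty is zero when both batches have the same per-dimension distribution and grows as the adversarial features drift away, which is the behaviour a domain-alignment term is meant to encourage.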

License type:

Attribution 4.0 International (CC BY 4.0)

Funding Info:

This research is supported by the National Research Foundation - Industry Alignment Fund - Pre-positioning (IAF-PP).
Grant Reference no.: NA

URI:

https://oar.a-star.edu.sg/communities-collections/articles/21314

ISSN:

Nil

Collections:

Institute for Infocomm Research

Manuscripts in This Item:

File	Size	Format
2024naacl-long443.pdf	1.48 MB	PDF