1 Introduction

With the growing popularity of Deep Learning (DL), DL-based systems are being applied in a wide variety of areas [1, 2]. While the performance of DL-based systems is compelling enough for widespread adoption and deployment, these systems have their own limitations. In the presence of an adversary, the many domains they serve often become vulnerable [3]. These systems are susceptible to adversarial attacks: input samples that an attacker designs specifically to cause a DL model to misclassify. Moreover, adversarial examples generated for one model can be transferred to attack other models [4]. Different defense methods have been proposed to protect models against these attacks. Among them, Adversarial Training (AT) [5] is the most popular defense; it trains the model not only on clean samples but also on generated adversarial samples. AT can be viewed as a data augmentation method, in which the training data are augmented with samples that improve the robustness of the model.

Data augmentation has proven effective at increasing model generalization by expanding the training dataset with additional synthesized samples. Managing the quality of training data plays an essential role in improving model performance, and many data augmentation techniques have been proposed to this end [6]. Such techniques improve the generalization of the model on out-of-distribution samples, and adversarial samples belong to this family. The area of data augmentation has seen remarkable progress recently [7], with methods proposed to augment data of any modality in a generic way [6, 8]. However, it is important not to use all data samples for augmentation, but only the important samples whose augmentation improves the generalization of the model [9]. To achieve this goal, the dataset should be analyzed to find the training samples that are influential for predicting a given test sample.

Our goal is to design a robust selective data augmentation (RSDA) approach for optimizing training data that increases the robustness of neural network models without a large drop in accuracy on clean samples. The idea is to generate new training samples as shifts of original samples in directions that help increase the model's robustness. The generated samples are not intended to resemble the original samples; instead, they carry features that increase the model's generalization on adversarial examples. The approach is intended to be model-agnostic and to work for any data modality.

We propose RSDA to train the model on robustly augmented training data and thus obtain a model that is as accurate as possible on both clean and adversarial samples. For each training example, we construct an adversarial example, find the training samples that lead to the incorrect classification of that adversarial example into the adversarial class, and apply transformations in the latent space in directions that reduce this misclassification without affecting the accuracy on the test set. The approach stems from the observation that training samples other than the original one may shape the decision boundary of the model and lead to this incorrect classification.

Accordingly, in this paper, we propose a novel robust selective data augmentation method that uses data transformations in the latent space to boost AT. Instead of simply considering the classification loss on adversarial and clean samples, as in AT, we also search for optimal transformations of specific samples, which help to shrink the sample space available to adversarial examples. The overall architecture consists of three components: a feature generator, a discriminator, and a classifier. Together, these components produce a robust, compact latent space.

In short, the contributions of this paper are directed at improving the generalization of AT on both adversarial and clean samples by formulating the problem as a data augmentation task in which transformations that lead to better robustness are performed in the latent space on selected samples. Since these transformations are performed in the latent space, our method can be applied to any input modality, including images, text, and graphs. We evaluate RSDA on the MNIST and CIFAR-10 datasets against the FGSM, PGD, and BIM adversarial attacks. Experimental results show that RSDA enhances the generalization of AT in all conducted tests.

2 Related Work

In this section, we review existing work on deep learning robustness, automatic data augmentation, and finding influential samples, since RSDA builds on these lines of work to boost adversarial training.

2.1 Adversarial Attacks and Defenses

The vulnerability of DL models to adversarial attacks was first reported in [10], where the existence of adversarial attacks was attributed to the low occurrence probability of adversarial samples in real data, which makes classifying them difficult. Subsequently, the Fast Gradient Sign Method (FGSM) attack was introduced in [5], and the existence of adversarial samples was attributed to the high-dimensional linearity of DL models. FGSM is very fast because it performs a one-step attack, but its success rate is sometimes low. Therefore, an iterative FGSM (I-FGSM) was proposed in [11], where the loss function is increased in multiple small steps instead of one large step. Many adversarial attack algorithms followed, including the basic iterative method (BIM), projected gradient descent (PGD) [11], the Carlini and Wagner (C&W) attacks [12], and others.

Meanwhile, many defenses against adversarial attacks have been introduced recently, including heuristic and certified defenses [13]. A heuristic defense increases the robustness of the model against specific attacks without giving theoretical guarantees. The most effective heuristic defense is adversarial training (AT) [5], which augments the training data with adversarial samples generated by the previously mentioned attacks. Empirically, PGD adversarial training achieves the best accuracy against a wide range of adversarial attacks on several datasets [14]. Many other heuristic defenses proposed in the literature depend mainly on performing transformations and denoising in the input or feature space, but these defenses have been shown to be ineffective against adaptive attacks [15]. Certified defenses, on the other hand, provide a guarantee on their lowest accuracy under a pre-defined group of adversarial attacks. A popular certified defense formulates an adversarial polytope and convexly relaxes it to define an upper bound [16]. This upper bound guarantees that no attack with the specified limitations can surpass the certified attack success rate. However, these defenses are still restricted to small datasets and small models, making them inapplicable to real-life scenarios.

Since AT is the most successful widely applicable defense, we use it as a baseline and propose a new method to boost it.

2.2 Automatic Data Augmentation

In practice, many transformations are performed on the original samples to augment the training data. For image data, for example, different processing techniques are used, such as flipping, cropping, and color shifting. The success of image data augmentation has motivated applying data augmentation in all possible domains where DL is applied.

To make augmentation universal across data modalities, [8] proposes a generic automated data augmentation approach called MODALS. MODALS performs smooth transitions from real samples to artificial samples in the embedding space while preserving their class identity. It exploits four universal automated data transformations in the latent space: hard example interpolation, hard example extrapolation, Gaussian noise, and difference transform. Deep learning models, however, may not perform best when trained on all of the training data; generalization can instead be enhanced by dropping some unfavorable samples [9]. In a first round of training, the unfavorable samples are identified and dropped, and the model is then retrained from scratch on the reduced training dataset in a second round. Important examples can also be found early in training instead of waiting until the end, as in [17], where a scoring method identifies important and difficult examples early in training and prunes the training dataset without large sacrifices in test accuracy. Another direction is to learn the importance or weight of the samples and assign these weights to training examples based on their gradient directions, as in [18]. Ren et al. [18] perform validation at every training iteration to determine the example weights of the current batch and assign importance weights to examples in every iteration.

2.3 Influential Samples

ML systems are complicated, and explaining how a model arrives at its predictions is very challenging. This problem can be tackled using influence functions to identify the training points most responsible for a given prediction [19]. To formalize the impact of a training point on a prediction, the authors of [19] ask the counterfactual: “what would happen if this training point did not exist, or if the values of this training point were changed slightly?”. They study the effect of a training sample on a test sample by approximating the effect of removing the training sample on the loss of the test sample. Another approach for finding prediction-relevant samples decomposes the pre-activation prediction of a neural network into a linear combination of activations of training points, with weights called “representer values” that capture the importance of each training point for the learned parameters of the network [20]. A positive representer value indicates that similarity to that training point is excitatory, while a negative representer value indicates that similarity to that training point is inhibitory. A more recent approach (TracIn) for calculating the influence of a training sample on the prediction of a test sample is introduced in [21], inspired by the fundamental theorem of calculus, which decomposes the difference between a function at two points using the gradients along the path between them. Analogously, TracIn decomposes the difference between the loss of the test point at the end of training and at the beginning of training along the path taken by the training process. For a particular training example \(x_{i}\), the authors approximate the idealized influence by summing an approximation over all the iterations in which \(x_{i}\) was used to update the parameters.

In this paper, we take advantage of the success of the previously mentioned methods to arrive at a robust augmentation approach that enhances AT. We compare different approaches for finding influential samples and discuss which approach works better in the case of adversarial attacks. We also propose different data augmentations inspired by [8].

3 The Proposed Approach

In this section, we describe our proposed RSDA approach to boost AT. We are given an original training dataset \(D_\mathrm{{tr}}=\{(x_{(i)},y_{(i)})\}_{i=1}^{m}\), where \(x_{(i)} \in X\) is the input sample, \(y_{(i)}\) is the corresponding label, and m is the size of the dataset. In clean scenarios, we need to learn a predictive function \(f: X \rightarrow Y\) that reduces the loss on a test dataset \(D_\mathrm{{tst}}\). However, the function f is not robust against adversarial attacks. Adversarial training reduces the loss on both original samples from X and adversarial samples from \(X^\mathrm{{adv}}\) by augmenting the dataset \(D_\mathrm{{tr}}\) with perturbed data \(X^\mathrm{{adv}}\) obtained by attacking the original dataset. However, AT augments the dataset only with these adversarial samples and does not consider adding any other artificial samples that may increase robustness. RSDA builds on two observations to boost the model’s robustness. Firstly, for each \(x \in X^\mathrm{{adv}}\) there may be a set of samples \(S\subseteq D_\mathrm{{tr}}\) that, if altered, may increase the model’s robustness against x; examples of such samples are mislabeled samples, multi-labelled samples, and highly similar samples from two different classes. Secondly, if the latent space of a deep neural network is learned properly, each class region is mostly convex and clustered. This allows us to make smooth alterations to the selected samples by performing linear transformations on them in the latent space without changing their class identity. We empirically validate these observations on the MNIST and CIFAR datasets.

Accordingly, in this work, we propose, implement, and validate a novel RSDA approach to boost and complement AT. In RSDA, for each training example \(x \in D_\mathrm{{tr}}\), we first construct an adversarial sample \({x^\mathrm{{adv}}}\). Next, we find its corresponding set S that should be altered to help the model correctly classify \({x^\mathrm{{adv}}}\). Finally, we move each sample in S in the direction that reduces the model’s classification error on \(x^\mathrm{{adv}}\) without affecting its accuracy on the test set.

The three steps of RSDA are illustrated with a toy example in Fig. 1. The approach stems from the observation that, for each adversarial sample, there may be some training samples that shape the decision boundary of the model near that adversarial sample. Identifying these samples and applying targeted alterations to them can help the model classify the corresponding adversarial sample correctly. To implement RSDA, we therefore need to solve three issues:

  1. First, we need to generate the adversarial samples on the fly during training. For each training sample \(x \in D_\mathrm{{tr}}\), we can move in the gradient direction that maximizes the loss (as in FGSM), or we can use stronger attack methods (like PGD or BIM). We discuss this in Sect. 3.1.

  2. For each adversarial sample \({x^\mathrm{{adv}}}\), we need to find the influential training set S. Finding S should be done efficiently, without retraining the model for every sample. In Sect. 3.1.1, we discuss our approach for finding S.

  3. Finally, we need to find the direction in which to move the influential samples. The movement should not only lead to classifying the adversarial examples correctly, but also preserve the loss on clean samples as much as possible. The possible directions in which such samples can be moved are discussed in Sect. 3.1.2.

Fig. 1

Our proposed RSDA approach presented on a toy example. a A binary classifier separating two classes, o and x. b Executing an adversarial attack on a sample from the o class, resulting in an adversarial sample \({x^\mathrm{{adv}}}\) (in red) in the x class. c Finding the influential set S (in blue) for the adversarial sample, and finding the proper directions (blue arrows) in which to move these influential samples. d Moving the influential samples and retraining the model so that the adversarial sample is classified correctly into the o class (the old decision boundary is shown with a dashed line and the new decision boundary with a solid line)

3.1 Finding Adversarial Samples

Several attack methods have been introduced in the literature to find \(x^\mathrm{{adv}}\). In our work, we use and compare three such techniques. The simplest is the fast gradient sign method (FGSM) [5]. It tries to maximize the loss function by computing the gradient of the loss with respect to the input sample and updating the sample along the direction of that gradient, with a restriction on the \(L_{\infty }\) norm of the perturbation so that the difference between the adversarial and the clean sample is imperceptible. The \(L_{\infty }\) norm is used over other \(L_{p}\) norms because it restricts the maximum amount of perturbation added to any coordinate, which makes the perturbation less perceptible, especially for images; the attack can easily be generalized to other \(L_{p}\) norms. Mathematically:

$$\begin{aligned} x^\mathrm{{adv}} = x + \epsilon * \mathrm{{sign}}(\nabla _{x}J(h(x), y_\mathrm{{true}})), \quad \mathrm{{s.t.}}\ \left\| x^\mathrm{{adv}} - x \right\| _{\infty } \le \Delta . \end{aligned}$$
(1)
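For concreteness, a minimal PyTorch-style sketch of Eq. (1) is given below. The model, the loss, and the clamping to the [0, 1] pixel range (matching the normalization used in Sect. 5) are illustrative assumptions rather than part of the method's definition.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y_true, epsilon):
    """One-step FGSM: perturb x along the sign of the loss gradient (Eq. 1)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Move in the direction that increases the loss, bounded in L_inf by epsilon.
    x_adv = x_adv.detach() + epsilon * grad.sign()
    return x_adv.clamp(0.0, 1.0)  # keep pixels in the valid [0, 1] range
```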

The second method that we employ is a stronger iterative attack, the basic iterative method (BIM) [11]. It is similar to FGSM but runs for multiple iterations, creating iterative perturbations as:

$$\begin{aligned} x_{0}^\mathrm{{adv}} = x \end{aligned}$$
(2)
$$\begin{aligned} x_{t+1}^\mathrm{{adv}} = x_{t}^\mathrm{{adv}} + \alpha * \mathrm{{sign}}(\nabla _{x_{t}^\mathrm{{adv}}}J(h(x_{t}^\mathrm{{adv}}), y_\mathrm{{true}})) \end{aligned}$$
(3)
$$\begin{aligned} x_{t+1}^\mathrm{{adv}} = \mathrm{{clip}}(x_{t+1}^\mathrm{{adv}}, x - \epsilon , x + \epsilon ), \end{aligned}$$
(4)

where t is the iteration number, \(\alpha\) is the attack step size (the attack learning rate), \(\epsilon\) bounds the overall perturbation, and \(\mathrm{{clip}}(\mathrm{{input}}, a, b)\) restricts the adversarial sample to reside in the range [a, b]; how this constraint is satisfied depends on the \(L_{p}\) norm used.

Finally, we also use the Projected Gradient Descent (PGD) attack [11], which also works iteratively and restricts the maximal perturbation by projecting the perturbed sample back into the feasible region. Unlike BIM, which initializes the first point at the original sample, PGD initializes the first point randomly within the region around the original sample. This noisy initialization leads to a stronger attack that converges better [11].
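A minimal sketch of the iterative attack of Eqs. (2)-(4), under the same placeholder assumptions as the FGSM sketch above, is shown below; setting `random_start=True` gives the PGD variant, while `random_start=False` recovers BIM.

```python
import torch
import torch.nn.functional as F

def iterative_attack(model, x, y_true, epsilon, alpha, steps, random_start=False):
    """BIM (random_start=False) or PGD (random_start=True) under the L_inf norm."""
    x_adv = x.clone().detach()
    if random_start:
        # PGD: start from a random point inside the epsilon-ball around x.
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the epsilon-ball around the clean sample x (Eq. 4).
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```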

3.1.1 Finding Influential Samples

Among the approaches discussed in the related work, we study three potential candidates for finding the influential set S:

  1. Influence functions [19], which measure the change in the parameters of the model when a training sample is removed from the training set (the counterfactual question discussed in Sect. 2.3), and approximate the effect of removing that training sample on the loss of a test sample. One problem with this method is that it assumes the neural network is optimally trained before the influence is computed, an assumption that can rarely be relied upon for deep neural networks. We tested this method for finding influential samples for adversarial samples, but it is extremely slow compared to the other two methods discussed here. Considering that we find adversarial samples and their influential samples iteratively during training, we decided to exclude this method from the experiments.

  2. Representer Point Selection [20], which decomposes the pre-softmax logits into a linear combination of training-point activations. The representer values are the weights in this linear combination and correspond to the importance of each training sample: a positive representer value indicates an excitatory influence of the training sample on the prediction of a test sample, while a negative value indicates an inhibitory influence. The decomposition is \(\phi (x_{t},\theta ^*) = \sum _{i}^{n} k(x_{t}, x_{i}, a_{i})\), where \(\theta ^*\) are the optimal parameters of the model, n is the total number of training samples, \(\phi\) is the logits layer of the neural network, \(a_{i} = -\frac{1}{2 \lambda n }\frac{\partial L(x_{i},y_{i},\theta )}{\partial \phi (x_{i},\theta )}\), \(k(x_{t},x_{i},a_{i}) = a_{i}f^{T}_{i} f_{t}\), and \(f_{i}\) is the last intermediate feature layer for input \(x_{i}\). Given such a representer theorem, \(k(x_{t},x_{i},a_{i})\) can be seen as the contribution of the training sample \(x_{i}\) to the test prediction for \(x_{t}\). This formulation provides insight both into why the neural network prefers a particular prediction and into why it does not, which is typically difficult to obtain from other sample-based explanations. The approach works for any model with a linear matrix multiplication before the activation \(\sigma\) and with L2 weight decay applied as a regularization term in the loss function.

  3. Estimating training data influence with TracIn (tracing gradient descent) [21], which approximates the influence of a training sample on a test sample by measuring how much the loss at the test point changes when the model is trained on that particular training sample. Training samples that reduce the loss are proponents, and samples that increase the loss are opponents. To avoid replaying the training process for each training sample, a first-order approximation over saved checkpoints is used, and the influence of a training sample z on a test sample \({z}'\) becomes \(\mathrm{{TracInCP}}(z,{z}') =\sum _{i=1}^{k}\eta _{i} \nabla L(w_{t_i},z) \cdot \nabla L(w_{t_i},{z}')\), where k is the number of checkpoints and \(\eta _{i}\) is the step size used between checkpoints \(i-1\) and i. A minimal sketch of this approximation is given after the list.
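The sketch below, as referenced in the list, illustrates the TracInCP approximation; the checkpoint list, per-checkpoint learning rates, and helper names are illustrative assumptions on our part, not the reference implementation of [21].

```python
import torch

def flat_grad(model, loss_fn, sample, label):
    """Flattened gradient of the loss at one (sample, label) pair w.r.t. all model parameters."""
    loss = loss_fn(model(sample.unsqueeze(0)), label.unsqueeze(0))
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_cp(checkpoints, lrs, loss_fn, z_train, y_train, z_test, y_test):
    """TracInCP(z, z') = sum_i eta_i * grad L(w_i, z) . grad L(w_i, z') over saved checkpoints."""
    score = 0.0
    for model, eta in zip(checkpoints, lrs):
        g_train = flat_grad(model, loss_fn, z_train, y_train)
        g_test = flat_grad(model, loss_fn, z_test, y_test)
        score += eta * torch.dot(g_train, g_test).item()
    return score  # positive: proponent (excitatory); negative: opponent (inhibitory)
```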

We use the same naming convention as representer point selection [20]. We call the training samples in S that have a positive influence on the prediction of a test sample for a particular class excitatory samples, and those with a negative influence inhibitory samples. Under this terminology, the excitatory samples for a test sample excite the activation values and push the prediction towards a particular class, while the inhibitory samples suppress the activation values and lead the network away from that class.

3.1.2 Finding Movement Direction

Since we aim for RSDA to work for any data modality, performing label-preserving augmentation in the input space would be inappropriate for discrete data such as text and graphs. Therefore, we apply modality-agnostic transformations in the latent space. The idea is to find a direction in the latent space, move the sample along it, and augment the training data with the newly obtained sample so that robustness increases without changing the class identity.

Thus, to achieve robust automatic data augmentation, we need to learn a continuous latent space in which class-preserving transformations can be applied, and we need to find an effective movement direction. We leverage and compare the following four latent-space transformations on the influential samples to find the best movement direction (a code sketch of these transformations is given after the list):

  1. Excitatory samples interpolation: Let \(z_{i}^{c}\) be the latent representation of the sample \(x_{i}\) in class c, and let \(z_{i_\mathrm{{adv}}}^{{c}'}\) be the latent representation of the adversarial sample obtained from an adversarial attack on \(x_{i}\) and classified in the adversarial class \({c}'\). Let \(U= \left\{ u \right\} _{j=1}^{q}\) be the latent representations of the q excitatory samples for \(x_{i_\mathrm{{adv}}}^{{c}'}\) (U contains the top q samples that lead to the adversarial sample being classified in class \({c}'\)). Let \(P= \left\{ p \right\} _{l=1}^{q}\) be the latent representations of the q excitatory samples for \(x_{i_\mathrm{{adv}}}^{c}\) (P contains the top q samples that lead to the adversarial sample being classified in the correct class c). For every sample \(u_{i} \in U\), we find the nearest sample \(p \in P\) such that \(u_{i}\) and p have the same ground-truth class, and then move \(u_{i}\) towards p. Excitatory samples interpolation is therefore expressed as:

    $$\begin{aligned} \hat{u_{i}} = u_{i} + \lambda _{1}( p- u_{i}) \end{aligned};$$
    (5)
    $$\begin{aligned} p = \mathrm{{arg}}\max _{p \in P^{c}}\frac{p \cdot u_{i}}{\left\| p \right\| \left\| u_{i} \right\| } \end{aligned}$$
    (6)

    where \(P^{c}\) denotes the items from P that have the same ground-truth class as \(u_{i}\), and \(\lambda _{1}\) is a scaling factor.

  2. Inhibitory samples interpolation: Let \(\Upsilon = \left\{ \upsilon \right\} _{j=1}^{q}\) be the latent representations of the q inhibitory samples for \(x_{i_\mathrm{{adv}}}^{{c}'}\) (\(\Upsilon\) contains the top q samples that prevent the adversarial sample from being classified in the adversarial class \({c}'\)). Let \(K= \left\{ k \right\} _{l=1}^{q}\) be the latent representations of the q inhibitory samples for \(x_{i_\mathrm{{adv}}}^{c}\) (K contains the top q samples that prevent the adversarial sample from being classified in the correct class c). For every sample \(k_{i}\) from K, we find the nearest sample \(\upsilon\) in \(\Upsilon\) such that \(k_{i}\) and \(\upsilon\) have the same ground-truth class, and then move \(k_{i}\) towards \(\upsilon\). Inhibitory samples interpolation is therefore expressed as:

    $$\begin{aligned} \hat{k_{i}} = k_{i} + \lambda _{2} ( \upsilon - k_{i}) \end{aligned};$$
    (7)
    $$\begin{aligned} \upsilon = \mathrm{{arg}}\max _{\upsilon \in \Upsilon ^{c}}\frac{\upsilon \cdot k_{i}}{\left\| \upsilon \right\| \left\| k_{i} \right\| } \end{aligned}$$
    (8)

    where \(\Upsilon ^{c}\) denotes the items from \(\Upsilon\) that have the same ground-truth class as \(k_{i}\), and \(\lambda _{2}\) is a scaling factor.

  3. Excitatory samples class center interpolation and extrapolation: We move the samples that cause the adversarial sample to be classified incorrectly in a direction that reduces this influence, perturbing each excitatory sample \(u_{i}\) towards its class mean \(\mu _{c1}\):

    $$\begin{aligned} \hat{u_{i}} = u_{i} + \lambda _{3} ( \mu _{c1} -u_{i}) \end{aligned}$$
    (9)

    Conversely, we move the samples that cause the adversarial sample to be classified correctly in a direction that increases this influence, perturbing each excitatory sample \(p_{i}\) by extrapolating it away from its class mean \(\mu _{c2}\):

    $$\begin{aligned} \hat{p_{i}} = p_{i} + \lambda _{4}( p_{i} - \mu _{c2} ) \end{aligned}$$
    (10)
  4. Inhibitory samples class center interpolation and extrapolation: We move the samples that prevent the adversarial sample from being classified correctly in a direction that reduces this influence, perturbing each inhibitory sample \(k_{i}\) towards its class mean \(\mu _{c3}\):

    $$\begin{aligned} \hat{k_{i}} = k_{i} + \lambda _{5}( \mu _{c3} - k_{i} ) \end{aligned}$$
    (11)

    Conversely, we move the samples that prevent the adversarial sample from being classified incorrectly (i.e., in the adversarial class) in a direction that increases this influence, perturbing each inhibitory sample \(\upsilon _{i}\) by extrapolating it away from its class mean \(\mu _{c4}\):

    $$\begin{aligned} \hat{\upsilon _{i}} = \upsilon _{i} + \lambda _{6} ( \upsilon _{i} - \mu _{c4} ) \end{aligned}$$
    (12)
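The following sketch, referenced before the list, illustrates the interpolation moves of Eqs. (5)-(8) and the class-center moves of Eqs. (9)-(12). It assumes the latent vectors are stacked row-wise into PyTorch tensors that have already been restricted to a single ground-truth class (i.e., \(P^{c}\) or \(\Upsilon ^{c}\)); the function names are ours, and "nearest" is taken as highest cosine similarity.

```python
import torch
import torch.nn.functional as F

def interpolate_towards_nearest(U, P, lam):
    """Move each row of U towards its most cosine-similar counterpart in P (Eqs. 5-8).

    U: (q, d) latent samples to move; P: (q', d) candidate targets of the same class.
    """
    sim = F.normalize(U, dim=1) @ F.normalize(P, dim=1).t()  # pairwise cosine similarity
    nearest = P[sim.argmax(dim=1)]                           # closest same-class target
    return U + lam * (nearest - U)

def move_towards_class_mean(Z, class_mean, lam):
    """Interpolate latent samples towards their class center (Eqs. 9 and 11)."""
    return Z + lam * (class_mean - Z)

def move_away_from_class_mean(Z, class_mean, lam):
    """Extrapolate latent samples away from their class center (Eqs. 10 and 12)."""
    return Z + lam * (Z - class_mean)
```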

4 Model

Our goal is to train a classification model with different data augmentations in the latent space to achieve the best performance on clean and adversarial samples. Given an input space X and an output space Y, our classification model is divided into a feature generator \(G_{f}\), a classifier \(G_{y}\), and a discriminator \(G_{d}\). The feature generator takes a sample from the input space and maps it to the latent space, \(G_{f}(x,\theta ): X \rightarrow Z\), where Z is the latent space; the classifier then maps from Z to Y. During training, for each input sample \(x_{i} \in X\), we use the current state of \(G_{f}\) and \(G_{y}\) to generate the adversarial example \((x_{i}^\mathrm{{adv}}, y_{i}^\mathrm{{adv}})\), find the excitatory and inhibitory influential sample set S for \(x_{i}^\mathrm{{adv}}\) as explained above, and apply the label-preserving transformation operations to the latent representations of these influential samples to produce \({Z}'\). We employ three losses to produce the required augmentation results. The overall architecture is shown in Fig. 2, and the procedure is given in Algorithm 1.
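As an illustration only (the layer sizes below are placeholders and do not correspond to the architectures reported in Tables 1 and 4), the three components can be sketched as separate PyTorch modules:

```python
import torch.nn as nn

class FeatureGenerator(nn.Module):        # G_f : X -> Z
    def __init__(self, in_channels=1, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Classifier(nn.Module):              # G_y : Z -> Y
    def __init__(self, latent_dim=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_classes))

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):           # G_d : Z -> probability that z is Gaussian noise
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)
```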

4.1 Adversarial Loss

To produce class-preserving data augmentation, the latent space must support continuous transformations within each class, so that the produced data do not fall in invalid regions outside the desired class. This can be achieved by encouraging the latent samples to follow a Gaussian distribution, similarly to a VAE [22]. Hence, we add a discriminator \(G_{d}(z,\phi )\) to the model to discriminate between samples produced by the feature generator in the latent space and samples drawn from a Gaussian distribution. Through this game between the feature generator and the discriminator, the feature generator is forced to produce samples that appear to come from a Gaussian distribution in order to fool the discriminator. Thus, the adversarial loss of the feature generator, \(L_\mathrm{{adv}}(\theta )\), and of the discriminator, \(L_{G_{d}}(\phi )\), become

$$\begin{aligned} L_\mathrm{{adv}}(\theta )&= - \frac{1}{M} \sum _{i=1}^{M} \mathrm{{log}} G_{d}(z_{i}) \end{aligned}$$
(13)
$$\begin{aligned} L_{G_{d}}(\phi )&= - \frac{1}{M} \sum _{i=1}^{M} (\mathrm{{log}} G_{d}(\epsilon _{i}) + \mathrm{{log}} [1- G_{d}(z_{i}) ]) ; \forall \, \epsilon _{i} \sim N(0,I) \end{aligned}$$
(14)
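A minimal sketch of Eqs. (13) and (14), assuming a discriminator that outputs probabilities as in the module sketch above; the small constant inside the logarithms is our addition for numerical stability.

```python
import torch

def generator_adv_loss(G_d, z, eps=1e-8):
    """L_adv(theta), Eq. (13): push latent codes z = G_f(x) to look Gaussian to the discriminator."""
    return -torch.log(G_d(z) + eps).mean()

def discriminator_loss(G_d, z, eps=1e-8):
    """L_{G_d}(phi), Eq. (14): distinguish Gaussian noise from generated latent codes."""
    noise = torch.randn_like(z)                      # epsilon_i ~ N(0, I)
    real_term = torch.log(G_d(noise) + eps)          # noise should score high
    fake_term = torch.log(1.0 - G_d(z.detach()) + eps)  # generated codes should score low
    return -(real_term + fake_term).mean()
```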

4.2 Triplet Loss

As mentioned before, to accomplish label-preserving robust transformations, we need to decrease the probability that transformed data fall outside the class hull in the latent space, i.e., outside the region corresponding to valid samples of that class. Thus, the class distribution in the latent space should be as compact as possible. The triplet loss helps achieve this by pulling latent representations of the same class together and pushing representations of other classes away [23]. The triplet loss operates on three inputs: an anchor \(z^{a}\) (in our case the latent representation of the sample), a positive sample \(z^{p}\), which is the latent representation of a sample of the same class as the anchor, and a negative sample \(z^{n}\), whose class differs from the anchor's. It minimizes the distance in the feature space between the anchor and the positive sample while maximizing the distance between the anchor \(z^{a}\) and the negative sample. Mathematically, the triplet loss is defined as follows:

$$\begin{aligned} L_\mathrm{{trip}}=\sum _{z^{a},z^{p},z^{n}}\max (d({z^{a},z^{p}})- d({z^{a},z^{n}}) + m, 0), \end{aligned}$$
(15)

where m is the margin by which the distance between the anchor and the negative sample must exceed the distance between the anchor and the positive sample.
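In PyTorch, Eq. (15) corresponds directly to the built-in triplet margin loss; a minimal sketch, assuming the anchor, positive, and negative latent codes have already been mined from the batch:

```python
import torch.nn as nn

# d(.,.) is taken here as the Euclidean distance; the margin value is illustrative.
triplet = nn.TripletMarginLoss(margin=1.0, p=2)

def triplet_term(z_anchor, z_positive, z_negative):
    """Eq. (15): pull same-class latents together, push other-class latents at least `margin` apart."""
    return triplet(z_anchor, z_positive, z_negative)
```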

4.3 Classification Loss

The ultimate goal is for the input sample to be classified correctly, whether it is adversarial or clean. As mentioned before, the feature generator \(G_{f}\) operates on the input sample to obtain \(z_{i}\); we then augment \(z_{i}\) using one of the augmentation strategies to obtain \({z}'\), and the classifier \(G_{y}\) takes \({z}'\) as input and produces the final class for the input sample. The classifier consists of several fully connected layers. The classification loss is thus given by:

$$\begin{aligned} L_\mathrm{{cls}}=\sum _{(x_{i},y_{i})\in (D_{c} \cup D_{a}^{d})} H(G_{y}({z}'),y_{i}) \end{aligned}$$
(16)

Here, H(.) is the classification loss function, for which we use the cross-entropy loss, and \({z}'\) is the latent representation of any sample from the clean domain \(D_{c}\) or the adversarial domain \(D_{a}^{d}\) (Fig. 2).

Fig. 2

The architecture of the proposed method

Our final objective function is as follows:

$$\begin{aligned} \min _{\theta _{g},\theta _{c}}\left\{ \lambda _{7} L_\mathrm{{cls}} +\lambda _{8} L_\mathrm{{adv}} + \lambda _{9} L_\mathrm{{trip}} \right\} \end{aligned}$$
(17)

In addition, the discriminator \(G_{d}(z,\phi )\) is trained to minimize the discriminator loss \(L_{G_{d}}(\phi )\) given in Sect. 4.1.

Algorithm 1
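To complement Algorithm 1, the simplified sketch below shows one RSDA training step. It reuses the functions from the earlier sketches (`iterative_attack`, `generator_adv_loss`, `discriminator_loss`, `triplet_term`); `transform_influential` and `sample_triplets` are placeholder names standing in for the influence-scoring and latent-transformation steps of Sects. 3.1.1-3.1.2 and for triplet mining, respectively. It is an outline of the procedure under these assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def rsda_step(G_f, G_y, G_d, opt_model, opt_disc, x, y,
              epsilon, alpha, steps, lambdas,
              transform_influential, sample_triplets):
    """One simplified RSDA training step on a batch (x, y); lambdas = (lambda_7, lambda_8, lambda_9)."""
    # 1) Generate adversarial samples on the fly with the current model state.
    x_adv = iterative_attack(lambda v: G_y(G_f(v)), x, y,
                             epsilon, alpha, steps, random_start=True)

    # 2)+3) Embed clean and adversarial samples, then transform the latent codes
    # of the influential samples selected for x_adv (placeholder helper).
    z = G_f(torch.cat([x, x_adv]))
    y_all = torch.cat([y, y])
    z_aug = transform_influential(z, y_all)

    # Combined objective of Eq. (17): classification + adversarial + triplet losses.
    anchor, pos, neg = sample_triplets(z_aug, y_all)
    loss = (lambdas[0] * F.cross_entropy(G_y(z_aug), y_all)
            + lambdas[1] * generator_adv_loss(G_d, z)
            + lambdas[2] * triplet_term(anchor, pos, neg))
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()

    # Separately update the discriminator G_d with Eq. (14).
    d_loss = discriminator_loss(G_d, G_f(x).detach())
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()
    return loss.item(), d_loss.item()
```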

5 Experiments

In principle, since we apply the augmentation in the latent space, our method can be applied to any dataset and any adversarial attack. For comparison against adversarial training, we focus on two datasets, MNIST [24] and CIFAR10 [25], and three adversarial attacks: PGD, BIM, and FGSM. For all experiments, we normalize the pixel values to the range [0, 1].

5.1 Experiment Setup

In every training iteration, we use FGSM, PGD, and BIM to generate three untargeted adversarial samples on the fly. To evaluate the method's performance, we compare RSDA with:

  1. Normal Training (NT) with the cross-entropy loss [26] on the clean training data.

  2. Adversarial Training (AT) with the cross-entropy loss on the clean training data and the adversarial examples from FGSM, PGD, and BIM.

For each dataset, we train a vanilla model (NT), an adversarially trained model (AT), and models with the four kinds of selective augmentation presented above, using a perturbation budget \(\epsilon\) for the generated adversarial samples, and evaluate these models on FGSM, PGD, and BIM attacks bounded by the same \(\epsilon\). We use the \(L_{\infty }\) norm as the measure of perturbation in all attacks. The experiments were run on a single GeForce GTX 1080 Ti.

In principle, any conventional image classification model can be used. The feature generator comprises a stack of convolutional layers, while the discriminator and classifier comprise stacks of fully connected layers. While any optimization method could be used for training, we choose Adam [27] for all components, with a batch size of 1024, 200 epochs, and (\(\beta _{1} = 0.9, \beta _{2} = 0.99\)). The learning rate starts at 0.005 and is decayed by a factor of 2 every 30 epochs. After training, the discriminator can be removed, and the robust feature generator together with the classifier is used in place of a conventional image classifier.
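For reference, this training configuration corresponds roughly to the following PyTorch setup, assuming the modules from the earlier sketch; reading "decayed by 2 every 30 epochs" as halving the learning rate every 30 epochs is our interpretation.

```python
import torch

# Assuming G_f, G_y, G_d are instances of the modules sketched in Sect. 4.
model_params = list(G_f.parameters()) + list(G_y.parameters())
opt_model = torch.optim.Adam(model_params, lr=0.005, betas=(0.9, 0.99))
opt_disc = torch.optim.Adam(G_d.parameters(), lr=0.005, betas=(0.9, 0.99))

# Halve the learning rate every 30 epochs.
sched_model = torch.optim.lr_scheduler.StepLR(opt_model, step_size=30, gamma=0.5)
sched_disc = torch.optim.lr_scheduler.StepLR(opt_disc, step_size=30, gamma=0.5)
# Train for 200 epochs with batch size 1024, calling scheduler.step() once per epoch.
```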

5.2 Experimental Results

5.2.1 Results on MNIST

Since MNIST is a comparatively easy dataset, we use simple network architectures for the different components, as shown in Table 1.

Table 1 Components architecture for MNIST

The allowed adversarial perturbation \(\epsilon\) in this case is 0.3, and the maximum number of iterations for BIM and PGD is 30.

Finding influential samples To examine the effectiveness of finding the influential sample set S, we present an example of a test image of digit 2 and perform an untargeted PGD attack, which yields an adversarial sample classified by the classifier as digit 1. The first thing we notice in Fig. 3 is that the majority of the excitatory samples for the clean test sample, using both Representer and TracIn, are from the same class (digit 2 in our example), which is reasonable since samples from this class shape the decision boundary near the test sample. More interestingly, of the four most inhibitory samples for the clean test sample, three (using both Representer and TracIn) are from digit 1, which is the class of the untargeted adversarial sample derived from the test sample. These same samples appear among the most excitatory samples for classifying the adversarial sample into the adversarial class 1, as shown in Fig. 4, and among the most inhibitory samples preventing the adversarial sample from being classified into the clean class, as shown in Fig. 5. This is more evident with TracIn than with Representer. We repeated this experiment many times and observed the following pattern.

  • The adversarial class for a particular clean sample appears noticeably among the inhibitory samples for that clean sample.

  • Among the inhibitory samples for the adversarial sample being classified in the adversarial class (Fig. 4), the majority are from class 2 (the clean class) and class 6; in this case, class 6 is the second most probable class for an untargeted adversarial attack after class 1. (The untargeted attack finds the easiest way to fool the model, which here leads to an adversarial sample from original class 2 being classified as class 1, so the easiest adversarial class is 1; the next easiest is 6.)

  • This means that what prevents the adversarial sample from being classified in the adversarial class are the excitatory samples for the clean class and the excitatory samples for the second most likely adversarial class. This supports our proposed method, since magnifying the clean sample's excitatory samples not only increases the accuracy on clean samples but also reduces the effect of adversarial samples.

Fig. 3

The test sample from class 2 with the excitatory and the inhibitory samples

Fig. 4

The adversarial sample classified in class 1, together with the excitatory samples that push it towards the adversarial class 1 and the inhibitory samples that prevent it from being classified into the adversarial class 1

Fig. 5

The adversarial sample classified in class 1 with the excitatory samples leading to classifying the sample correctly as a clean sample from class 2 and the inhibitory samples that prevent the adversarial sample from being classified correctly as a clean sample from class 2

Finding movement direction We set q to 10. Enlarging q improves performance, but training the model takes more time. We tested different values of \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), and \(\lambda _{4}\). There is clearly a tradeoff between robustness and accuracy here: as we increase \(\lambda\), we move farther from the real latent sample and decrease the effect of the samples that cause the adversarial sample to be classified incorrectly, but at the same time we produce less realistic latent representations. We choose \(\lambda _{1} = \lambda _{2} = \lambda _{3} = \lambda _{4} = 0.2\).

Results To determine the best method for choosing the influential samples, we repeat the experiments using both Representer and TracIn. The accuracy results are reported in Table 2 for TracIn and in Table 3 for Representer, with the best results shown in bold. With both methods, NT has the best accuracy on clean data but the worst robustness, i.e., accuracy on adversarial samples. The accuracy on clean samples is almost the same for AT and for our method. The best accuracy is achieved with excitatory samples class center interpolation and extrapolation. Our methods significantly reduce the drop in clean accuracy. From Tables 2 and 3, we notice that both Representer and TracIn give good results, although TracIn works better when the augmentation is related to inhibitory samples.

Table 2 Accuracy results on MNIST using TracIn
Table 3 Accuracy results on MNIST using Representer

5.2.2 Results on CIFAR10

Since classifying CIFAR10 is harder than MNIST, we use the VGG architecture. The convolutional layers compose the feature generator, the last fully connected layers form the classifier, and the discriminator is unchanged from MNIST, as shown in Table 4.

Table 4 Components architecture for CIFAR10

Here, the maximum allowed perturbation is \(\epsilon = 0.031\) and the number of iterations for BIM and PGD is 30.

Finding influential samples We repeat the same experiment performed on MNIST: we present an example of a test sample from class 9 (truck) and perform an untargeted PGD attack, which yields an adversarial sample classified by the classifier as class 1 (automobile). The observations made on MNIST are confirmed here. We notice in Fig. 6 that the majority of the excitatory samples for the clean test sample, using both Representer and TracIn, are from the same class (truck). We also observe that of the four most inhibitory samples for the clean test sample, three (using Representer) and four (using TracIn) are from class 1 (automobile), which is the class of the untargeted adversarial sample derived from the test sample. These same samples appear among the most excitatory samples for classifying the adversarial sample into the adversarial class 1, as shown in Fig. 7, and among the most inhibitory samples preventing the adversarial sample from being classified into the clean class, as shown in Fig. 8 (again, more evident with TracIn than with Representer). We repeated this experiment many times and observed the same tendency: the adversarial class appears noticeably among the inhibitory samples for the clean sample. Among the inhibitory samples for the adversarial sample being classified in the adversarial class (Fig. 7), the majority are from class 9 (the clean class) and class 8 (ship); in this case, class 8 is the second most probable class for an untargeted adversarial attack after class 1.

Fig. 6

The test sample from class nine (truck) with the excitatory and the inhibitory samples

Fig. 7

The adversarial sample classified in class 1 (automobile), together with the excitatory samples that push it towards the adversarial class 1 and the inhibitory samples that prevent it from being classified into the adversarial class 1

Fig. 8

The adversarial sample classified in class 1 (automobile) with the excitatory samples leading to classifying the sample correctly as a clean sample from class 9 (The truck class) and the inhibitory samples that prevent the adversarial sample from being classified correctly as a clean sample from class 9

Finding movement direction The same configuration is applied as for MNIST.

Results The accuracy results are reported in Table 5 for TracIn and in Table 6 for Representer. With both methods, NT has the best accuracy on clean data but the worst robustness, i.e., accuracy on adversarial samples. Our augmentations enhance the performance of AT on both clean and adversarial samples, with the best accuracy achieved by excitatory samples class center interpolation and extrapolation, and they significantly reduce the drop in clean accuracy. From Tables 5 and 6, we notice that both Representer and TracIn give good results, although TracIn works better when the augmentation is related to inhibitory samples.

Table 5 Accuracy results on CIFAR using TracIn
Table 6 Accuracy results on CIFAR using representer

6 Ablation Study

We conduct an ablation study to show the effect of the additional losses we propose, and we also show the impact of adding these losses to AT in the first row of Table 7. We use the excitatory samples class center interpolation and extrapolation augmentation and TracIn for finding the influential samples. The experiments on CIFAR show that both the adversarial loss and the triplet loss have a positive impact on the model's performance against FGSM adversarial attacks, in both AT and RSDA. However, the impact of these losses is larger in RSDA, since our method takes advantage of them to provide correct, robust augmentation in the latent space.

Table 7 Comparison of model’s average accuracy on FGSM adversarial samples when trained with different losses

7 Conclusion

In this paper, we design an RSDA approach to boost the performance of adversarial training on both adversarial and clean samples. The proposed approach reduces the effect of adversarial attacks by finding robust transformations in the feature embedding space. The role of the samples that affect the robustness of the model is analyzed and validated experimentally. The experimental results show that RSDA increases the generalization of the model on both clean and adversarial samples.