In the following we report the results of an extensive set of experimental validations. First, we conduct an ablation study to investigate the effect of each component of our defense, evaluated under the generalized BPDA attack. Then, we provide a comprehensive view by reporting results on three different white-box attacks, namely DeepFool [38], C&W [8], and a modified version of BPDA equipped with EOT, which we refer to as EOT-BPDA [4]. The latter is specifically tailored to address all aspects of our defense; in particular, we use BPDA to approximate the non-differentiable JPEG operator, in conjunction with EOT to tackle the randomization applied to the quality factors. We finally explore two strong black-box attacks, namely Nattack [32] and SquareAttack [3]. We performed the evaluation using ResNet50 [24] pre-trained on ImageNet. For BPDA and EOT-BPDA, in order to showcase the versatility of our defense, we also evaluate DenseNet [26] and MobileNet [25]. To perform the attacks, we used the implementations of the DeepRobust PyTorch library [31], except for DeepFool and SquareAttack, for which we used the original code releases.
All the experiments are conducted under the most difficult possible setting. For white-box attacks, i.e., BPDA, C&W, and DeepFool, the attacker has full access to all the network parameters and gradients, including the classification module and the AR-GAN (for each QF). Black-box attacks do not exploit gradients; however, we provide the attacker with the full model, defense included.
4.1 Dataset and Common Evaluation Metrics
We evaluate our defense strategy on the ILSVRC validation set [42]. Following the protocol of previous works, e.g., [32, 39], we randomly choose a subset of 1,000 images (one image per class) from the validation set such that the classifier achieves 100% rank-1 accuracy on clean, non-attacked images. As noted in [39], it would be pointless to use already misclassified images to evaluate the defense, since an attack on a misclassified image is successful by definition.
To assess the effectiveness of our approach, we evaluate it using several standard metrics. First, we measure the degradation, that is, the impact of our defense strategy on the classification accuracy when applied to clean images. To obtain a comprehensive view of attack and defense performance, when possible we follow the evaluation criteria suggested in the recent works of Dong et al. [15] and Carlini et al. [7]: for white-box attacks, we let the attack run until a fixed perturbation budget is reached, and report results in terms of accuracy vs. perturbation budget. For black-box attacks, we fix the perturbation budget and let the attack run until a predefined number of queries is reached.
In order to show that our defense works properly for both the widely used \(\ell _2\) and \(\ell _\infty\) metrics, we use both, depending on the attack. Following the standard convention, e.g., [4, 15], we use the \(\ell _2\) norm normalized by the total number of pixels. The standard thresholds for these metrics are \(\ell _2 = 0.005\) and \(\ell _{\infty } = 0.031 = 8/255\). To demonstrate the robustness of our defense, we also consider a less strict constraint for the attacks, using \(\ell _2 = 0.01\) and \(\ell _{\infty } = 0.062 = 16/255\). The detailed configuration of each attack is given in the following.
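For concreteness, a minimal sketch of how these perturbation sizes can be computed (assuming PyTorch image tensors in [0, 1]; the function name is illustrative and the \(\ell _2\) normalization follows the per-pixel convention described above) is:

```python
import torch

def perturbation_size(x_clean: torch.Tensor, x_adv: torch.Tensor):
    """Perturbation metrics for images in [0, 1].

    Sketch following the convention above: the l_2 norm is normalized by the
    total number of pixels, while l_inf is the maximum absolute difference.
    Tensor shapes and names are illustrative assumptions.
    """
    delta = (x_adv - x_clean).flatten()
    l2_normalized = delta.norm(p=2) / delta.numel()  # l_2 normalized by pixel count
    linf = delta.abs().max()                         # l_inf, e.g., 8/255 ~= 0.031
    return l2_normalized.item(), linf.item()
```

An adversarial example is then considered within budget when, for instance, its \(\ell _\infty\) value does not exceed 0.031 or 0.062.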
4.2 Ablation Study
In this section we analyze each module of our defense separately, using the BPDA attack as a reference. BPDA is a powerful attack that estimates obfuscated gradients by approximation; these are then used to craft adversarial images in a gradient-descent fashion. Even though it can be extended with EOT in the case of randomized defenses, here we use the generalized BPDA attack. In fact, the goal of this ablation study is to show that BPDA is strong enough to break the JPEG compression defense in all cases, even when JPEG is followed by the AR-GAN restoration module. To make our contribution explicit, we will show that randomly changing the AR-GAN parameters according to the current quality factor is fundamental to making the defense robust.
To provide a more thorough analysis, we evaluate the robustness of our solution by applying the defense and classifying adversarial images generated at different perturbation thresholds. We set the maximum perturbation to \(\ell _{\infty } = 0.062 = 16/255\). In particular, we let the attack run until the budget or the maximum number of iterations (200) is reached, using a step size of \(10^{-3}\), and collect the results of the defense at each iteration of the attack. These experiments have been conducted with ResNet50 on a subset of 200 randomly chosen, correctly classified images of the ILSVRC validation set. In all cases, the gradient of the JPEG operator is approximated by implementing the backward pass as the identity function, as done in the original article [4].
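For clarity, a minimal PyTorch sketch of this identity approximation (a generic straight-through wrapper; the names are illustrative and not taken from the DeepRobust implementation) is:

```python
import torch

class IdentityBackward(torch.autograd.Function):
    """Straight-through wrapper: the forward pass applies a non-differentiable
    transform (e.g., JPEG compression), the backward pass passes gradients
    through unchanged, i.e., the transform is approximated by the identity."""

    @staticmethod
    def forward(ctx, x, transform):
        return transform(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is the incoming gradient; no gradient for transform.
        return grad_output, None

def bpda_apply(x, transform):
    return IdentityBackward.apply(x, transform)
```

During the attack, the gradient of the classification loss then reaches the input exactly as if the wrapped transformation were absent, which is why the approximation is effective whenever the transformation roughly preserves its input.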
4.2.1 JPEG Compression.
First, we analyze the effect of different quality factors when using JPEG compression alone as a defense. As discussed, BPDA approximates the backward pass of the JPEG operator with the identity function. This trick is effective whenever, for a non-differentiable operator \(h()\), we have \(h(\mathbf {x}) \approx \mathbf {x}\). Hence, to break this attack one must ensure that \(h(\mathbf {x})\) deviates substantially from \(\mathbf {x}\). One can achieve this by applying a strong compression (e.g., QF = 5); however, as reported in Figure 3 (left), in this case the accuracy on clean images drops dramatically. Overall, this verifies that BPDA can circumvent JPEG, and that approximating the backward pass with the identity function is indeed effective.
Multiple Compression Steps: Before analyzing the effect of introducing the AR-GAN, we first verify how performing multiple JPEG compressions impacts the results. It can be observed from Figure 3 (middle) that iterating the JPEG compression \(J(\mathbf {x})\) does not provide any improvement. In fact, even after a cascade of 3 compressions we still have \(J(J(J(\mathbf {x}))) \approx \mathbf {x}\), and the identity approximation performed by BPDA circumvents the defense.
Randomization: The other module of our defense consists in randomizing the JPEG quality factor at each iteration. We randomize the QF within the range \(\delta _{q} = [20, 60]\). Differently from the previous cases, adding the randomization proves effective, which supports our intuition. Still, JPEG alone does not yet provide enough robustness.
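A minimal sketch of this randomized compression step (using PIL for the JPEG round-trip; names and tensor conventions are illustrative) is:

```python
import io
import random

import torch
from PIL import Image
from torchvision.transforms import functional as TF

def random_jpeg(x: torch.Tensor, qf_range=(20, 60)):
    """Compress a [0, 1] CxHxW image with a randomly drawn JPEG quality factor.

    The QF is re-sampled at every call, so consecutive invocations of the
    defense see different compressions. Returns the compressed image and the
    sampled QF.
    """
    qf = random.randint(*qf_range)
    buffer = io.BytesIO()
    TF.to_pil_image(x.clamp(0, 1)).save(buffer, format="JPEG", quality=qf)
    buffer.seek(0)
    compressed = Image.open(buffer).convert("RGB")
    return TF.to_tensor(compressed), qf
```

Returning the sampled QF alongside the compressed image makes it possible to select the matching AR-GAN in the full defense described in Section 4.2.2.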
4.2.2 AR-GAN Restoration.
In this section, we explore how restoring the compressed images with the AR-GAN impacts the defense.
Different Quality Factors: In Figure 4 (left) we show that for lower quality factors, e.g., 5 or 10, the restoration step improves the accuracy under strong perturbations, although we still cannot achieve perfect classification on clean images, making this strategy impractical. The curves instead drop faster for larger quality factors, despite an almost perfect accuracy on non-attacked samples. This shows that BPDA can effectively approximate the input transformation and break the defense.
Iterating the process: Figure 4 (middle) shows the effect of iterating the compression-restoration steps with a fixed quality factor (QF = 40). Differently from the case of JPEG alone, repeating the compression-restoration process provides a slight increase in robustness. However, the defense accuracy is still low, i.e., around 25% classification accuracy at \(\ell _\infty = 0.031\).
Full Defense: Finally, Figure 4 (right) shows the effect of randomizing the quality factor for the JPEG compression and changing the AR-GAN parameters accordingly. We randomly sample the quality factors in the range \(\delta _q = [20, 60]\) at each iteration of the defense. To restore the image, we use three different AR-GANs, trained with quality factors of 20, 40, and 60. At each iteration, we pick the AR-GAN whose training quality factor is closest to the current compression factor. The complete defense with ResNet50 obtains an accuracy of \(80.3\%\) at a perturbation of \(\ell _\infty = 0.031\), and of \(59.4\%\) at a perturbation of \(\ell _\infty = 0.062\).
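Putting the pieces together, a sketch of one iteration of the full defense (reusing the `random_jpeg` helper above; `ar_gans` is a hypothetical dictionary mapping the training quality factors 20, 40, and 60 to the corresponding restoration networks) is:

```python
import torch

def defend(x: torch.Tensor, ar_gans: dict, qf_range=(20, 60)) -> torch.Tensor:
    """One iteration of the full defense: random-QF JPEG compression followed
    by restoration with the AR-GAN trained on the closest quality factor.

    `ar_gans` maps a training QF (e.g., 20, 40, 60) to its generator network;
    names and structure are illustrative.
    """
    compressed, qf = random_jpeg(x, qf_range)
    # Pick the AR-GAN whose training QF is closest to the sampled one.
    closest_qf = min(ar_gans.keys(), key=lambda k: abs(k - qf))
    with torch.no_grad():
        restored = ar_gans[closest_qf](compressed.unsqueeze(0)).squeeze(0)
    return restored.clamp(0, 1)
```

The classifier is then applied to the restored image; since both the sampled QF and the selected generator change at every call, two consecutive invocations of the defense generally follow different computational paths.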
To help the reader appreciate the extent of the distortions, in Figure 5 we show some examples of images corrupted by the attack when different defense methods are applied. Our solution forces the attacker to inject significantly stronger noise to make the network predict a wrong class (Figure 5(e)). In addition, we observe an interesting side effect: differently from the uniform noise induced by JPEG, the adversarial noise resulting from our defense forms small cross-shaped patterns that are easier to spot. Finally, in Figure 5(b) we show the effect of our defense applied to the clean image. The defense does not degrade the image quality, preserving its details and allowing a correct classification.
4.3 White-box Attacks
In this section, we present results on three white-box attacks, namely BPDA and EOT-BPDA [4], C&W [8], and DeepFool [38], using our full defense. All the following experiments are conducted on the full set of 1,000 images.
EOT-BPDA. Our defense includes the following components: non-differentiable operators, i.e., JPEG, a deterministic module with its own parameters, i.e., the AR-GAN, and a randomization criterion acting on both the quality factors and the parameters of the GAN. When using a defense mechanism that applies a random input transformation \(h()\), drawn from a distribution \(\mathcal {T}\), before the classifier \(f()\), EOT optimizes with respect to the expectation over the transformations \(\mathbb {E}_{h\sim \mathcal {T}}f(h(\mathbf {x}))\). A PGD-like attack, such as BPDA, can then be applied observing that \(\nabla \mathbb {E}_{h\sim \mathcal {T}}f(h(\mathbf {x})) = \mathbb {E}_{h\sim \mathcal {T}} \nabla f(h(\mathbf {x}))\). The expectation is approximated by sampling from the distribution of \(h()\).
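In practice, the expectation is estimated with Monte Carlo samples. A simplified sketch of the resulting gradient estimator (with `defend` standing in for a random draw of \(h \sim \mathcal {T}\), `model` for \(f()\), and, for brevity, the whole defense passed through the `bpda_apply` straight-through wrapper sketched above, whereas in the actual attack only the JPEG step needs the identity approximation) is:

```python
import torch
import torch.nn.functional as F

def eot_gradient(x, label, model, defend, n_samples=10):
    """Monte Carlo estimate of grad E_{h~T}[loss(f(h(x)))].

    Each call to `defend` draws a fresh random transformation (QF and AR-GAN),
    so averaging the per-sample gradients approximates the EOT gradient. The
    non-differentiable parts are handled via the BPDA wrapper.
    """
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        x_in = x.clone().detach().requires_grad_(True)
        transformed = bpda_apply(x_in, defend)      # identity backward (BPDA)
        logits = model(transformed.unsqueeze(0))
        loss = F.cross_entropy(logits, label.unsqueeze(0))
        loss.backward()
        grad += x_in.grad
    return grad / n_samples
```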
Randomizing the quality factor alone can easily prove ineffective against this strategy: the result of applying JPEG to an input image is deterministic once the QF is fixed, so an attacker who applies EOT by sampling from a set of quality factors can obtain meaningful gradients. The same applies when restoring the compressed image with the AR-GAN, as its output is conditioned on the image and, in turn, on the quality factor. However, we argue that changing the parameters \(\theta\) of \(g(\mathbf {x}, \theta)\) prevents an effective estimation of the EOT. A GAN \(g(\mathbf {x}, \theta)\) can be seen as a process that generates samples from an underlying data distribution parameterized by \(\theta\), so changing the parameters \(\theta\) is equivalent to sampling from a different data distribution. This implies that EOT should approximate the expectation over multiple distributions \(g(\mathbf {x}, \theta _{q_1}), \dots , g(\mathbf {x}, \theta _{q_n})\).
We perform EOT in our experiments by applying the defense 10 times at each iteration of the attack, so as to average gradient information over different parameterizations \(\theta\). Results are reported in Table 1. They clearly show that, using EOT, BPDA can break our defense when a fixed quality factor (QF = 40) is used. Instead, when randomizing the quality factor, and thus using multiple AR-GANs, the robustness increases substantially, proving that our proposal is fundamental to achieving robustness to this attack. Note that we could not randomize the QF without also changing the AR-GAN, as each AR-GAN is trained for a specific factor.
In Table 1 we also report the accuracy obtained against generalized BPDA, which proved more effective than EOT against our defense. This reinforces our claim that iteratively changing the AR-GAN makes it difficult to estimate the expectation over multiple distributions. Our defense cannot completely protect from BPDA, though; as discussed in [4], similarly to natural images, adversarial examples can still be found on a GAN manifold. Indeed, iterating the compression-restoration steps with a fixed QF provided only a marginal improvement. However, when using multiple AR-GANs, finding an effective adversarial sample, i.e., one lying in the intersection of the different manifolds, is much more complex and leads to significant image distortion, as shown in Figure 5. In addition to the baseline results reported in Figures 3 and 4, we also report results of other previous methods, namely Bit-Depth reduction [51], Quilting and Total Variation Minimization (TVM) [23], and the DNN-oriented JPEG compression method of [35]. The latter is particularly related to our method, as it tries to specialize the JPEG operator by increasing the quantization of malicious features. While methods based on simple input transformations are completely circumvented by BPDA, the DNN-oriented JPEG method shows a certain degree of robustness to the attack. Still, our method outperforms it, obtaining a higher classification accuracy.
Our defense mechanism obfuscates the gradient by changing the underlying AR-GAN parameters at each iteration. An attacker aware of this strategy could try to circumvent it by ignoring the gradient of the restoration step and applying the same solution used for JPEG, i.e., approximating its backward pass with the identity function. We report results for this attempt in Table 1 (BPDA-ID). As expected, in this scenario the defense is more robust, and the classification accuracy increases.
Carlini & Wagner (C&W) [
8] is a strong iterative attack that aims at finding adversarial samples by minimizing the
\(\ell _2\) perturbation with respect to an auxiliary variable instead of the original image directly. A constant
\(c\) controls the tradeoff between perturbation and effectiveness of the attack, which is usually found by grid-search. We found the
\(c\) value resulting in perturbations within the range 0.005 – 0.01 being
\(c=0.1\), and used it in our experiments. Results are reported in Table
2 (left). The proposed defense demonstrates robust, comparing to the baseline. We remark that we experimented with the defense in full white-box, differently from other approaches that cannot be directly compared as they were conceived to be used in a gray-box setting [
23,
39]. We instead report results obtained with the recent ComDefend [
28] method. It uses a trained network instead of simple JPEG to compress the image, and then restores it with an additional trained module. In [
28], the networks are trained either including adversarial samples in the training, or not. For a fair comparison, since we do not use adversarial images to train the AR-GAN, we report the results obtained in the latter case. ComDefend demonstrates fairly robust to the attack, reporting a loss of classification accuracy of 16 points. In comparison, our method only loses 1.4% accuracy with respect to non-attacked images, which demonstrates a substantial improvement.
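For reference, the standard C&W \(\ell _2\) objective with the auxiliary variable \(\mathbf {w}\) mentioned above can be written, following [8] (untargeted case, with logits \(Z\), true label \(y\), and confidence margin \(\kappa\)), as
\[ \min_{\mathbf {w}} \; \big\| \tfrac{1}{2}\big(\tanh(\mathbf {w})+1\big) - \mathbf {x} \big\|_2^2 + c \cdot f\Big(\tfrac{1}{2}\big(\tanh(\mathbf {w})+1\big)\Big), \qquad f(\mathbf {x}') = \max\Big( Z(\mathbf {x}')_{y} - \max_{i \neq y} Z(\mathbf {x}')_{i},\, -\kappa \Big). \]
The \(\tanh\) change of variables keeps the adversarial image in a valid range, while larger values of \(c\) trade a larger perturbation for a higher success rate, which is why \(c\) is tuned to match the perturbation range above.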
DeepFool [38] aims at minimizing the \(\ell _2\) distance between the image and its adversarial counterpart. It is specifically designed to apply the minimum possible perturbation that makes the classifier commit a mistake. Because of this design, we cannot force it to reach a given perturbation budget. Results are reported in Table 2 (right). Our full defense attains good robustness compared to the baseline.
4.4 Black-Box Attacks
Our method was specifically designed to deal with white-box attacks that make use of gradient information to craft adversarial examples. However, following standard practice, we also evaluate it against black-box attacks. We choose two recent state-of-the-art attacks, namely Nattack [32] and SquareAttack [3]. For both, even though they do not exploit gradient information, we let the attacker query the full model, defense included.
Nattack [32] is a state-of-the-art attack that aims at finding a probability density distribution over a small region centered around the input, such that a sample drawn from this distribution is likely an adversarial example. We set the hyperparameters suggested in [32], that is, the variance of the Gaussian \(\sigma ^2 = 0.01\) and the learning rate \(\eta = 0.008\). The maximum number of iterations is set to 300, and the sample size to 100. We use the \(\ell _\infty\) version, with a perturbation budget of \(\ell _\infty = 0.062\). Results in Table 3 show the effectiveness of our defense, which remains protective even at larger distortions compared to other defenses.
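To give an intuition of this scheme, the sketch below shows a simplified NES-style update of the Gaussian mean (a bare-bones illustration of the distribution-search idea, not the exact parameterization or projection used in [32]; `attack_loss` is a hypothetical function that returns a loss computed from model queries only):

```python
import torch

def nes_mean_update(mu, attack_loss, sigma=0.1, lr=0.008, n_samples=100):
    """One NES-style update of the Gaussian mean mu.

    Samples perturbations around mu, queries the (black-box) loss, and moves
    mu along the score-function gradient estimate. Simplified illustration of
    the distribution-search idea; not the exact Nattack parameterization.
    """
    eps = torch.randn(n_samples, *mu.shape)                     # z_i ~ N(0, I)
    losses = torch.stack([attack_loss(mu + sigma * e) for e in eps])
    losses = (losses - losses.mean()) / (losses.std() + 1e-8)   # standardize
    grad_estimate = (losses.view(-1, *[1] * mu.dim()) * eps).mean(dim=0) / sigma
    return mu - lr * grad_estimate                              # descend the estimated loss
```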
Square Attack [3] is a state-of-the-art, score-based attack based on a randomized search scheme, which selects localized square-shaped updates at random positions so that at each iteration the perturbation lies approximately at the boundary of the feasible set. We run the \(\ell _2\) version of the attack from the original code, with the hyperparameters suggested in [3], that is, \(p=0.1\) and a perturbation budget of \(\epsilon _2 = 5\) (note that we use the \(\epsilon _2\) notation here because the attack considers the non-normalized version of \(\ell _2\)). For consistency with the previous settings, we also use a larger perturbation budget, \(\epsilon _2 = 10\). We let the attack run for 5,000 iterations, i.e., queries, and collect the generated adversarial images. We then apply the defense and report the classification accuracy. The results in Table 4 clearly show that our defense is extremely robust against this attack, even for the larger distortion.
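As an illustration of the randomized search scheme, one iteration could be sketched as follows (simplified to an \(\ell _\infty\)-style square update for readability; the experiments above use the \(\ell _2\) version, whose update rule is more involved, and `query_loss` is a hypothetical black-box loss based on model queries):

```python
import random

import torch

def square_attack_step(x_adv, x_orig, query_loss, eps=16 / 255, side=20):
    """One simplified square-update step of a randomized-search attack.

    Proposes an eps-magnitude, square-shaped change at a random location and
    keeps it only if the queried loss improves. Simplified l_inf-style sketch;
    the l_2 version used in our experiments follows the original code of [3].
    """
    _, h, w = x_adv.shape
    r, c = random.randrange(h - side), random.randrange(w - side)
    candidate = x_adv.clone()
    # Per-channel random sign, constant over the square (keeps delta at +-eps).
    signs = (torch.randint(0, 2, (x_adv.shape[0], 1, 1)) * 2 - 1).to(x_adv.dtype)
    candidate[:, r:r + side, c:c + side] = (x_orig[:, r:r + side, c:c + side]
                                            + eps * signs).clamp(0, 1)
    # Accept the proposal only if it lowers the (black-box) attack loss.
    return candidate if query_loss(candidate) < query_loss(x_adv) else x_adv
```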