
Understanding Impacts of Task Similarity on Backdoor Attack and Detection


arXiv:2210.06509v1 [cs.CR] 12 Oct 2022

Di Tang, Rui Zhu, XiaoFeng Wang, Haixu Tang, and Yi Chen
Indiana University Bloomington

Abstract

Despite extensive studies on backdoor attack and detection, fundamental questions remain unanswered regarding the limits of the adversary's capability to attack and the defender's capability to detect. We believe that answers to these questions can be found through an in-depth understanding of the relations between the primary task that a benign model is supposed to accomplish and the backdoor task that a backdoored model actually performs. For this purpose, we leverage similarity metrics in multi-task learning to formally define the backdoor distance (similarity) between the primary task and the backdoor task, and analyze existing stealthy backdoor attacks, revealing that most of them fail to effectively reduce the backdoor distance and that, even for those that do, much room is left to further improve their stealthiness. We further design a new method, called the TSA attack, to automatically generate a backdoored model under a given distance constraint, and demonstrate that our new attack indeed outperforms existing attacks, taking a step closer to understanding the attacker's limits. Most importantly, we provide both theoretical results and experimental evidence on various datasets for the positive correlation between the backdoor distance and backdoor detectability, demonstrating that our task similarity analysis indeed helps us better understand backdoor risks and has the potential to identify more effective mitigations.

1 Introduction

A backdoor is a function hidden inside a machine learning (ML) model, through which a special pattern on the model's input, called a trigger, can induce misclassification of the input. The backdoor attack is considered to be a serious threat to trustworthy AI, allowing the adversary to control the operations of an ML model, a deep neural network (DNN) in particular, for purposes such as evading malware detection [67], gaming a facial-recognition system to gain unauthorized access [50], etc.

Task similarity analysis on backdoor. With continued effort on backdoor attack and detection, this emerging threat has never been fully understood. Even though new attacks and detections continue to show up, they are mostly responding to some specific techniques, and therefore offer little insight into the best the adversary could do and the most effective strategies the detector could possibly deploy. Such understanding is related to the similarity between the primary task that a benign model is supposed to accomplish and the backdoor task that a backdoored model actually performs, which is fundamental to distinguishing between a backdoored model and its benign counterpart. Therefore, a Task Similarity Analysis (TSA) between these two tasks can help us calibrate the extent to which a backdoor is detectable (differentiable from a benign model) by not only known but also new detection techniques, inform us which characteristics of a backdoor trigger contribute to the improvement of the similarity, thereby making the attack stealthy, and further guide us to develop even stealthier backdoors so as to better understand what the adversary could possibly do and what the limitation of detection could actually be.

Methodology and discoveries. This paper reports the first TSA on backdoor attacks and detections. We formally model the backdoor attack and define backdoor similarity based upon the task similarity metrics utilized in multi-task learning, to measure the similarity between the backdoor task and its related primary task. On top of the metric, we further define the concept of α-backdoor to compare the backdoor similarity across different backdoors, and present a technique to estimate the α for an attack in practice. With the concept of α-backdoor, we analyze representative attacks proposed so far to understand the stealthiness they intend to achieve, based upon their effectiveness in increasing the backdoor similarity. We find that current attacks only marginally increase the overall similarity between the backdoor task and the primary task, because they fail to simultaneously increase the similarity of inputs and that of outputs between these two tasks. Based on this finding, we develop a new attack/analysis technique, called the TSA attack, to automatically generate a backdoored model under a given similarity constraint. The new technique is found to be much stealthier than existing attacks, not only in terms of backdoor similarity, but also in terms of its effectiveness in evading existing detections, as observed in our experiments. Further, through theoretical analysis as well as extensive experimental studies on four datasets under six representative detections, using our TSA attack together with five representative attacks proposed in prior research, we demonstrate that a backdoor with high backdoor similarity is indeed hard to detect.

Existing detections can be categorized by the model information they focus on: model outputs, model weights, and model inputs. This categorization has been used in our research to analyze different detection approaches (Section 4). More specifically, detection on model outputs captures backdoored models by detecting the difference between the outputs of backdoored models and benign models on some inputs. Such detection methods include NC [77], K-ARM [68], MNTD [83], Spectre [27], TABOR [26], MESA [58], STRIP [22], SentiNet [13], ABL [43], ULP [38], etc. Detection on model weights finds a backdoored model by distinguishing its model weights from those of benign models. Such detection approaches include ABS [48], ANP [80], NeuronInspect [31], etc. Detection on model inputs identifies a backdoored model by detecting the difference between inputs on which a backdoored model and a benign model output similarly. Prominent detections in this category include SCAn [72], AC [11], SS [74], etc.

Contributions. Our contributions are as follows:

• New direction on backdoor analysis. Our research brings a new aspect to backdoor research, through the lens of backdoor similarity. Our study reveals the great impact backdoor similarity has on both backdoor attack and detection, which can potentially help determine the limits of the adversary's capability in a backdoor attack and therefore enable the development of the best possible response.

• New stealthy backdoor attack.
Based upon our understanding on backdoor similarity, we developed a novel technique, TSA attack, to generate a stealthy backdoor under a given backdoor similarity constraint, helping us better understand the adversary’s potential and more effectively calibrate the capability of backdoor detections, 2 2.1 2.3 We focus on backdoors for image classification tasks, while assuming a white-box attack scenario where the adversary can access the training process. The attacker inject the backdoor to accomplish the goal formally defined in Section 3.2 and evade from backdoor detections. The backdoor defender aim to distinguish backdoored models from benign models. She can white-box access those backdoored models and owns a small set of benign inputs. Besides, the defender may obtain a set of mix inputs containing a large number of benign inputs together with a few trigger-carrying inputs, however which inputs carried the trigger in this set is unknown to her. Background Neural Network We model a neural network model f as a mapping function from the input space X to the output space Y , i.e., f : X 7→ Y . Further, the model f can be decomposed into two sub-functions: f (x) = c(g(x)). Specifically, for a classification task with L classes where the output space Y = {0, 1, ..., L − 1}, we define g : X 7→ [0, 1]L , c : [0, 1]L 7→ Y and c(g(x)) = arg max j g(x) j where g(x) j is the j-th element of g(x). According to the common understanding, after well training, g(x) approximates the conditional probability of presenting y given x, i.e., g(x)y ≈ Pr(y|x), for y ∈ Y and x ∈ X . 2.2 Threat Model 3 TSA on Backdoor Attack Not only does a backdoor attack aim at inducing misclassification of trigger-carrying inputs to a victim model, but it is also meant to achieve high stealthiness against backdoor detections. For this purpose, some attacks [17, 49] reduce the L p -norm of the trigger, i.e., kA(x) − xk p , to make triggercarrying inputs be similar to benign inputs, while some others construct the trigger using benign features [46, 66]. All these tricks are designed to evade specific detection methods. Still less clear is the stealthiness guarantee that those tricks can provide against other detection methods. Understanding such stealthiness guarantee requires to model the detectability of backdoored models, which depends on measuring fundamental differences between backdoored and benign models that was not studied before. To fill in this gap, we analyze the difference between the task a backdoored model intends to accomplish (called backdoor task) and that of its benign counterpart (called primary task), which indicates the detectability of the backdoored model, as demonstrated by our experimental study (see Section 4). Between these two tasks, we define the concept of backdoor similarity – the similarity between the primary and the backdoor task, by leveraging the task similarity metrics Backdoor Attack & Detection Backdoor attack. In our research, we focus on targeted backdoors that cause the backdoor infected model fb to map trigger-carrying inputs A(x) to the target label t different from the ground truth label of x [5, 59, 77, 82]: fb (A(x)) = t 6= fP (x) (1) where fP is the benign model that outputs the ground truth label for x and A is the trigger function that transfers a benign input to its trigger-carrying counterpart. There are many attack methods have been proposed to inject backdoors, e.g., [12, 14, 20, 47, 49, 50, 61, 72]. Backdoor detection. 
The backdoor detection has been extensively studied recently [21, 25, 35, 44, 78]. These proposed approaches can be categorized based upon their focuses on different model information: model outputs, model weights 2 used in multi-task learning studies, and further demonstrate how to compute the similarity in practice. Applying the metric to existing backdoor attacks, we analyze their impacts on the backdoor similarity, which consequently affects their stealthiness against detection techniques (see Section 4). We further present a new algorithm that automatically generates a backdoored model under a desirable backdoor similarity, which leads to a stealthier backdoor attack. separate two distributions can be approximated with a neural network to distinguish them. Using the dH −W 1 distance, we can now quantify the similarity between tasks. In particular, dH −W 1 (DT 1 , DT 2 ) = 0 indicates that tasks T 1 and T 2 are identical, and dH −W 1 (DT 1 , DT 2 ) = 1 indicates that these two tasks are totally different. Without further notice, we consider the task similarity between T 1 and T 2 as 1 − dH −W 1 (DT 1 , DT 2 ). 3.1 3.2 Task Similarity Backdoor detection, essentially, is a problem about how to differentiate between a legitimate task (primary task) a model is supposed to perform and the compromised task (backdoor task), which involves backdoor activities, the backdoored model actually runs. To this end, a detection mechanism needs to figure out the difference between these two tasks. According to modern learning theory [54], a task can be fully characterized by the distribution on the graph of the function [8] – a joint distribution on the input space X and the output space Y . Formally, a task T is characterized by the joint distribution DT : T := DT (X , Y ) = {PrDT (x, y) : (x, y) ∈ X × Y }. Note that, for a well-trained model f = c ◦ g (defined in Section 2.2) for task T , we have g(x)y ≈ PrDT (y|x) for all (x, y) ∈ X × Y . With this task modeling, the mission of backdoor detection becomes how to distinguish the distribution of a backdoor task from that of its primary task. The Fisher’s discriminant theorem [52] tells us that two distributions become easier to distinguish when they are less similar in terms of some distance metrics, indicating that the distinguishability (or separability) of two tasks is positively correlated with their distance. This motivates us to measure the distance between the distributions of two tasks. For this purpose, we define the dH −W 1 distance, which covers both Wasserstein-1 distance and H-divergence, two most common distance metrics for distributions. Following we first define primary task and backdoor task and then utilize dH −W 1 to specify backdoor similarity, that is, the similarity between the primary task and the backdoor task. Backdoor attack. As mentioned earlier (Section 2.2), the well-accepted definition of the backdoor attack is specified by Eq. 1 [5, 13, 57, 59, 72, 77, 82]. According to the definition, the attack aims to find a trigger function A(·) that maps benign inputs to their trigger-carrying counterparts and also ensures that these trigger-carrying inputs are misclassfied to the target class t by the backdoor infected model fb . In particular, Eq. 1 requires the target class t to be different from the source class of the benign inputs, i.e., t 6= fP (x). This definition, however, is problematic, since there exists a trivial trigger function satisfying Eq. 
1, i.e., A(·) simply replaces a benign input x with another benign input xt in the target class t. Under this trigger function, even a completely clean model fP (·) becomes “backdoored”, as it outputs the target label on any “trigger-carrying” inputs xt = A(x). Clearly, this trivial trigger function does not introduce any meaningful backdoor to the victim model, even though it satisfies Eq. 1. To address this issue, we adjust the objective of the backdoor attack (Eq. 1) as follows: fb (A(x)) = t, where fP (x) 6= t 6= fP (A(x)). where H = {h : X × Y 7→ [0, 1]}. (2) Proposition 1. (D , D ′ ) 0 ≤ dH −W 1 ≤ 1, dW 1 (D , D ′ ) ≤ dH −W 1 (D , D ′ ) = 21 dH (D , D ′ ), (4) Here, the constraint fP (x) 6= t 6= fP (A(x)) requires that under the benign model fP , not only the input x but also its triggercarrying version A(x) will not be mapped to the target class t, thereby excluding the trivial attack mentioned above. Generally speaking, the trigger function A(·) may not work on a model’s whole input space. So we introduce the concept of backdoor region: Definition 1 (dH −W 1 distance). For two distributions D and D ′ defined on X × Y , dH −W 1 (D , D ′ ) measures the distance between them two as: dH −W 1 (D , D ′ ) = sup [EPrD (x,y) h(x, y) − EPrD ′ (x,y) h(x, y)], h∈H Backdoor Similarity Definition 2 (Backdoor region). The backdoor region B ⊂ X of a backdoor with the trigger function A(·) is the set of inputs on which the backdoored model fb satisfy Eq. 4, i.e., ( t, 6= fP (A(x)), 6= fP (x), ∀x ∈ B fb (A(x)) = fP (A(x)), ∀x ∈ X \ B . (5) Accordingly, we denote A(B) = {A(x) : x ∈ B} as the set of trigger-carrying inputs. (3) where dW 1 (D , D ′ ) is the Wasserstein-1 distance [4] between D and D ′ , and dH (D , D ′ ) is their H -divergence [7]. Proof. See Appendix 10.1 Proposition 1 shows that dH −W 1 is representative: it is the upper-bound of the Wasserstein-1 distance and the half of the H -divergence. More importantly, dH −W 1 can be easily computed: the optimal function h in Eq. 2 that maximally For example, the backdoor region of a source-agnostic backdoor, which maps the trigger-carrying input A(x) whose label under the benign model is not t into t, is B = X \ (X fP (x)=t ∪ 3 dH −W 1 (DP , DA,B ,t ) = X fP (A(x))=t ), while the backdoor region for a source-specific backdoor, which maps the trigger-carrying input A(x) with the true label of the source class s (6= t) into t, is B = X fP (x)=s \ X fP (A(x))=t . Here, we use XC to denote the subset of all elements in X that satisfy the condition C: XC = {x|x ∈ X , C is True}, e.g., X fP (x)=t = {x|x ∈ X , fP (x) = t}. Definition of the primary and backdoor tasks. Now we can formally define the primary task and the backdoor task for a backdoored model. Here we denote the prior probability of input x (also the probability of presenting x on the primary task) by Pr(x). Definition 3 (Primary task & distribution). The primary task of a backdoored model is TP , the task that its benign counterpart learns to accomplish. TP is characterized by the primary distribution DP , a joint distribution over the input space X and the output space Y . Specifically, PrDP (x, y) is the probability of presenting (x, y) in benign scenarios, and thus PrDP (y|x) = PrDP (x, y)/Pr(x) is the conditional probability that a benign model strives to approximate. R max(Prgain (x, y), 0), (x,y)∈A(B )×Y where Prgain (x, y) = PrDA,B ,t (x, y) − PrDP (x, y). Proof. See Appendix 10.2. 
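As remarked after Proposition 1, the supremum in Definition 1 can be approximated by training a small network h : X × Y → [0, 1] to separate finite samples drawn from the two task distributions. The sketch below illustrates that idea in PyTorch; the featurization (flattened input concatenated with a one-hot label), the network width, and the optimizer settings are our own illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

def estimate_dHW1(samples_D, samples_Dprime, num_classes, epochs=200, lr=1e-3):
    """Approximate d_{H-W1}(D, D') = sup_h E_D[h(x,y)] - E_D'[h(x,y)], h : X x Y -> [0,1],
    by training a small network h on finite samples (X, y) from the two task
    distributions and maximizing the expectation gap."""
    def featurize(X, y):
        one_hot = torch.eye(num_classes)[y]          # encode y; one illustrative choice
        return torch.cat([X.flatten(1), one_hot], dim=1)

    zD = featurize(*samples_D)
    zDp = featurize(*samples_Dprime)
    h = nn.Sequential(nn.Linear(zD.shape[1], 128), nn.ReLU(),
                      nn.Linear(128, 1), nn.Sigmoid())   # h maps into [0,1]
    opt = torch.optim.Adam(h.parameters(), lr=lr)
    for _ in range(epochs):
        gap = h(zD).mean() - h(zDp).mean()
        opt.zero_grad()
        (-gap).backward()                             # maximize the gap
        opt.step()
    with torch.no_grad():
        return float((h(zD).mean() - h(zDp).mean()).clamp(0.0, 1.0))

# Sanity check: identical task samples should give a distance near 0.
X = torch.rand(256, 3, 8, 8); y = torch.randint(0, 10, (256,))
print(estimate_dHW1((X, y), (X, y), num_classes=10))
```

As expected from the definition, feeding the same samples for both tasks drives the estimated distance to zero, while well-separated distributions push it toward one.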
Theorem 2 shows that the calculation of backdoor distance dH −W 1 (DP , DA,B ,t ) can be reduced to the calculation of the probability gain of PrDA,B ,t (x, y) over PrDP (x, y) on those trigger-carrying inputs A(B ), when ZA,B ,t ≥ 1. Notably, because ZA,B ,t = 1 − Pr(A(B )) + β Pr(B ), ZA,B ,t ≥ 1 is satisfied if Pr(A(B )) ≤ β Pr(B ). This implies that if those triggercarrying inputs show up more often on the backdoor distribution than on the primary distribution, we can use the aforementioned method to compute the backdoor distance. Parametrization of backdoor distance. The following Lemma further reveals the impacts of two parameters β and κ on the backdoor distance: Lemma 3. When, ZA,B ,t ≥ 1 and Pr(B ) = κ Pr(A(B )), R ] dH −W 1 (DP , DA,B ,t ) = Pr(B ) max(Pr gain (x, y), 0), Definition 4 (Backdoor task & distribution). The backdoor task of a backdoored model is denoted by TA,B ,t , the task that the adversary intends to accomplish by training a backdoored model. TA,B ,t is characterized by the backdoor distribution DA,B ,t , a joint distribution over X × Y . Specifically, the probability of presenting (x, y) in DA,B ,t is PrDA,B ,t (x, y) = P(x, y)/ZA,B ,t , where ZA,B ,t = R (x,y)∈X ×Y P(x, y) = 1 − Pr(A(B )) + β Pr(B ) and ( PrDA,B ,t (y|x) Pr(A−1 (x))β, x ∈ A(B ) P(x, y) = PrDP (x, y), x ∈ X \ A(B ). (6) Here, A−1 (x) = {z|A(z) = x} represents the inverse of the trigger function, PrDA,B ,t (y|x) is the conditional probability that the adversary desires to train a backdoored model to approximate, β is a parameter selected by the adversary to amplify the probability that the trigger-carrying inputs A(x) are presented β to the backdoor task. Actually, we consider 1+β as the poisoning rate with the assumption that poisoned training data is randomly drawn from the backdoor distribution. Finally, it is worth noting that PrDA,B ,t (x, y) is proportional to PrDP (x, y) except on those trigger-carrying inputs A(B ). (x,y)∈A(B )×Y ] where Pr gain (x, y) equals to PrD (x) β A,B ,t ZA,B ,t PrD (A(B )) A,B ,t Pr(x) PrDA,B ,t (y|x) − κ1 Pr(A( B )) PrDP (y|x). (7) Proof. The derivation is straightforward, thus we omit it. As demonstrated by Lemma 3, the two parameters β and κ are important to the backdoor distance, where β is related to the poisoning rate (Definition 4) and κ describes how close is the probability of presenting trigger-carrying inputs to the probability of showing their benign counterparts on the primary distribution (the bigger κ the farther away are these two probabilities). Let us first consider the range of β. Intuitive, a large β causes the trigger-carrying inputs more likely to show up on the backdoor distribution, and therefore could be easier detected. A reasonable backdoor attack should keep β smaller than 1, which is equivalent to constraining the poisoning rate β ( 1+β ) below 50%. On the other hand, a very small β will make the backdoor task more difficult to learn by a model, which eventually reduces the attack success rate (ASR). A reasonable backdoor attack should use a β greater than κ1 : that is, the chance of seeing trigger-carrying inputs on the backdoor distribution no lower than that on the primary distribution. Therefore, we assume κ1 ≤ β ≤ 1. Next, we consider the range of κ. 
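Before turning to the range of κ, the role of β and the normalizer Z_{A,B,t} can be made concrete with a few lines. The snippet below evaluates Z_{A,B,t} = 1 − Pr(A(B)) + β Pr(B) under the Lemma 3 setting Pr(B) = κ Pr(A(B)), together with the implied poisoning rate β/(1+β) from Definition 4; the function name and the numbers in the example are purely illustrative.

```python
import math

def backdoor_normalizer(pr_B, beta, kappa):
    """Z_{A,B,t} = 1 - Pr(A(B)) + beta * Pr(B), with Pr(B) = kappa * Pr(A(B)).
    Returns Z and the implied poisoning rate beta / (1 + beta) from Definition 4."""
    pr_AB = pr_B / kappa                      # Pr(A(B))
    Z = 1.0 - pr_AB + beta * pr_B
    return Z, beta / (1.0 + beta)

# Illustrative numbers only: Pr(B) = 0.1, ln(kappa) = 6, beta = 0.1.
# Z >= 1 holds exactly when Pr(A(B)) <= beta * Pr(B), i.e. when beta >= 1/kappa.
Z, rate = backdoor_normalizer(pr_B=0.1, beta=0.1, kappa=math.exp(6.0))
print(f"Z = {Z:.4f}, poisoning rate = {rate:.3f}, Z >= 1: {Z >= 1.0}")
```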
A reasonable lower-bound of κ is 1; if κ < 1, triggercarrying inputs show up even more often than their benign counterparts on the primary distribution, which eventually lets the backdoored model outputs differently from benign models on such large portion of inputs and make the backdoor be easy detected. So, we assume κ ≥ 1. With above assumptions on the range of β and κ, we get the following theorem to describe the range of backdoor distance. Formalization of backdoor similarity. Putting together the definitions of the primary task, the backdoor task, and the dH −W 1 distance between the two tasks (Eq. 2), we are ready to define backdoor similarity as follows: Definition 5 (Backdoor distance & similarity). We define dH −W 1 (DP , DA,B ,t ) as the backdoor distance between the primary task TP and the backdoor task TA,B ,t and 1 − dH −W 1 (DP , DA,B ,t ) as the backdoor similarity Theorem 2 (Computing backdoor distance). When ZA,B ,t ≥ 1, where ZA,B ,t is defined in Eq. 6, the backdoor distance between DP and DA,B ,t is 4 Theorem 4 (Backdoor distance range). Supposing Pr(B ) = κ Pr(A(B )), when κ ≥ 1 and κ1 ≤ β ≤ 1, we have ZA,B ,t ≥ 1, ( Z β − κ1 (1 − S)) Pr(B ) ≤ dH −W 1 (DP , DA,B ,t ) ≤ A,B ,t R where S = β ZA,B ,t max{∆ prob , 0} and (x,y)∈A(B )×Y ∆ prob = PrD PrD A,B ,t A,B ,t (x) (A(B )) Pr(x) PrDA,B ,t (y|x) − Pr(A( B )) PrDP (y|x). Also, PrDA,B ,t (y|x) and PrDP (y|x) can be approximated by a well-trained backdoored model fb = cb ◦ gb and a welltrained benign model fP = cP ◦ gP , respectively, i.e., gb (x)y ≈ Pr(B ), PrDA,B ,t (y|x) and gP (x)y ≈ PrDP (y|x). Supposing that we have sampled m trigger-carrying inputs {A(x1 ), A(x2 ), ..., A(xm )}, α can be approximated by: β L−1 gb (A(xi ))y − κ1 gP (A(xi ))y , 0}. α ≈ ∑m i=1 ∑y=0 max{ Z A,B ,t (8) (9) In Eq 9, β is chosen by the adversary. Thus, we assume that β is known, when using α to analyze different backdoor attacks. Different from β, κ is determined by the trigger function A that distinguishes different backdoor attacks from each other. Next, we demonstrate how to estimate κ. Proof. See Appendix 10.3. Corollary 5 (Effects of β). Supposing Pr(B ) = κ Pr(A(B )), κ ≥ 1 and κ is fixed, when β varies in range [ κ1 , 1], we have S Pr(B ) κ ≤ dH −W 1 (DP , DA,B ,t ) ≤ κ Pr(B ) κ+κ Pr(B )−Pr(B ) , where S is defined in Theorem 4. Specially, the lowerbound S Pr(κ B ) is achieved when β = κ1 , and the upper-bound κ Pr(B ) κ+κ Pr(B )−Pr(B ) is achieved when β = 1. transformations, we get that κ = V (B ) and V (A(B )) are the volumes of set B and A(B ) respectively. Below, we demonstrate how to estimate Pr(x) and the (B ) volume ratio κV = V V(A( B )) separately. To estimate the prior probability of an input x for the primary task, Pr(x), we employed a Generative Adversarial Network (GAN) [34] and the GAN inversion [81] algorithms. Specifically, we aim to build a generative network G and a discriminator network D using adversarial learning: the discriminator D attempts to distinguish the outputs of G and the inputs (e.g., the training samples) x of the primary task, while G takes as the input z randomly drawn from a Gaussian distribution with the variance matrix I, i.e., z ∼ N(0, I) and attempts to generate the outputs that cannot be distinguished by D. When the adversarial learning converges, the output of G approximately follows the prior probability distribution of x, i.e., Pr(x) ≈ Pr(G(z) = x)). In addition, we incorporated with a GAN inversion algorithm capable of recovering the input z of G from a given x, s.t., G(z) = x. 
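Given estimates of β, κ and Z_{A,B,t}, the sampling-based estimator of α in Eq. (9) is straightforward to implement. A minimal sketch follows; the function name is ours, and averaging over the m sampled trigger-carrying inputs is our Monte-Carlo reading of the estimator.

```python
import torch

def approx_alpha(gb, gP, trig_inputs, beta, Z, kappa):
    """Sampling-based estimate of alpha in the spirit of Eq. (9): for trigger-carrying
    inputs A(x_i), accumulate sum_y max{(beta/Z)*gb(A(x_i))_y - (1/kappa)*gP(A(x_i))_y, 0}
    and average over the samples. gb / gP are the probability heads (g in f = c o g)
    of the backdoored and the benign model, returning (m, L) class probabilities."""
    with torch.no_grad():
        gain = (beta / Z) * gb(trig_inputs) - (1.0 / kappa) * gP(trig_inputs)
        per_sample = torch.clamp(gain, min=0.0).sum(dim=1)   # sum over the L classes
    return float(per_sample.mean())
```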
Combining the GAN and the inversion algorithm, we can estimate Pr(x) for a given x: we first compute z from x using the GAN inversion algorithm, and then estimate Pr(x) using PrN(0,I) (z). To estimate the volume ratio κV , we use a Monte Carlo algorithm similar to that proposed by the prior work [32]. Briefly speaking, for estimating V (B ), we first randomly select an x in the backdoor region B as the origin, and then uniformly sample many directions from the origin and approximate the extent (how long from the origin to the boundary of B ) along these directions, and finally, calculate the expectation of the extents of these directions as Ext(B ). According to the prior work [32], V (B ) is approximately equal to the product of Ext(B ) and the volume of the n dimensional unit sphere, Ext(B ) assuming B ⊂ Rn . Therefore, we estimate κV by Ext(A( B )) . Proof. See Appendix 10.4. Corollary 6 (Effects of κ). Supposing Pr(B ) = κ Pr(A(B )), β ≤ 1 and β is fixed, when κ varies in range [ β1 , ∞), we have Sβ Pr(B ) ≤ dH −W 1 (DP , DA,B ,t ) ≤ β Pr(B ), where S is defined in Theorem 4. Specially, the lower-bound Sβ Pr(B ) and the upper-bound β Pr(B ) are achieved, respectively, when κ = β1 . Proof. See Appendix 10.5. 3.3 α-Backdoor Definition of α-backdoor. Through Lemma 3 and Theorem 4, we show that the backdoor distance and its boundaries are proportional to Pr(B ), the probability of showing benign inputs in the backdoor region B on the prior distribution of inputs. However, different backdoor attacks may have different backdoor regions, which is a factor we intend to remove so as to compare the backdoor similarities across different attacks. For this purpose, here we define α-backdoor, based upon the same backdoor region B for different attacks, as follows: Definition 6 (α-backdoor). We define an α-backdoor as a backdoor whose backdoor distribution is DA,B ,t , primary distribution is DP and the associated backdoor distance equals to the product of α and Pr(B ), i.e., α · Pr(B ) = dH −W 1 (DP , DA,B ,t ). Approximation of α. Lemma 3 actually provides an approach to approximate α in practice. Specifically, using the ] symbol Pr gain that has been defined in Eq. 7, we get a simR ] max(Pr ple formulation of α: α = gain (x, y), 0). (x,y)∈A(B )×Y Note that Pr(x) Pr(A(B )) = Pr(x|x ∈ A(B )) and PrD PrD A,B ,t A,B ,t (x) (A(B )) Pr(B ) Pr(A(B )) . Through trivial EPr(x|x∈B ) Pr(x) V (B ) V (A(B )) EPr(x|x∈A(B )) Pr(x) , where Estimation of κ. Recall that κ = In general, we estimate κ as = PrDA,B ,t (x|x ∈ A(B )). This enables us to approximate α through sampling only trigger-carrying inputs x ∈ A(B ). EPr(x|x∈B ) Pr(G−1 (x)) Ext(B ) Ext(A(B )) EPr(x|x∈A(B )) Pr(G−1 (x)) , where G−1 (x) represents the output of a GAN inversion algorithm for a given x. We defer the details to Appendix 9. 5 BadNet SIG WB CB IAB TSA 2.53 2.99 2.93 17.05 5.97 3.27 the difference in inputs and S characterizes the difference in outputs between the primary and backdoor distributions. Specifically, for each attack method, we generated sourcespecific backdoors (source class is 1 and target class is 0) following the settings described in its original paper but changing β to adjust the poisoning rate. In particular, for β = 0.1, we injected 500 poisoning samples into the source class (i.e., class 1) with a total of 5,000 samples in the training set. For each backdoored model, we calculated its ASR at different β values as illustrated in Table 1 to demonstrate the side effect of reducing β on ASR. 
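Returning briefly to the estimation machinery above: the Monte-Carlo volume estimate (pick an origin inside the region, shoot random directions, and average the distance to the boundary) can be sketched as below. The membership oracles, the maximum search radius, and the bisection depth are illustrative assumptions; in practice the oracles would be derived from the benign model f_P and the trigger function A per Definition 2.

```python
import torch

def expected_extent(origin, in_region, num_dirs=256, max_radius=10.0, bisect_steps=20):
    """Ext(R): from an origin known to lie inside region R (membership given by the
    in_region oracle), shoot random unit directions and bisect for the distance to the
    boundary along each one; return the mean distance. Assumes the boundary is reachable
    within max_radius and the region is roughly star-shaped around the origin."""
    extents = []
    for _ in range(num_dirs):
        d = torch.randn_like(origin)
        d = d / d.norm()                      # random unit direction
        lo, hi = 0.0, max_radius
        for _ in range(bisect_steps):         # bisection for the boundary crossing
            mid = 0.5 * (lo + hi)
            if in_region(origin + mid * d):
                lo = mid
            else:
                hi = mid
        extents.append(lo)
    return sum(extents) / len(extents)

def estimate_kappa_V(origin_B, in_B, origin_AB, in_AB, **kw):
    """kappa_V ~= Ext(B) / Ext(A(B)), following the Monte-Carlo scheme of Section 3.3."""
    return expected_extent(origin_B, in_B, **kw) / expected_extent(origin_AB, in_AB, **kw)
```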
As we see from the table, for BadNet, the ASR is 99.97% when β is 0.1, which goes down with the decrease of the β, until 38.85% when the β drops to 0.005, rendering the backdoor attack less meaningful. This also shows the rationale of keeping β ≥ κ1 , as required in Theorem 4 (here β = 0.005 ≈ κ2 ). As illustrated in Eq. 9, α is proportional to β, but has a more complicated relation with κ and S as further demonstrated in Theorem 4. To study this complicated relation between α and the parameters other then β, we normalize α by dividing it with β and present the results in Table 1. Next, we elaborate our analysis about how existing backdoor attacks reduce α through controlling these parameters. Figure 1: Demonstration of trigger-carrying inputs generated by different attacks. The first row shows attacks’ name, the second row presents trigger-carrying inputs, the third row shows triggers, the fourth row shows amplified triggers and the fifth row illustrates the L2 -norm of triggers. Table 1: Backdoor similarities of backdoor attacks. ASR stands for attack success rate, L2 -norm stands for the average of {kx − A(x)k2 : x ∈ B } after regularizing all x and A(x) into [0, 1]n . Note that, when β < κ1 , the “α/β” columns represent the S value (Eq. 8). β BadNet [24] SIG [6] WB [16] CB [46] IAB [55] TSA (ours) 3.4 0.1 99.97 98.10 83.38 86.69 98.09 99.85 ASR (%) 0.05 0.01 99.18 69.27 81.62 34.57 72.29 32.97 78.18 55.09 92.58 54.37 99.21 92.83 0.005 38.85 9.88 7.69 45.92 20.13 79.07 0.1 0.98 0.99 0.67 1.00 1.00 0.37 0.05 0.98 0.98 0.49 1.00 1.00 0.34 α/β 0.01 0.97 0.96 0.23 1.00 1.00 0.32 0.005 0.95 0.94 0.18 1.00 1.00 0.25 ln(κ) All 5.98 6.05 4.01 17.93 10.72 3.07 L2 -norm All 2.37 2.72 2.86 17.37 5.96 3.13 Visually-unrecognizable backdoors (BadNet). This kind of backdoor attacks generate trigger-carrying inputs visually similar to their benign counterparts, in an attempt to evade the human inspection for anomalous input patterns. Generally, visually-unrecognizable backdoors constrain the L p -norm of the trigger, i.e., kA(x) − xk p , to be smaller than a threshold. Essentially, reducing kA(x) − xk p is to reduce | Pr(x) − Pr(A(x))|, the difference between the probability of presenting a trigger-carrying input and the probability of presenting its benign counterpart. This is because | Pr(x) − Pr(x + δ)| ∝ kδk p , when the perturbation δ is small and the prior distribution of inputs is some kind of smooth. Recall that κ = Pr(B )/ Pr(A(B )), thus reducing kA(x) − xk p can reduce κ, as demonstrated in the last two columns of Table 1. However, making κ small alone cannot effectively reduce the α as demonstrated by Corollary 6. Thus, visually-unrecognizable backdoors only marginally reduce α and moderately increase the backdoor similarity, as observed by our analysis on BadNet (Table 1), which only lowers down α/β (the normalized α) by 0.05 to 0.95 when β = 0.005. Analysis on Existing Backdoor Attacks Existing stealthy backdoor attack methods can be summarized into five categories: visually-unrecognizable backdoors, labelconsistent backdoors, latent-space backdoors, benign-feature backdoors and sample-specific backdoors. In this section, we report a backdoor similarity analysis on these backdoor attacks, which is important to understanding their stealthiness, given the positive correlation between backdoor distance and detectability we discovered (Section 4). We compare the backdoor distance of backdoored models generated by 5 different attacks, each representing a different category, on CIFAR10 [39]. 
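The L2-norm column of Table 1 is simple to reproduce: it is the mean of ||x − A(x)||_2 over sampled backdoor-region inputs after regularizing both x and A(x) into [0, 1]^n. A minimal sketch, assuming a (N, C, H, W) batch layout:

```python
import torch

def average_trigger_l2(x, Ax):
    """Mean ||x - A(x)||_2 over a batch of backdoor-region inputs, after clamping
    both x and A(x) into [0,1]^n (the normalization used for Table 1)."""
    diff = x.clamp(0, 1).flatten(1) - Ax.clamp(0, 1).flatten(1)
    return float(diff.norm(dim=1).mean())
```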
As mentioned earlier (Theorem 4), the backdoor distance described by α is related to β, κ and S, where β is proportional to the poisoning rate (see Definition 4), that describes the adversary’s aspiration about how likely those trigger-carrying inputs present in the backdoor distribution in comparison with the probability of showing their benign counterparts in the primary distribution, κ also measures the difference between the probability of showing those trigger-carrying inputs and their benign counterparts however within the primary distribution, and S summarizes the conditional probability gain of the outputs given those trigger-carrying inputs obtained on the backdoor distribution compared with such conditional probability on the primary distribution. In simple words, β and κ together characterizes Label-consistent backdoor (SIG). The label-consistent backdoor attacks inject a backdoor into the victim model with only label-consistent inputs generated by pasting the trigger onto the vague (i.e., hard to be classified) inputs, in an attempt to increase the stealthiness against human inspection. Specifically, prior research [75] proposes to use GAN or adversarial examples to get hard-to-classify inputs, while SIG [6] utilizes a more inconspicuous trigger (small waves). However, we found that label-consistent backdoors do not reduce α more effectively, than the naive label-flipped back6 doors (e.g., BadNet), because injecting a backdoor through label-consistent way has changed neither κ nor S of this backdoor task away from that of injecting this backdoor through label-flipped way, as observed in our experiments where similar α/β (the normalized α) exhibited by these two types of backdoors (see the “SIG” and the “BadNet” rows in Table 1). Specifically, the BadNet and SIG attacks accomplished their backdoor tasks using similar triggers in terms of L2 -norm: BadNet uses a trigger with the L2 -norm of 2.37 and SIG utilizes a trigger with L2 -norm of 2.72. Apparently, the α/β values for the SIG and those for the BadNet are similar at all β values we tested. directly reduce S (defined in Eq. 8), but in the meantime, increase κ (since the trigger-carrying inputs becomes less likely to see from the primary distribution), and thus may not reduce the backdoor distance eventually, which has been shown by the “CB” row of Table 1. In addition, the benignfeature backdoors also increase the difficulty in learning the backdoor task (only 83.38% ASR achieved when β = 0.1). Sample-specific backdoors (IAB) The sample-specific backdoor attacks design the trigger specific to each input. As a result, if an input is given an inappropriate trigger, it will not trigger the backdoor. This kind of backdoors are designed to evade trigger inversion by increasing the difficulty in reconstructing the true trigger. The Input Aware Backdoor (IAB) [55] is a representative work in this category, which uses a trigger generation network to produce a sample-specific trigger. The attack methods proposed in the prior work [45] and [65] also belong to this category. A sample-specific backdoor requires that the trigger carries more information than the trigger of sample-agnostic backdoors, so as to enable the backdoored model to learn the complicated relations between triggers and the inputs. Thus, the trigger of the sample-specific backdoors may come with a large L2 -norm. 
As presented in Table 1, the L2 -norm of the trigger used by the IAB backdoor is 5.96, more than twice of the trigger for BadNet (2.37) in terms of L2 -norm. Such a large trigger renders the trigger-carrying inputs less likely to observe from the primary distribution, thereby reducing the similarity between the probability of seeing benign inputs and the probability of seeing trigger-carrying inputs on the primary distribution, and leading to the increase in κ (ln(κ) = 10.72) and the α/β (the normalized α). Latent-space backdoors (WB). The latent-space backdoor attacks aim to make the backdoored model produce similar latent features for trigger-carrying inputs and benign inputs. Prior research [84] proposes to use this idea to generate a student model that learns the backdoor injected in the teacher model under the transfer learning scenario. Later, this idea has been employed by the Wasserstein Backdoor (WB) [16] to increase the backdoor stealthiness against the latent space defense (e.g., AC [11]). Specifically, WB makes the distribution of the penultimate layer’s outputs (latent features) of trigger-carrying inputs as close to those of benign inputs as possible in terms of the sliced-Wasserstein distance [37]. Making latent features of trigger-carrying inputs and benign inputs be close is essentially to reduce S (defined in Eq. 8), the expectation of the conditional probability gain obtained by the backdoored model on trigger-carrying inputs, which is actually the lower-bound of the α when β = κ1 (Corollary 6). In this way, WB effectively reduces α/β (the normalized α) compared with other four types of backdoors as demonstrated in the “WB” row of Table 1. However, this α/β reduction achieved by WB comes with the cost of low ASRs, especially when β is low (ASR is only 7.69% when β = 0.005), which indicates that reducing S may make the trigger harder to learn. 3.5 New Attack In Section 3.4 and Table 1, we illustrated that the existing backdoor attacks did not effectively reduce the backdoor distance while keeping high attack success rate (ASR). Our analysis revealed that it is mainly due to three points: 1) most of these attacks did not reduce κ to a small value (e.g., in BadNet, SIG, CB and IAB); 2) the complicated triggers used by many attacks make the backdoor task hard to be learned (e.g., WB, CB and IAB); and 3) some missed to reduce S (e.g., BadNet, SIG, IAB). To address these issues, we aim to devise a new attack method that can handle all these points at one time. To reduce κ, the adversary should use a trigger function that maps a benign input to its close neighbor in terms of not only their L p -norm and but also their probabilities to be presented by the primary task. Using the trigger-carrying inputs with small L p -norm from the benign inputs may not unnecessarily lead to small κ; in fact, as shown in Table 1, the trigger used by BadNet lead to the trigger-carrying inputs with smaller L2 norms but higher κ compared to those by WB. On the other hand, to reduce S, the adversary should enable the backdoored model to generate similar conditional probabilities as the Benign-feature backdoors (CB) Benign-feature backdoor attacks aim to produce backdoored models that leverage features similar to those used by a benign model by constructing a trigger with a composite of benign features, thereby increasing the stealthiness of the backdoor against the backdoor detection techniques that distinguish the weights of backdoored models from those of benign models (e.g., ABS [48]). 
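The latent-feature alignment that WB minimizes (described in the latent-space paragraph above) is the sliced-Wasserstein distance between penultimate-layer features of trigger-carrying and benign inputs. Below is a minimal sort-based sketch of that distance; it assumes equally sized feature batches and uses random unit projections, which is one common way to compute it, not necessarily WB's exact implementation.

```python
import torch

def sliced_wasserstein(feats_trigger, feats_benign, num_projections=128):
    """Sliced Wasserstein-1 distance between two equally sized sets of latent features
    (e.g., penultimate-layer outputs of trigger-carrying vs. benign inputs): project onto
    random unit directions and average the sort-based 1-D Wasserstein distances."""
    d = feats_trigger.shape[1]
    theta = torch.randn(d, num_projections)
    theta = theta / theta.norm(dim=0, keepdim=True)   # unit projection directions
    proj_a, _ = torch.sort(feats_trigger @ theta, dim=0)
    proj_b, _ = torch.sort(feats_benign @ theta, dim=0)
    return float((proj_a - proj_b).abs().mean())
```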
A representative work in this category is Composite Backdoor (CB) [46], which mixes two benign inputs from specific classes into one, and then trains the backdoored model to predict the target labels on these mixed inputs. In another example [51], the adversary constructs the trigger using the reflection features hiding in the input images. The training inputs with benign features from those in different classes could render the marginal backdoor distribution on inputs significantly deviating from the distribution of benign inputs, making this backdoor even easier to detect. When it comes to backdoor similarity, benign-feature backdoors in7 LC (A, ζ, ω) = LA,B ,t ( fP , −α∗ , β) + ω max{LA (C) − ζ, 0}, benign models of the outputs given those trigger-carrying inputs. Finally, the adversary should use a trigger function that can be easily learned; using a complex trigger function as used by WB lead to the backdoored model with low ASR when the poison rate is low (i.e., β is small). At a first glance, it appears impossible to reduce κ and S simultaneously, as a perfect benign model produces similar outputs for similar inputs and, thus, it always produces different conditional probabilities from the backdoored model of the outputs given those trigger-carrying inputs. In practice, however, the benign models may not be perfect (highly robust), which may produce very different outputs even for similar inputs, e.g., the adversarial samples [71], making it possible to reduce κ and S simultaneously. Together with the trick to make trigger be easy to learn, we the TSA attack which details are illustrated in Algorithm 1. (12) which searches for the trigger function A that maps the benign inputs to the trigger-carrying inputs close to the classification boundary in fP while penalizing those functions A with LA (C) > ζ by incorporating the penalty term with the weight ω. Finally, in line-7, we use the refined trigger function A to poison the training data, which is then used to train a backdoored model fb by minimizing LA,B ,t ( fb , α∗ , β) and a regularization term: LA,B ,t ( f , α∗ , β) + kgb (xC ) − gP (xC )k2 (13) where xC = arg max kgb (x) − gP (x)k2 . x∈X /A(B ) Here, the regularization term is designed to seek fb that minimizes the maximum difference between the outputs of fb and fP for the inputs without the trigger. Empirically, we used a LeNet-5 [40] network as C, and an UNet [62] as the trigger function A. Besides, we set epochad j = 3, δ = 0.1, ζ = 0.1 and ω = 0.1. We used an Adam [36] optimizer with the learning rate of 1e−3 to train model weights. We implemented our method based on the PyTorch framework and integrated our code into TrojanZoo [57]. In our experiments, we used Algorithm 1 to generate backdoored models on the CIFAR10 dataset and demonstrate the results in the last row of Table 1. We observe that the TSA backdoor not only achieved much better ASR (79.07%) than previous attacks (≤ 50%) even when β is as small as 0.005, but also smaller backdoor distance then other attacks at the meanwhile. This could be ascribed to several advantages of our approach. First, the trigger function refinement (line-3 to -6) helps to derive a trigger function that is easy to learn. Second, the L p -norm constraint lets the TSA backdoor has small κ, which reduces the lower-bound of the backdoor distance (Corollary 6), and thus allows for the reduction of the backdoor distance by manipulating S (Eq. 8). 
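The core objectives behind Algorithm 1 (Eqs. (10) to (12), stated with the algorithm description nearby) can be sketched as follows. The snippet assumes g returns class probabilities (f = c ∘ g), that the trigger network A and the auxiliary classifier C (with a sigmoid output) are differentiable PyTorch modules, and that the δ-ball constraint on ||A(x) − x||_2 is enforced elsewhere, e.g. by projection. It is a sketch of the losses under those assumptions, not the authors' TrojanZoo implementation; for brevity the same batch serves as both the clean batch and the backdoor-region batch.

```python
import torch
import torch.nn.functional as F

def tsa_loss(g, A, x_clean, y_clean, x_bd, y_bd, t, alpha_star, beta):
    """Eq. (10): primary cross-entropy on clean data minus a beta-weighted backdoor term
    that pushes g(A(x)) toward the target class t (weight (1+a*)/2) while keeping some
    probability mass on the true label y (weight (1-a*)/2)."""
    eps = 1e-12
    primary = F.nll_loss(torch.log(g(x_clean) + eps), y_clean)       # E[L_ce(f(x), y)]
    probs_trig = g(A(x_bd))                                          # g(A(x)), x in B
    log_pt = torch.log(probs_trig[:, t] + eps)
    log_py = torch.log(probs_trig.gather(1, y_bd.unsqueeze(1)).squeeze(1) + eps)
    backdoor = 0.5 * (1 + alpha_star) * log_pt + 0.5 * (1 - alpha_star) * log_py
    return primary - beta * backdoor.mean()

def L_A(C, A, x_bd):
    """Eq. (11): how well C separates trigger-carrying inputs A(x) from benign x."""
    eps = 1e-12
    return -(torch.log(C(A(x_bd)) + eps).mean() + torch.log(1 - C(x_bd) + eps).mean())

def refine_trigger(A, C, g_P, x_bd, y_bd, t, alpha_star, beta,
                   zeta=0.1, omega=0.1, epochs_adj=3, lr=1e-3):
    """Lines 3-6 of Algorithm 1: alternately train C with Eq. (11), then update A with
    Eq. (12), which keeps A(x) near the benign model's classification boundary
    (via -alpha_star) while penalizing trigger functions that are hard to learn."""
    opt_C = torch.optim.Adam(C.parameters(), lr=lr)
    opt_A = torch.optim.Adam(A.parameters(), lr=lr)
    for _ in range(epochs_adj):
        loss_C = L_A(C, A, x_bd)                                 # line 4: train C
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()

        penalty = torch.clamp(L_A(C, A, x_bd) - zeta, min=0.0)   # max{L_A(C) - zeta, 0}
        loss_A = tsa_loss(g_P, A, x_bd, y_bd, x_bd, y_bd, t, -alpha_star, beta) \
                 + omega * penalty                               # line 5: Eq. (12)
        opt_A.zero_grad(); loss_A.backward(); opt_A.step()       # only A is updated here
```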
Furthermore, the TSA backdoor attack manages to control the backdoor distance through manipulating S (line-7) for a given α∗ , which enables the TSA backdoor to achieve small backdoor distance on all β values. In simple words, TSA backdoor maps the benign inputs to the trigger-carrying inputs close to both the classification boundary (controlled by α∗ ) and the original benign inputs (controlled by δ) through a easy-to-learn trigger. Algorithm 1 TSA attack. Input: Dtr , B , t, α∗ , β, epochad j , δ, ζ, ω Output: A(·), fb 1: Train a benign model f P on Dtr 2: Train A with LA,B ,t ( f P , α∗ , β) (Eq. 10) and δ constraint 3: for _ in range(epochad j ) do 4: Train C with LA (C) (Eq. 11) 5: Update A with LC (A, ζ, ω) (Eq. 12) 6: end for 7: Train f b on Dtr to minimize Eq. 13 First, in line-1, we train a benign model fP on a given training set Dtr . In line-2, for the benign model fP , we optimize trigger function A to minimize LA,B ,t ( fP , −α∗ , β) such that kA(x) − xk2 ≤ δ, where LA,B ,t ( f , α∗ , β) = E(x,y)∈X ×Y Lce ( f (x), y)− ∗ ∗ 1−α βE(x,y)∈B ×Y ( 1+α 2 log(g(A(x))t ) + 2 log(g(A(x))y )). (10) Here, we assume f = c ◦ g as described in Section 2.1. The loss function LA,B ,t is the sum of the loss for the primary task of f on the clean inputs and the loss for the backdoor task of f on the trigger-carrying inputs weighted by β. The initial trigger function A is an optimized variable, which is trained to minimize LA,B ,t ( fP , −α∗ , β) while satisfying the δ constraint, such that this initial trigger function maps the benign inputs to the trigger-carrying inputs in the region of the same class label but close to the classification boundary in fP . Then in line-3 to -6, we iteratively refine the trigger function and to make the backdoor task more easily learned. Specifically, we train a small classification network C to distinguish the triggercarrying inputs from their benign counterparts by minimizing the loss function: LA (C) = −Ex∈B log(C(A(x))) + log(1 −C(x)). (11) 4 TSA on Backdoor Detection In the last section, we show that current backdoor attacks are designed for evading specific backdoor detection methods, and do not effectively reduce the backdoor distance that measures how close the backdoor and the primary tasks of a backdoored model are. We further proposed a new TSA attack to strategically reduce the backdoor distance and create more stealthy backdoors. In this section, we demonstrate that the backdoor distance is closely related to the backdoor detectability: the backdoors with small backdoor distance are hard to detect, through both theoretical and experimental analysis. The poor performance of C (i.e., LA (C) > ζ) indicates that the current trigger function A is hard to learn, and then A is refined to minimize the loss function (line-5): 8 First, through theoretical analysis, we demonstrate how the backdoor distance affects the evasiveness of a backdoor from detection methods in each of the three classes (Section 2.2): detection on model outputs, detection on model weights and detection on model inputs, respectively. Next, we show that in practice, by reducing the backdoor distance, the detectability of a backdoor indeed becomes lower. 4.1 small, the difference of the outputs between a backdoored model and a benign model becomes small as well. Considering the randomness involved in the training process, when this difference is small, it is hard to distinguish backdoored models from benign models. 
Therefore, these approaches of detection on model outputs often suffer from false positives, and thus achieve low detection accuracy on the backdoors with small backdoor distance. Detection on Model Outputs 4.2 The first class of backdoor detection methods, herein referred to as the detection on model outputs, attempt to capture backdoored models by detecting the difference between the outputs of the backdoored models and the benign models on some inputs. One kind of methods in this class are those methods based on trigger reversion algorithm, which first reconstruct triggers and then check whether a model exhibits backdoor behaviors in response to these triggers. In other words, the objective of these methods is to identify some inputs on which the outputs of the backdoored models and of the benign models are different. When the difference becomes small, however, these methods often become less effective. For example, K-ARM [77] failed to detect the TSA backdoor (Section 4.4). Notably, the backdoored model generated by the TSA attack produce the similar outputs as the benign models on the trigger-carrying inputs, and consequently, K-ARM cannot distinguish these two types of models based on the reconstructed trigger candidates, even though the L p -norm of those triggers injected by the TSA attack is as small as desired by K-ARM (K-ARM is designed for detecting triggers smaller than a given maximum size). MNTD [83] is another method in this class, which searches for the inputs on which the backdoored models and the benign models generate the most different outputs. Formally, we consider the goal of detections in this class is to check whether a trigger function A(·) can be found that maps the inputs to a region A(B ), where the outputs of the backdoored model fb is most different from the outputs of a benign model fP (e.g., gb (A(x))t ≫ gP (A(x))t for the target label t). This goal becomes hard to achieve when the backdoor distance is small, as demonstrated in the following lemma. Detection on Model Weights The second class of detection approaches, herein referred to as the detection on model weights, attempt to detect a backdoored model through distinguishing its model weights from those of benign models. Formally, we consider the goal of detection methods in this class as to verify whether the minimum distance between the weights of a candidate backdoored model ωb and the weights of a benign model in a set {ωP } exceeds a pre-determined threshold θω , i.e., whether minω∈{ωP } kω − ωb k2 > θω . To study the difference between the weights of two models, we formulate it as the weight evolution problem in continual learning [73]. Specifically, we consider two tasks, TP and TA,B ,t , for which the benign model fP = f (· : ωP ) with the weights ωP and the backdoored model fb = f (· : ωb ) with the weights ωb learn to accomplish, respectively. We then analyze the change of ωP → ωb through the continual learning process TP → TA,B ,t . Based on the Neural Tangent Kernel (NTK) [33] theory, existing work [41] has showed that, fb (x) = fP (x)+ < φ(x), ωb − ωP > where φ(x) is the kernel function and φ(x) = ▽ω0 f (x; ω0 ), which is dependent only on some weights ω0 . Furthermore, recent research [18] has shown that kδTP →TA,B ,t (X)k22 = kφ(X)(ωb −ωP )k22 , where δTP →TA,B ,t (X) is the so-called task drift from TP to TA,B ,t , kδTP →TA,B ,t (X)k22 := Σ k fb (x) − fP (x)k22 . Based on these rex∈X sults, we connect the distance between ωP and ωb to the backdoor distance through the following lemma. 
Lemma 8. When A is fixed and β = κ1 , for a well-trained backdoored model fb = f (· : ωb ), and a well-trained benign model fP = f (· : ωP ), we have √ κ mL kφ(X)k2 α Lemma 7. When A is fixed and β = κ1 , for a well-trained backdoored model fb = cb ◦ gb and a well-trained benign model fP = cP ◦ gP , s.t., gb (x)y ≈ PrA,B ,t (y|x) and gP (x)y ≈ Pr(y|x) for all (x, y) ∈ X × Y , we have EPrA,B ,t (A(x)|x∈B ) gb (A(x))t − EPr(A(x)|x∈B ) gP (A(x))t ≤ ακ ≤ kωb − ωP k2 . where X = {x1 , x2 , ..., xm } is a set of m inputs in L classes and φ(·) is the kernel function. Proof. See Appendix 10.6. Proof. Using Corollary 5, one can derive the desired. Specifically, Lemma 7 demonstrates that, when the adversary has chosen a trigger function A and set β = κ1 , which minimizes the backdoor distance in reasonable settings, α is proportional to the upper bound of the difference between the expected outputs of a backdoored model and that of a benign model about the probability of a trigger-carrying input in the target class. In other words, when the backdoor distance is Lemma 8 demonstrates that, when the adversary has chosen a trigger function A and set β = κ1 , which minimizes the backdoor distance in reasonable settings, α is proportional to the lower bound of the distance between the weights ωb and ωP in term of L2 -norm. In other words, to ensure the weights of the backdoor models ωb is close to the weights of the benign models ωP , which lead to the backdoors more difficult to be detected by the methods of detection on model 9 weights, the adversary should design a backdoor with small backdoor distance. 4.3 TND [79], SCAn [72] and AC [11], to defend the backdoors injected by 6 backdoor attack methods (Section 3.4) on 4 datasets: CIFAR10 [39], GTSRB [30], ImageNet [15] and VGGFace2 [10]. On each dataset, we generated 200 benign models as the control group. For each backdoor attack, we used it to generate 200 backdoored models on every dataset. Specifically, in each of the backdoored models, a backdoor was injected with a randomly chosen source class and a randomly chosen target class (different from the source class). We fixed the number of poisoning samples to be equal to 10% of the total number of training samples in the source class, i.e., β = 0.1. Under these settings, we trained the backdoored model that achieved > 80% ASR for all 6 attack methods on all 4 datasets. In total, we generated 800 benign models and 4800 backdoored models on all 4 datasets. To evaluate a backdoor detection method on each dataset, we ran it to distinguish 200 benign models (trained on this dataset) from 200 backdoored models generated by each attack method. Overall, we performed a total of 144 (= 4 × 6 × 6) evaluations on all 4 datasets for all 6 detections against all 6 attacks. To train a model (benign or backdoored), we used the model structure randomly selected from these four: ResNet [28], VGG16 [69], ShuffleNet [87] and googlenet [70]. We used the Adam [36] optimizer with the learning rate of 1e−2 until the model converges (e.g., ∼ 50 epochs on CIFAR10). Detection on Model Inputs The third class of the detection methods, the detection on model inputs, attempt to identify a backdoored model through detecting the difference between inputs on that the backdoored model and benign models generate similar outputs. 
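The detection-on-model-weights criterion formalized in Section 4.2, min over ω in {ω_P} of ||ω − ω_b||_2 > θ_ω, amounts to flattening and comparing parameter vectors. A minimal sketch, assuming all models share one architecture so their parameter vectors are directly comparable (function name and flattening scheme are ours):

```python
import torch

def weight_distance_alarm(candidate, benign_models, theta):
    """Alarm when min_{w in {w_P}} ||w - w_b||_2 exceeds the threshold theta.
    candidate and benign_models are nn.Module instances of the same architecture."""
    def flat(model):
        return torch.cat([p.detach().flatten() for p in model.parameters()])
    w_b = flat(candidate)
    d_min = min(float((flat(m) - w_b).norm()) for m in benign_models)
    return d_min > theta, d_min
```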
An prominent example of this category is SCAn [72], which checks whether the inputs predicted by a backdoored model as belonging to the same class can be well separated into two groups (modeled as two distinct distributions), while the inputs predicted by a benign model as belonging to the same class come from a single group (modeled as a single distribution). Similar idea was also exploited by AC [11]. We formulate this class of methods as a hypothesis test that evaluates whether the two distributions, characterized by two sets XP and Xb , respectively, on which the benign model fP and the backdoored model fb share the same prediction, are significantly different, where XP = { fP′ (xi ) : i = 1, 2, ..., nP } and Xb = { fb′ (xi ) : i = 1, 2, ..., nb }, fP′ (x) and fb′ (x) are the intermediate results of fP (x) and fb (x), respectively, for an input x. For instance, fb′ (x) could be the j-th layer’s outputs of fb in a multi-layer neural network. Without loss of the generality, we adopt a two-sample Hotelling’s T-square test [29] for this hypothesis test, which tests whether the means of two distributions are significantly different. Here, we consider the test statistic T 2 , which is calculated from the samples drawn from the two distributions, and is then compared with a pre-selected threshold according to a desirable confidence. The smaller T 2 , the less probable these two distributions are different in terms of their means. The following lemma demonstrates how the test statistic T 2 has an upper-bound related to the backdoor distance. Detection on model outputs. We tested two representative detection methods in this category: K-ARM and MNTD. KARM is one of the winning solutions in TrojAI Competition [2]. It could be viewed as an enhanced version of Neural Cleanse (NC). It cooperates with a reinforcement learning algorithm to efficiently explore many trigger candidates with different norm and different shape using the trigger reversion algorithm (as used in NC), and thus increases the chance to identify the true trigger. As mentioned by the authors of KARM [68], it significantly outperforms NC. Hence, here, we evaluated K-ARM instead of NC. MNTD is another representative method in this category. It has been taken as the standard detection method in the Trojan Detection Challenge (TDC) [3], a NeurIPS 2022 competition. Specifically, MNTD detects the backdoored models by finding some inputs on which the outputs of the backdoored model are most different from the outputs of the benign models. Table 2 illustrates that K-ARM works poorly on SIG, CB and TSA, three backdoor attacks using widespread triggers that may affect the whole inputs (even with small L2 -norm). MNTD performs well on the backdoor attacks except on TSA, indicating existing attack methods somehow make the outputs of backdoored models are distant from the outputs of the benign models on many inputs. On the other hand, the outputs of the backdoored model generated by TSA are close to the outputs of benign models. This also helps TSA perform well on TDC competition.1 Lemma 9. When A is fixed and β = κ1 , for a well-trained backdoored model fb and a well-trained benign model fP , if Xb ∼ N (mb , Σ) and XP ∼ N (mP , Σ) and nP and nb are sufficiently large, we have nb 2 α . T 2 ≤ λmax nnPP+n b where λmax is the largest eigenvalue of Σ−1 . Proof. See Appendix 10.7. 
Lemma 9 demonstrates that, when the adversary has chosen a trigger function A and set β = κ1 , which minimizes the backdoor distance in reasonable settings, α2 is proportional to the upper bound of the test statistic T 2 . This implies, when the backdoor distance is small, it is difficult to distinguish the distribution of Xb from the distribution of XP , resulting in the poor accuracy of detecting backdoor on model inputs. 4.4 Experiments: Detection vs. Attack To investigate the performance of these three kinds of detection methods against backdoor attacks, we evaluated 6 backdoor detection methods: K-ARM [68], MNTD [83], ABS [48], 1 On 10 TDC, the TSA attack reduced the detection AUC of MNTD to Table 2: The accuracies (%) of the detection-on-modeloutputs methods. C-rows stand for results on CIFAR10, Grows stand for results on GTSRB, I-rows stand for results on ImageNet and V -rows stand for results on VGGFace2. C G I V C G I V K-ARM MNTD BadNet 100 100 95.50 96.25 100 100 97.75 98.75 SIG 61.75 63.25 56.50 59.25 99.75 99.25 98.75 99.00 WB 79.50 82.25 75.00 76.50 86.00 85.50 84.25 85.25 CB 57.25 60.50 53.75 67.25 100 99.50 97.25 98.25 IAB 80.25 79.75 75.00 80.75 98.25 99.50 97.25 98.75 when defending against WB and CB, indicating the similarity between the untargeted universal perturbation and targeted per-input perturbations is a more general signal of the short cut comparing to a single dominant neuron exploited by ABS. Detection on model inputs. We tested two representative detection methods in this category: SCAn and AC. SCAn detects the backdoor by checking whether the representations (outputs of the penultimate layer) of inputs in a single class are from a mixture of two distributions, with the help of the so-called global variance matrix that captures how the representations of the inputs in different classes varies. SCAn first computes the global variance matrix on a clean dataset, then computes a score for each class, and finally checks whether any class has a abnormally high score. If such class exists, SCAn will report this model as the backdoored model and this abnormal class as the target class. Similarly, AC detects the backdoor by checking whether the representations of one class can be well separated into two groups. Specifically, for each class, AC first embeds the high-dimensional representations into 10-dimensional vectors and then computes the Sihouette score [63] to measures how well the 2-means algorithm can separate these vectors. TSA 59.25 62.50 57.25 64.75 51.25 52.50 53.25 54.75 Detection on model weights. We tested two representative detection methods in this category: ABS and TND. ABS proposed that when the backdoor is injected into a model, it also introduces a short cut, through which a trigger-carrying input will be easily predicted as belonging to the target class by the backdoored model. Specifically, this short cut is characterized by some neurons that are intensively activated by the trigger on the input, and then generate dramatic impact on the prediction. To detect this short cut, ABS first labels those neurons whose activation results in abnormally large change in the predicted label of the backdoored model for some inputs, and then, for each labeled neuron, seeks a trigger that can activate this neuron abnormally and consistently change the predicted label for a range of inputs. ABS alarms for a backdoored model, if such neuron coupled with a trigger is found. TND explores another phenomenon related to the short cut in a model. 
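Two of the input-space checks discussed above can be sketched compactly: the two-sample Hotelling's T-square statistic used to formalize detection on model inputs (Section 4.3), and an AC-style per-class score that embeds penultimate-layer representations into 10 dimensions, splits them with 2-means, and measures the split with the silhouette score. FastICA is used here as one plausible embedding choice and a pseudo-inverse guards against a singular pooled covariance; neither detail is claimed to match the original implementations.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def hotelling_T2(Xp, Xb):
    """Two-sample Hotelling's T^2 on intermediate representations:
    T^2 = (n_P n_b / (n_P + n_b)) * (m_b - m_P)^T S_pooled^+ (m_b - m_P)."""
    Xp, Xb = np.asarray(Xp), np.asarray(Xb)
    nP, nb = len(Xp), len(Xb)
    S = ((nP - 1) * np.cov(Xp, rowvar=False)
         + (nb - 1) * np.cov(Xb, rowvar=False)) / (nP + nb - 2)
    diff = Xb.mean(axis=0) - Xp.mean(axis=0)
    return (nP * nb / (nP + nb)) * diff @ np.linalg.pinv(S) @ diff

def ac_style_score(reps, dim=10, seed=0):
    """AC-style check for one class: embed the representations into `dim` dimensions,
    split with 2-means, and return the silhouette score; a high score suggests the
    class mixes two groups (benign and trigger-carrying)."""
    emb = FastICA(n_components=dim, random_state=seed).fit_transform(np.asarray(reps))
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(emb)
    return silhouette_score(emb, labels)
```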
In particular, TND found that the untargeted universal perturbation is similar to the targeted per-input perturbation in backdoored models, while they are different in benign models. Hence, TND alarms for a backdoored model if such similarity is significant. Table 4: The accuracies (%) of the detection-on-modeloutputs methods. C-rows stand for results on CIFAR10, Grows stand for results on GTSRB, I-rows stand for results on ImageNet and V -rows stand for results on VGGFace2. SCAn AC Table 3: The accuracy (%) of the detection-on-model-weights methods. C-rows stand for results on CIFAR10, G-rows stand for results on GTSRB, I-rows stand for results on ImageNet and V -rows stand for results on VGGFace2. ABS TND C G I V C G I V BadNet 100 100 94.75 98.25 100 100 94.50 96.00 SIG 95.50 94.50 89.75 91.25 99.75 99.25 93.25 92.75 WB 59.75 61.00 56.25 56.25 67.00 64.25 62.00 63.50 CB 62.75 61.75 58.00 59.25 73.75 72.50 69.25 71.00 IAB 58.75 59.00 55.50 54.75 53.00 52.50 49.75 50.25 C G I V C G I V BadNet 100 100 94.25 95.75 98.00 98.50 91.75 92.25 SIG 100 100 91.25 92.00 99.00 99.25 95.50 96.25 WB 70.25 69.00 62.75 66.25 59.75 59.25 55.75 57.25 CB 95.25 97.00 88.00 89.50 90.00 91.50 86.25 88.00 IAB 74.25 74.75 67.75 69.50 65.75 66.50 59.75 62.50 TSA 63.25 61.75 59.00 60.25 55.25 55.25 52.75 53.50 Table 4 demonstrates that SCAn achieved better accuracies against all 6 attacks compared to AC. However, SCAn and AC both performed poorly on WB, IAB and TSA, the three attacks that attempt to mix the representations of trigger-carrying inputs with those of benign inputs. Taking all these results together, we concluded that an attack would exhibit different evasiveness against different detection methods. Even for TSA, although the detection accuracy by 4 out of 6 detections are as low as about 52%, two other methods (K-ARM and SCAn) retain about 60% accuracy against it. This illustrates the demand of a general measurement to depict how well a backdoor attack can evade different detection methods (including novel methods that are not known by the adversary), as in practice, the defender may adopt a cocktail approach by combining different methods to detect backdoors. We believe the backdoor distance is a promising candidate for such a measurement as it accurately showed the low detection accuracy on the TSA and WB backdoored models by all detection methods, with their low TSA 51.00 49.25 51.75 52.25 48.75 50.25 51.50 51.75 Table 3 illustrates that ABS suffers from difficulties when defending against WB, CB, IAB and TSA, perhaps because these attacks influence many neurons in the victim models and thus no single neuron changes the predicted label by itself. On the other hand, TND performs better than ABS 44.37%, indicating it is hard for MNTD to distinguish the TSA backdoored models from the benign ones. Until submission of this paper, our method is ranked #1 in the evasive trojans track of TDC. 11 backdoor distances (Table 1) compered to the other 4 attack methods. Below, we aim to further illustrate their connection. 4.5 door detectability, when the α/β be close to 1, be about 0.64 (WB) or be around 0.37 (TSA). To evaluate this relationship at more various backdoor distances, we used the TSA backdoor attack method (Algorithm 1) with β = 0.1 to generate backdoors with different backdoor distances by adjusting the parameter α∗ . Specifically, we performed this experiment on CIFAR10 with 9 different α∗ values ranging from 0.1 to 0.9. 
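For concreteness, the sketch below outlines how such an α∗ sweep could be organized. The callables train_tsa_backdoored_model and estimate_backdoor_distance are hypothetical stand-ins for Algorithm 1 and the distance-estimation procedure of Section 3.3, not the released implementation; the sweep itself (9 values of α∗, β fixed to 0.1) follows the setup described here.

```python
import numpy as np

ALPHA_STARS = np.arange(0.1, 1.0, 0.1)   # the 9 alpha* values used in this experiment
BETA = 0.1
MODELS_PER_SETTING = 200

def sweep(train_tsa_backdoored_model, estimate_backdoor_distance, clean_trainset):
    """For each alpha*, train a pool of backdoored models and record their
    estimated backdoor distances; both callables are assumed to be supplied."""
    results = {}
    for alpha_star in ALPHA_STARS:
        models = [train_tsa_backdoored_model(clean_trainset,
                                             alpha_star=float(alpha_star), beta=BETA)
                  for _ in range(MODELS_PER_SETTING)]
        distances = [estimate_backdoor_distance(m) for m in models]
        results[round(float(alpha_star), 1)] = (models, float(np.mean(distances)))
    return results
```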
For each α∗, we generated 200 backdoored models and combined them with the previously generated benign models for CIFAR10 to build a testing dataset containing 400 models (200 backdoored and 200 benign). On each testing dataset, we applied the SCAn and K-ARM detection methods, the two comparatively effective detection methods against the TSA backdoor attack (Section 4.4), to distinguish the TSA backdoored models from the benign models. Figure 2 shows these detection results and the backdoor distances we estimated for each α∗ value. Experiments: Detectability vs. Similarity Our experiments in Section 4.4 indicate that the backdoor distance is a potentially good measurement of the backdoor detectability (as defined below). Specifically, backdoor attacks that obtain a small backdoor distance are hard to detect, which is in line with our theoretical analysis (Sections 4.1, 4.2 and 4.3). In this section, we report experimental results showing that backdoors with small backdoor distance indeed have low detectability, and thus that the backdoor distance is a good indicator of the backdoor detectability. Definition 7 (Backdoor detectability). The detectability of the backdoor generated by a backdoor attack method is the maximum accuracy that backdoor detection methods can achieve in distinguishing the backdoored model from benign models. For convenience, we normalize the detectability to the range between 0 and 1, i.e., γ = |acc − 0.5| × 2, where γ is the detectability and acc is the maximum accuracy. Table 5: Detectability for attacks. C-rows stand for results on CIFAR10, G-rows for GTSRB, I-rows for ImageNet and V-rows for VGGFace2. The “Det” columns represent the backdoor detectability. The “α/β” columns depict the backdoor distance (Corollary 5). Here, we keep β = 0.1 for all cells. C G I V BadNet Det α/β 1.00 0.98 1.00 0.96 0.96 0.92 0.98 0.95 SIG Det α/β 1.00 0.99 1.00 1.00 0.98 0.98 0.98 0.99 WB Det α/β 0.72 0.67 0.71 0.61 0.69 0.66 0.71 0.65 CB Det α/β 1.00 1.00 0.99 1.00 0.96 0.99 0.97 1.00 IAB Det α/β 0.97 1.00 0.99 1.00 1.00 0.99 0.98 1.00 Figure 2: Backdoor distance and detectability of backdoors generated using different α∗ values. From Figure 2, we observe that the backdoor detectability (blue line) of TSA increases along with the increasing backdoor distance (red line, characterized by α/β). Notably, a small gap exists between these two lines, for two reasons: 1) imprecise estimation of the backdoor distance; and 2) the absence of an effective detection method against TSA backdoors. Specifically, when α∗ ≥ 0.8, the backdoor distance is lower than the backdoor detectability, illustrating the first reason. When α∗ ≤ 0.5, the backdoor distance is consistently higher than the backdoor detectability, demonstrating the room for better detection methods (the second reason). Based upon these results, we conclude that the backdoor distance is a good indicator of the backdoor detectability up to small deviations, which is further illustrated by the following observation: the Pearson correlation coefficient between the backdoor distances and detectabilities shown in Figure 2 is 0.9795, while the mean of the absolute difference between them is 0.0800 with a standard deviation of 0.0453.
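The computation behind these statistics is straightforward; a small sketch is given below. The two arrays are hypothetical placeholders for the per-α∗ best detection accuracies and estimated α/β values, not the data plotted in Figure 2.

```python
import numpy as np

def detectability(best_detection_acc):
    # Definition 7: rescale the best detection accuracy from [0.5, 1] to [0, 1].
    return np.abs(best_detection_acc - 0.5) * 2

# Hypothetical per-alpha* results for illustration only.
best_acc        = np.array([0.55, 0.58, 0.63, 0.70, 0.78, 0.85, 0.91, 0.95, 0.98])
alpha_over_beta = np.array([0.12, 0.20, 0.29, 0.38, 0.50, 0.63, 0.75, 0.86, 0.95])

gamma = detectability(best_acc)                 # backdoor detectability per setting
r = np.corrcoef(alpha_over_beta, gamma)[0, 1]   # Pearson correlation coefficient
gap = np.abs(alpha_over_beta - gamma)
print(f"pearson r = {r:.4f}, mean |gap| = {gap.mean():.4f}, std = {gap.std():.4f}")
```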
Ours Det α/β 0.27 0.37 0.25 0.41 0.18 0.38 0.30 0.35 To evaluate the relationship between the backdoor detectability and the backdoor distance for each attack method, we summarized the maximum detection accuracy obtained among 6 detection methods (Section 4.4) and calculated the detectability according to the above definition. Also we approximated the backdoor distance of these 6 attacks on 4 datasets using our approximation method (Section 3.3) with the help of StyleGAN2 models [34]. Specifically, we used the officially pretrained StyleGAN2 models for datasets CIFAR10 and ImageNet, trained a StyleGAN2 model with its original code for GTSRB dataset, and trained a StyleGAN2 model with code [1] available online for VGGFace2 dataset. Our results are illustrated in Table 5. From Table 5, we observe that the backdoor detectability is roughly equal to the backdoor distance (depicted by α/β). Digitally, the Pearson correlation coefficient [42] between them is 0.9777, the mean value of the absolute difference between them is 0.0450 and the standard deviation of that is 0.0498. These numbers demonstrate that the backdoor distance is highly correlated to and a good indicator of the back- 5 Mitigation A simple defense to the backdoors with small backdoor distance could be just discarding those uncertain predictions while retaining only those confident predictions. However, doing this will obviously decrease the model accuracy on benign inputs. For example, on MNIST dataset, if keeping only those predictions with confidence higher than 0.8 and labeling the rest as “unknown”, the accuracy of a benign model 12 will decrease from 99.35% to 98.61%, this is far below the accuracy of a benign model could get2 , considering the mean accuracy among 200 benign models is 99.25% with the standard deviation of 0.00057. When the primary task becomes more hard (e.g., ImageNet), the accuracy reduction will be more serious if this simple defense be applied. Besides, backdoor unlearning methods and backdoor disabling methods might have potential to relieve the threat from backdoors with small backdoor distance. However, as demonstrated in our TSA on them (Appendix 11 & 12), they exhibit minor efficacy on these backdoors. Inspired by our backdoor distance theorems, a detection that considers both the difference exhibited in the inputs and in the outputs between the backdoored model and benign models would effectively reduce the evasiveness of those backdoors with small backdoor distance, which is a promising direction to develop powerful detections in futures. 6 IMC only reduced the difference between outputs on benign inputs (i.e., maintained the accuracy of backdoored model on benign inputs). And, there are two minor differences between them: a) TSA makes the trigger be easy to learn while IMC did not; b) TSA only slightly changes the classification boundary, however, IMC iteratively pushed the classification boundary deviate from its original position to seek a small trigger. Also, we established experiments to compare TSA with IMC on CIFAR10 dataset (see Appendix 13 for details), in which TSA exhibited lower detectability than IMC. 7 In this work, we only studied backdoor tasks where β ≥ κ1 , i.e., the adversary has not reduced the probability of drawing a trigger-carrying input from the backdoor distribution be lower than the probability of drawing it from the primary distribution. 
However, as demonstrated in Table 1, when β < 1/κ, TSA still achieved an acceptable ASR (ASR = 79.07% when β = 0.005 < 0.046 = 1/κ), illustrating the need to extend our theorem to the β < 1/κ setting. Using methods similar to those applied in the β ≥ 1/κ setting, one can show that the minimal backdoor distance is still obtained at β = 1/κ even when β < 1/κ, which is in line with the conclusion drawn for the β ≥ 1/κ setting (Corollaries 5 & 6) and does not conflict with the results shown in Table 1. In Section 5, we have only taken the first step of using our backdoor distance theorems to understand backdoor unlearning and backdoor disabling methods; comprehensive studies are needed in the future. Our Theorem 2 reveals that the fundamental difference between a backdoored model and a benign model comes from the difference between their joint probabilities over trigger-carrying inputs and the outputs (i.e., A(B) × Y). This implies that a good backdoor detection method should simultaneously consider the differences in the outputs and in the inputs between backdoored models and benign models, rather than considering only one of these two differences, as current detection methods do. This points out a potential direction for future studies on backdoor detection. Related Works We proposed theorems to study the detectability of backdoors, which, in general, has also been studied in previous work [23]. That work proposed an approach to plant undetectable backdoors into a random feature network [60], a kind of neural network that learns only the weights on random features. Compared with classical deep neural networks, random feature networks have limited capability [85]: they cannot be used to learn even a single ReLU neuron unless the network size is exponentially larger than the dimension of the inputs. In theory, work [23] reduced the problem of detecting their backdoor to solving a Continuous Learning With Errors problem [9], which is as hard as finding approximately short vectors on arbitrary integer lattices; thus detecting their backdoor is computationally infeasible in practice. Compared with work [23], our work established theorems about the detectability of backdoors injected into classical deep neural networks, and demonstrated that, in this case, backdoor detectability is characterized by the backdoor distance, which is further controlled by three parameters: κ, β and S. Based upon our theorems, we proposed an attack, the TSA backdoor attack, to inject stealthy backdoors. Compared to existing stealthy backdoor attacks [24] [6] [16] [46] [55], the TSA backdoor attack achieves lower backdoor detectability under current backdoor detections [68] [83] [48] [79] [72] [11] (demonstrated in Section 4.4) and has theoretical guarantees under unknown detections (illustrated in Sections 4.1, 4.2 and 4.3). Our TSA backdoor attack exploits adversarial perturbations as the trigger, which has also been exploited by the IMC backdoor attack [56]. There are three main differences between the TSA and IMC backdoor attacks: 1) TSA has theoretical guarantees on backdoor detectability while IMC has not; 2) TSA reduces S (defined in Eq. 8), which IMC does not consider; 3) TSA reduces the difference between the outputs of the backdoored model and the benign model over the whole input space (Eq.
13), however, 2 11 Limitations and Future Works 8 Conclusion We established theorems about the backdoor distance (similarity) and used them to investigate the stealthiness of current backdoors, revealing that they have taken only some of factors affecting the backdoor distance into the consideration. Thus, we proposed a new approach, TSA attack, which simultaneously optimizes those factors under the given constraint of backdoor distance. Through theoretical analysis and extensive experiments, we demonstrated that the backdoors with smaller backdoor distance were in general harder to be detected by existing backdoor defense methods. Furthermore, comparing with existing backdoor attacks, the TSA attack generates backdoors that exhibited smaller backdoor distances, and thus lower detectability under current backdoor detections. times of standard deviation below the mean accuracy 13 Huang, José Hernández-Orallo, and Mauricio CastilloEffen, editors, Workshop on Artificial Intelligence Safety 2019 co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, volume 2301 of CEUR Workshop Proceedings. CEUR-WS.org, 2019. References [1] Stylegan2-based face frontalization model. https://github.com/ch10tang/stylegan2-b ased-face-frontalization. [2] Trojai competition. https://pages.nist.gov/tro jai/. [12] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. CoRR, abs/1712.05526, 2017. [3] Trojan detection challenge. https://trojandetect ion.ai/. [4] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 2017. [13] Edward Chou, Florian Tramèr, and Giancarlo Pellegrino. Sentinet: Detecting localized universal attacks against deep learning systems. In 2020 IEEE Security and Privacy Workshops, SP Workshops, San Francisco, CA, USA, May 21, 2020, pages 48–54. IEEE, 2020. [14] Joseph Clements and Yingjie Lao. Backdoor attacks on neural network operations. In 2018 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2018, Anaheim, CA, USA, November 26-29, 2018, pages 1154–1158. IEEE, 2018. [5] Eugene Bagdasaryan and Vitaly Shmatikov. Blind backdoors in deep learning models. In Michael Bailey and Rachel Greenstadt, editors, 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 1505–1521. USENIX Association, 2021. [15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. [6] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in CNNS by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing, ICIP 2019, Taipei, Taiwan, September 22-25, 2019, pages 101–105. IEEE, 2019. [16] Khoa Doan, Yingjie Lao, and Ping Li. Backdoor attack with imperceptible input and latent modification. Advances in Neural Information Processing Systems, 34, 2021. [7] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 
A theory of learning from different domains. Mach. Learn., 79(1-2):151–175, 2010. [17] Khoa Doan, Yingjie Lao, Weijie Zhao, and Ping Li. Lira: Learnable, imperceptible and robust backdoor attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11966–11976, 2021. [8] Douglas S Bridges et al. Foundations of real and abstract analysis. Number 146. Springer Science & Business Media, 1998. [18] Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. In Arindam Banerjee and Kenji Fukumizu, editors, The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pages 1072–1080. PMLR, 2021. [9] Joan Bruna, Oded Regev, Min Jae Song, and Yi Tang. Continuous lwe. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 694–707, 2021. [10] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018. [19] DC Dowson and BV666017 Landau. The fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455, 1982. [11] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian M. Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In Huáscar Espinoza, Seán Ó hÉigeartaigh, Xiaowei [20] Jacob Dumford and Walter J. Scheirer. Backdooring convolutional neural networks via targeted weight perturbations. In 2020 IEEE International Joint Conference on 14 Biometrics, IJCB 2020, Houston, TX, USA, September 28 - October 1, 2020, pages 1–9. IEEE, 2020. [31] Xijie Huang, Moustafa Alzantot, and Mani B. Srivastava. Neuroninspect: Detecting backdoors in neural networks via output explanations. CoRR, abs/1911.07399, 2019. [21] Yansong Gao, Bao Gia Doan, Zhi Zhang, Siqi Ma, Jiliang Zhang, Anmin Fu, Surya Nepal, and Hyoungshick Kim. Backdoor attacks and countermeasures on deep learning: A comprehensive review. CoRR, abs/2007.10760, 2020. [32] Arun I. and Murugesan Venkatapathi. An algorithm for estimating volumes and other integrals in n dimensions. CoRR, abs/2007.06808, 2020. [33] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 8580–8589, 2018. [22] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith Chinthana Ranasinghe, and Surya Nepal. STRIP: a defence against trojan attacks on deep neural networks. In David Balenson, editor, Proceedings of the 35th Annual Computer Security Applications Conference, ACSAC 2019, San Juan, PR, USA, December 09-13, 2019, pages 113–125. ACM, 2019. [23] Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. arXiv preprint arXiv:2204.06974, 2022. [34] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 
Training generative adversarial networks with limited data. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, MariaFlorina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [24] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. CoRR, abs/1708.06733, 2017. [25] Shangwei Guo, Chunlong Xie, Jiwei Li, Lingjuan Lyu, and Tianwei Zhang. Threats to pre-trained language models: Survey and taxonomy. CoRR, abs/2202.06862, 2022. [35] Sara Kaviani and Insoo Sohn. Defense against neural trojan attacks: A survey. Neurocomputing, 423:651– 667, 2021. Adam: A [36] Diederik P Kingma and Jimmy Ba. method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [26] Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, and Dawn Song. TABOR: A highly accurate approach to inspecting and restoring trojan backdoors in AI systems. CoRR, abs/1908.01763, 2019. [37] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019. [27] Jonathan Hayase, Weihao Kong, Raghav Somani, and Sewoong Oh. Spectre: Defending against backdoor attacks using robust statistics. arXiv preprint arXiv:2104.11315, 2021. [38] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus patterns: Revealing backdoor attacks in cnns. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 298–307. Computer Vision Foundation / IEEE, 2020. [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [29] Harold Hotelling. The generalization of student’s ratio. In Breakthroughs in statistics, pages 54–65. Springer, 1992. [39] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [30] Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013. [40] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278– 2324, 1998. 15 [41] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019. [50] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part X, volume 12355 of Lecture Notes in Computer Science, pages 182–199. Springer, 2020. [42] Joseph Lee Rodgers and W Alan Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, 1988. [51] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. 
Reflection backdoor: A natural backdoor attack on deep neural networks. In European Conference on Computer Vision, pages 182–199. Springer, 2020. [43] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-backdoor learning: Training clean models on poisoned data. In NeurIPS, 2021. [52] Geoffrey J McLachlan. Discriminant analysis and statistical pattern recognition. John Wiley & Sons, 2005. [44] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022. Mahalanobis distance. [53] Goeffrey J McLachlan. Resonance, 4(6):20–26, 1999. [45] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16463–16472, 2021. [54] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012. [55] Tuan Anh Nguyen and Anh Tran. Input-aware dynamic backdoor attack. Advances in Neural Information Processing Systems, 33:3454–3464, 2020. [46] Junyu Lin, Lei Xu, Yingqi Liu, and Xiangyu Zhang. Composite backdoor attack for deep neural network by mixing existing benign features. In Jay Ligatti, Xinming Ou, Jonathan Katz, and Giovanni Vigna, editors, CCS ’20: 2020 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, November 9-13, 2020, pages 113–131. ACM, 2020. [56] Ren Pang, Hua Shen, Xinyang Zhang, Shouling Ji, Yevgeniy Vorobeychik, Xiapu Luo, Alex Liu, and Ting Wang. A tale of evil twins: Adversarial inputs versus poisoned models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 85–99, 2020. [47] Tao Liu, Wujie Wen, and Yier Jin. Sin 2: Stealth infection on neural network—a low-cost agile neural trojan attack methodology. In 2018 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), pages 227–230. IEEE, 2018. [57] Ren Pang, Zheng Zhang, Xiangshan Gao, Zhaohan Xi, Shouling Ji, Peng Cheng, and Ting Wang. TROJANZOO: everything you ever wanted to know about neural backdoors (but were afraid to ask). In IEEE European Symposium on Security and Privacy, EuroS&P 2022, Genoa, June 6-10, 2022. IEEE, 2022. [48] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. ABS: scanning neural networks for back-doors by artificial brain stimulation. In Lorenzo Cavallaro, Johannes Kinder, XiaoFeng Wang, and Jonathan Katz, editors, Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS 2019, London, UK, November 11-15, 2019, pages 1265–1282. ACM, 2019. [58] Ximing Qiao, Yukun Yang, and Hai Li. Defending neural backdoors via generative distribution modeling. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 14004–14013, 2019. [49] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society, 2018. [59] Erwin Quiring and Konrad Rieck. Backdooring and poisoning neural networks with image-scaling attacks. 
In 2020 IEEE Security and Privacy Workshops, SP Workshops, San Francisco, CA, USA, May 21, 2020, pages 41–47. IEEE, 2020. 16 [60] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007. conference on computer vision and pattern recognition, pages 1–9, 2015. [71] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. [61] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. TBT: targeted neural network attack with bit trojan. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 13195–13204. Computer Vision Foundation / IEEE, 2020. [72] Di Tang, XiaoFeng Wang, Haixu Tang, and Kehuan Zhang. Demon in the variant: Statistical analysis of dnns for robust backdoor contamination detection. In Michael Bailey and Rachel Greenstadt, editors, 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 1541–1558. USENIX Association, 2021. [62] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. [73] Sebastian Thrun. A lifelong learning perspective for mobile robot control. In Intelligent robots and systems, pages 201–214. Elsevier, 1995. [63] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987. [74] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 8011–8021, 2018. [64] Jonathan J Ruel and Matthew P Ayres. Jensen’s inequality predicts effects of environmental variation. Trends in Ecology & Evolution, 14(9):361–366, 1999. [65] Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. Dynamic backdoor attacks against machine learning models. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 703–718. IEEE, 2022. [75] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Label-consistent backdoor attacks. CoRR, abs/1912.02771, 2019. [66] Esha Sarkar, Hadjer Benkraouda, and Michail Maniatakos. Facehack: Triggering backdoored facial recognition systems using facial characteristics. CoRR, abs/2006.11623, 2020. [76] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009. [67] Giorgio Severi, Jim Meyer, Scott E. Coull, and Alina Oprea. Explanation-guided backdoor poisoning attacks against malware classifiers. In Michael Bailey and Rachel Greenstadt, editors, 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 1487–1504. USENIX Association, 2021. [77] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pages 707–723. IEEE, 2019. 
[68] Guangyu Shen, Yingqi Liu, Guanhong Tao, Shengwei An, Qiuling Xu, Siyuan Cheng, Shiqing Ma, and Xiangyu Zhang. Backdoor scanning for deep neural networks through k-arm optimization. arXiv preprint arXiv:2102.05123, 2021. [78] Jie Wang, Ghulam Mubashar Hassan, and Naveed Akhtar. A survey of neural trojan attacks and defenses in deep learning. CoRR, abs/2202.07183, 2022. [79] Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. Practical detection of trojan neural networks: Data-limited and data-free cases. In Proceedings of the European Conference on Computer Vision (ECCV), 2020. [69] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [70] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE [80] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. Advances in Neural Information Processing Systems, 34, 2021. 17 [81] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN inversion: A survey. CoRR, abs/2101.05278, 2021. 9 Appendix of Details of Estimation κ Estimation of κPr . In Section 3.3, we described how to calculate Pr(x) for an given input x. Specifically, our implementation is based on the GAN inversion tools in the official repository of [34]. The original code of [34] can only recover inputs’ style parameters that actually are projections of the z through a transformation network. Thus, we modified the original code to directly recover z. After computing Pr(x), it is still not trivial to get the expectations, EPr(x|x∈B ) Pr(G−1 (x)) and EPr(x|x∈A(B )) Pr(G−1 (x)), due to the poor precision in computing tiny numbers (e.g., 1e−100 ). Thus, we, instead, compute the logarithm of the ra- [82] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, and Bo Li. Detecting AI trojans using meta neural analysis. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pages 103–120. IEEE, 2021. [83] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, and Bo Li. Detecting AI trojans using meta neural analysis. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pages 103–120. IEEE, 2021. EPr(x|x∈B ) Pr(G−1 (x)) ). Furthermore, we −1 Pr(x|x∈A(B )) Pr(G (x)) G−1 (x) follows a Gaussian distribution tio, i.e., ln(κPr ) = ln( E observed that zx = for both x ∈ B and x ∈ A(B ). Combining with the fact that Pr(zx ) ∝ kzx k22 , we get that [84] Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 2041–2055, 2019. µ2 µ2 B A(B ) ) ), ln(κPr ) = − 21 ( σ2 B+1 − σ2 A(B+1 [85] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems, 32, 2019. where we assume zx ∼ N(µB , σB ) for x ∈ B and, for x ∈ A(B ), zx ∼ N(µA(B ) , σA(B )). Empirically, we sampled 100 points (x) in B and 100 points (x) in A(B ) to estimate the mean and the variance of the corresponding zx . [86] Kota Yoshida and Takeshi Fujino. 
Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks. In Jay Ligatti and Xinming Ou, editors, AISec@CCS 2020: Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, Virtual Event, USA, 13 November 2020, pages 117–127. ACM, 2020. Estimation of κV . In Section 3.3, we described κV = Ext(B ) Ext(A(B )) , and briefly introduced how to calculate Ext(B ). Specifically, for a randomly selected origin x ∈ B , we sampled a set of 256 other inputs {x1 , x2 , ..., x256 }. We then generated a set of 256 random directions from x as −x −x , x2 −x , ..., kxx256−xk }, and along each direction, we { kxx1−xk 1 2 kx2 −xk2 2 256 used an binary search algorithm to find the extent (i.e., how far the origin x is from the boundary). Take a source-specific backdoor (with the source class of 1 and the target class of 0) as an example. We used a benign model fP and a backdoored model fb to detect the boundary: fP (x) = 1 and fb (x) = 1 ⇔ x ∈ B ; fP (x) = 1 and fb (x) = 0 ⇔ x ∈ A(B ). Finally, for computing Ext(B ), we randomly selected 32 different origins, and set Ext(B ) as the average extent among those computed from these origins. The same method was also used to compute Ext(A(B )). [87] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6848–6856, 2018. 10 10.1 Appendix of Proofs Proof of Proposition 1 The inequality 0 ≤ dH −W 1 (D , D ′ ) ≤ 1 is obvious, and thus we omit the proof. Next, we focus on proving dW 1 (D , D ′ ) ≤ dH −W 1 (D , D ′ ) and dH −W 1 (D , D ′ ) = 12 dH (D , D ′ ). Let’s recall the definitions of Wasserstein-1 distance and H -divergence: • Wasserstein-1 distance: Assuming Π(D , D ′ ) is the set of joint distributions γ(x, x′ ) whose marginals are D and D ′ , 18 respectively, the Wasserstein-1 distance between D and D ′ is dW 1 (D , D ′ ) = E(x,x′ )∼γ kx − x′ k2 . inf Thus, we get what’s desired. (14) γ∈Π(D ,D ′ ) 10.3 • H -divergence: Given two probability distributions D and D ′ over the same domain X , we consider a hypothetical binary classification on X : H = {h : X 7→ {0, 1}}, and denote by I(h) the set for which h ∈ H is the characteristic function, i.e., x ∈ I(h) ⇔ h(x) = 1. The H -divergence between D and D ′ is dH (D , D ′ ) = 2 sup | PrD (I(h)) − PrD ′ (I(h))|. Proof. Let’s consider two Lemmas first. Lemma 10. Supposing there are two sets of n numbers: n {u1 , u2 , ..., un } and {v1 , v2 , ..., vn }, if ∀ai ≥ 0, ∀bi ≥ 0, ∑ ui = 1 and ∑ vi = 1, for a number K ≥ 1, we have i=1 n From the definition of H -divergence (Eq. 15), one can directly obtain that dH −W 1 (D , D ′ ) = 21 dH (D , D ′ ), using the fact that max({0, 1}) = max([0, 1]) = 1 and min({0, 1}) = min([0, 1]) = 0. Referring to the prior work [4], we apply the KantorovichRubinstein duality [76] to transform Eq. 14 into its dual form: dˆW 1 (D, D′ ) = max [EPrD h(x) − EPrD ′ h(x)], i=1 A,B ,t Proof. First, we have following equations. A,B ,t Now, we use above two Lemmas to prove Theorem 4. 
First, we denote u(x, y) as = = (x,y)∈A(B )×Y R (x,y)∈C− R PrDA,B ,t (x, y) − R (x,y)∈C− (x,y)∈C− (x,y)∈A(B )×Y A,B ,t PrDA,B ,t (y|x) and v(x, y) as (x,y)∈A(B )×Y (x,y)∈A(B )×Y According to Lemma 3, we have = dH −W 1 (DP , DA,B ,t ) R max{ Z β u(x, y) − κ1 v(x, y), 0} Pr(B ) = 1 κ PrDP (x, y) β ZA,B ,t get PrDA,B ,t (x, y)) (x, y) − PrDP (x, y), 0) 19 ≥ 1 κ A,B ,t (x,y)∈A(B )×Y Pr(B ) R max{Ku(x, y) − v(x, y), 0}, (x,y)∈A(B )×Y where we set K = (PrDA,B ,t (x, y) − PrDP (x, y)) max(PrD ′ (x) (A(B )) PrDP (y|x). Apperently, u(x, y) ≥ 0 and v(x, y) ≥ 0. BeR R sides, u(x, y) = 1 and v(x, y) = 1. We split A(B ) × Y = C+ ∪ C− where C+ = {(x, y) : PrDP (x, y) ≥ PrDA,B ,t (x, y)} and C− = {(x, y) : PrDP (x, y) < PrDA,B ,t (x, y)}. R A,B ,t A,B ,t h(x,y) (x,y)∈A(B )×Y dH −W 1 (DP , DA,B ,t ) = (1 − Z 1 )(1 − Pr(A(B ))) + Pr(A(B )) − PrD PrD Pr(x) Pr(A(B )) dH −W 1 (DP , DA,B ,t ) = (1 − Z 1 )(1 − Pr(A(B ))) A,B ,t R + max ( h(x, y)(PrDP (x, y) − PrDA,B ,t (x, y))). ( (18) Since κ1 ≤ β, we have κβ ≥ 1. Besides, since κ ≥ 1, we have κ − Pr(B ) > 0. Putting them together, we have κβ ≥ ZA,B ,t , or Z β ≥ κ1 as desired. Proof. Supposing ZA,B ,t = (x,y) P(x, y) where P(x, y) is defined in Eq. 6, A(·) is the trigger function, B is the backdoor region and t is the target label, we have: − κβ − ZA,B ,t 1 2 κ (κ β − κβ Pr(B ) − κ + Pr(B )) 1 κ (κ − Pr(B )(κβ − 1) = = R A,B ,t (17) Lemma 11. Supposing κ ≥ 1, κ1 ≤ β ≤ 1 and ZA,B ,t = 1 − β 1 ≥ κ1 . κ Pr(B) + β Pr(B), we have Z Proof of Theorem 2 R i=1 Proof. One can easily get the desired through max{Kui − vi , 0} ≥ max{ui − vi , 0} + (K − 1)ui and max{Kui − vi , 0} ≤ Kui where khkL ≤1 represents all 1-Lipschitz functions h : X 7→ R. Notice that, without loss of generality, we assume X = [0, 1]n where n is the dimension of the input. Thereby, we can further assume that h(x) ∈ [0, 1] which will not change the maximum value of Eq. 16. Under the above assumptions, comparing with dH −W 1 (Eq. 2), dW 1 (Eq. 16) is additionally constrained by that h should be a 1-Lipschitz function. Hence, dW 1 (D , D ′ ) ≤ dH −W 1 (D , D ′ ) as we desired. 10.2 n K ≥ ∑ max{Kui − vi , 0} ≥ (K − 1) + ∑ max{ui − vi , 0}. (16) khkL ≤1 i=1 n (15) h∈H Proof of Theorem 4 β 1 ZA,B ,t / κ . According to Lemma 11, we have and, thus, K ≥ 1. Applying Lemma 10, we further ≥ dH −W 1 (DP , DA,B ,t ) 1 κ Pr(B )((K − 1) + = 1 κ R (x,y)∈A(B )×Y Pr(B )((K − 1) + S). max{u(x, y) − v(x, y), 0}) β 1 ZA,B ,t / κ Taking K = β (Z A,B ,t − Apparently, r1 < into the last equation, we have 1 κ (1 − S)) Pr(B ) ≤ dH −W 1 (DP , DA,B ,t ) A,B ,t 0 when κ = β1 . In this case, dH −W 1 (DP , DA,B ,t ) reaches the minimum value βS Pr(B ) as desired. A,B ,t 10.6 Proof of Corollary 5 Proof. After calculation, we get that the derivative of w.r.t. β is 1 (1− κ1 2 ZA, B ,t Proof of Lemma 8 Proof. Supposing fP = cP ◦ gP and fb = cb ◦ gb , we have β ZA,B ,t β α2 = ( mL ∑ ∑ max(gP (x)y − gb (x)y , 0))2 Pr(B )). Since κ ≥ 1, we have κ1 Pr(B ) ≤ x∈X y∈Y 1. Thus, the derivative is non-negative, which indicates that β increases along with the increasing β. Considering that Z ≤ A,B ,t 1 κ ≤ κ Pr(B ) when β = κ1 , and achieves the upper-bound κ+κ Pr( B )−Pr(B ) when β = 1. Taking these results into Theorem 4, we get what’s desired. ≤ β ∈ [ κ1 , 1], we obtain that 10.5 β ZA,B ,t achieves the lower-bound = ≤ Proof. Let’s first consider the upper-bound of β dH −W 1 (DP , DA,B ,t ), that is Pr(B ) Z according to A,B ,t Theorem 4. After calculation, we get that the derivative of β < 0. Thus, Z β increases monotonously w.r.t. 
κ is Z−β 2 Z A,B ,t A,B ,t 10.7 along with the decreasing of κ. Considering κ1 ≤ β, we obtain that Z β reaches its upper-bound β when κ = β1 , and thus A,B ,t dH −W 1 (DP , DA,B ,t ) ≤ β Pr(B ) Let’s now consider the lower-bound of dH −W 1 (DP , DA,B ,t ), that is Pr(B )( Z β − κ1 (1−S)) according to Theorem 4. How- ZA,B ,t After calculation, we get the derivative of κ is β ZA,B ,t ∑ kgP (x) − gb (x)k22 x∈X β2 2 mL kφ(X)(ωb − ωP )k2 β2 2 2 mL kφ(X)k2 kωb − ωP k2 . Proof of Lemma 9 = ≤ nP ntg 2 nP +ntg dM (mP , mb ) nP ntg 2 nP +ntg λmax kmP − mb k2 , p where dM (mP , mb ) = (mP − mb )T Σ−1 (mP − mb ) is the Mahalanobis distance [53] and λmax is the largest eigenvalue of Σ−1 . Next, we demonstrate that kmP − mb k2 ≤ dH −W 1 (NP , Nb ) when NP = N (mP , σ) and Nb = N (mb , σ). Actually, if kmP − mb k2 = dW 1 (NP , Nb ), we can easily prove the inequality according to Proposition 1 that illustrates dW 1 (NP , Nb ) ≤ dH −W 1 (NP , Nb ). Next, we strictly prove kmP − mb k2 = dW 1 (NP , Nb ). According to the Jensen’s inequality [64], Ekx − x′ k2 ≥ kE(x − x′ )k2 = kmP − mb k2 . Thus dW 1 (NP , Nb ) ≥ kmP − mb k2 . Again by Jensen’s inequality, (Ekx − x′ k2 )2 ≤ Ekx − x′ k22 . Thus, dW 1 (NP , Nb ) ≤ dW 2 (NP , Nb ) where dW 2 (·, ·) is the Wasserstein-2 distance. As proved in paper [19], the Wasserstein-2 distance between two normal distribution can be calculated by: A,B ,t − κ1 = 0 when κ = β1 . β2 mL T2 A,B ,t β ∑ ∑ |gP (x) − gb (x)|)2 x∈X y∈Y ( mβ ∑ √1L kgP (x)y − gb (x)y k2 )2 x∈X Proof. Specifically, a test statistic T 2 is calculated as ever, κ1 S is o( κ1 ) when dH −W 1 (DP , DA,B ,t ) becomes close to its lower-bound, since S → 0 in this case. Thus, we only need to consider the relation between Z β − κ1 and κ. Particularly, we have that β ( mL The inequality of arithmetic and geometric means are used to obtain the third and forth transformations. The cauchyschwarz inequality are used to obtain the last transformation. After simple math, the lower-bound of kωb − ωP k2 will be derived as what’s desired. Proof of Corollary 6 A,B ,t and r2 < β1 . Considering the coefficient of the quadratic term is positive, we obtain that Z β − κ1 A,B ,t increases monotonously along with the increasing κ when κ ≥ β1 . This indicates that Z β − κ1 reaches its lower-bound as desired. Similarly, we get dH −W 1 (DP , DA,B ,t ) ≤ β Pr(B ). This completes this proof. Z 10.4 1 β − κ1 w.r.t. κ2 (1+β2 Pr(B )2 +β Pr(B ))−2κ(Pr(B )+2β Pr(B )2 )+Pr(B )2 . 2 2 ZA, B ,t κ The denominator is strictly positive and numerator is a quadratic function of κ. After calculation, we get its two roots r1 and r2: √ 1− β3 Pr(B )3 r1 = β1 − β(1+β2 Pr(B )2 +β Pr(B )) √ 1+ β3 Pr(B )3 r2 = β1 − β(1+β2 Pr(B )2 +β Pr(B )) 1 2 (N , N ) = km − m k2 + tr(Σ + Σ − 2(Σ Σ ) 2 ) dW P P b P P b b b 2 2 20 2 (N , N ) = km − Because ΣP = Σb = Σ as we assumed, dW P P b 2 2 mb k2 . Thus, putting the above together, we get kmP − mb k2 ≤ dW 1 (NP , Nb ) ≤ kmP − mb k2 , which indicates kmP − mb k2 = dW 1 (NP , Nb ) as desired. Due to that dH −W 1 (DP , DA,B ,t ) is the maximum value among all possible separation functions, we have Clearly, when △(X, s) increases, R(X, s) becomes larger. − However, increasing △(X, s) is less effective for removing − backdoors with small backdoor distances dH −W 1 (DP , DA,B ,t ) / apfor the following reasons. 1) When R(X, s) ∩ A(B ) = 0, parently, the predicted labels of trigger-carrying inputs do not / change. 
2) When R(X, s) ∩ A(B ) 6= 0/ and A(B ) \ R(X, s) 6= 0, the small dH −W 1 (DP , DA,B ,t ) will lead to a large A(B ) \ R(X, s), i.e., the more x ∈ A(B ) close to the decision boundary, the more x ∈ A(B ) outside R(X, s), indicating that the backdoor remains largely un-removed. This is because, during robustness enhancement, fb is learned to push x ∈ X away from the boundary as much as possible, which is considered over/ fitting by the neural network. 3) When A(B ) \ R(X, s) = 0, R(X, s) covers many inputs within the robust radius whose true label is not s, i.e., f ∗ (x′ ) 6= s, and thus the robustness enhancement will result in a false prediction on these inputs, which is not desired. This is due to the irregular classification boundary of fb that makes the precise removal of the backdoor impossible without knowing the trigger function A. Besides, increasing △(X, s) will decrease △(X,t) for t 6= s, dH −W 1 (DP , DA,B ,t ) ≥ dH −W 1 (XP , Xb ) = dH −W 1 (NP , Nb ). Thus α ≥ kmP −mb k2 . Taking this into our original inequality, n n we get T 2 ≤ nPP+ntgtg λmax α2 as desired. 11 Appendix of TSA on Backdoor Unlearning In addition to detection, the defender may also want to remove the backdoor from an infected model, either after detecting the model or through “blindly” unlearning the backdoor should it indeed be present in the model. We classify unlearning methods for backdoor removal into two categories: targeted unlearning (for removing detected backdoors) and “blind” unlearning. − “Blind” unlearning. Such unlearning methods can be further classified into two sub-categories: fine-tuning and robustness enhancement. The former fine-tunes a given model on benign inputs, through which Catastrophic Forgetting (CF) would be induced so an infected model’s capability to recognize the trigger may be forgotten. To study the relationship between CF and the backdoor similarity, we identify the lower bound of task drift based on Lemma 8: kδTP →TA,B ,t (X)k2 ≥ α √ mL β Targeted unlearning. The targeted unlearning methods are guided by the triggers reconstructed by the backdoor detection methods. As we demonstrated in Section 4, backdoor detection methods themselves become hard when the backdoor distance is small. Therefore, the targeted unlearning methods also become less effective for the backdoors with smaller backdoor distances. (19) 12 Eq 19 shows that small task drift, the measurement of CF, requires small backdoor distance (depicted by α) between the primary task TP and the backdoor task TA,B ,t , implying that “blind" unlearning through fine-tuning becomes less effective (i.e., the backdoor may not be completely forgotten) when the backdoor distance is small. The robustness enhancement methods aim to enhance the robustness radius of a backdoor model fb within which the model prediction remains the same. Specifically, the robustness radius △(X, s) for the source label s on a set of benign Knowledge distillation. In knowledge distillation, usually, there is a teacher model and a student model. The knowledge distillation defense uses the knowledge distillation process to suppress the student model from learning the backdoor behaviour from the teacher model through temperature controlling. Specifically, following the notion used in paper [86], for the temperature T = 1, we have fb (x) j = softmaxT =1 (µ j ) inputs X could be formulated as de f − min {△(x) : x∈X f (x)=s b inf f (x+δ)6=s kδk}. 
We denote R(X, s) as the set of inputs x′ within the robustness radius △(X, s), where softmaxT (µ j ) = − R(X, s) = {x′ : inf x∈X f (x)=s b kx′ − xk Appendix of Backdoor Disabling Even though the backdoor with small backdoor distance is hard to be detected and unlearned from the target model, the defender could suppress the backdoor behaviour through backdoor disabling methods. Backdoor disabling aims to remove the backdoor behaviour of infected model without affecting model predictions on benign inputs. There are mainly two kinds of methods: knowledge distillation and inputs preprocessing. − △(X, s) = − which eventually results in a model making the false prediction on the inputs with the true label of t. exp(µ j )/T . ∑ j′ ∈Y exp(µ j′ /T ) The bigger is T , the softer is the prediction result fb (x). The high temperature (e.g., T = 20 as used by [86]) could prevent the student model from learning typical backdoors that drives the model < △(X, s)}. − 21 to generate highly confident predictions (of the target label) on trigger-carrying inputs. However, backdoors with low backdoor distance drive the model to generate only moderate predictions for trigger-carrying inputs that are close to the classification boundary, which may still be learned by the student model through the high temperature knowledge distillation. Besides, the higher is the temperature, the smaller amount of knowledge could be learned by the student model, which result in the relatively low accuracy of the student model. On the other hand, the low temperature will sharpen the classification boundary, which allows the student model to learn confident predictions from teacher model. However, in this case, the predictions of the teacher model for trigger-carrying inputs become confident, i.e., fb (A(x))t is high. As a result, the student model may easily learn the backdoor, in a similar manner as learning it from a contaminated training dataset in a traditional backdoor attack when the backdoor has a large backdoor distance. Therefore, the choice of the temperature reflects a trade-off between the performance (i.e., the effectiveness of knowledge distillation) and security (i.e., the effectiveness of backdoor disabling) of the student model. Finally, the knowledge distillation could be viewed as a continual learning process from the backdoor task TA,B ,t to the primary task TP . As shown in Eq. 19, it is expected that the student model will give similar predictions as the teacher model in predictions of either clean or trigger-carrying inputs, when the backdoor has a small backdoor distance. same δ, the benign model would also flip the predicted label for A(x) + δ, i.e., arg max fP (A(x) + δ) j 6= arg max fP (A(x)) j , j j because A(x) is also close to the classification boundary in fP . There is no reason to block the input which is predicted by the backdoor model fb with the same label as the one predicted by a benign model fP . If a large noise δ was added to the inputs, the performance of the deep neural networks (both fP and fb ) will decrease. Consequently, even though the backdoor is suppressed, the benign model for the primary task will become worse, indicating a trade-off between utility and security, which will be discussed in the next section. In general, there is no significant difference between the robustness of the benign model fP and the backdoor model fb for trigger-carrying inputs, when backdoor distance is small. 
Thus, the input preprocessing methods may only moderately suppress the backdoors with small backdoor distances. 13 Appendix of IMC Experiments We exploited IMC to generate 200 backdoored models on CIFAR10 with its official code that has been integrated into the TrojnZoo framework. In this experiment, those backdoors carried on those backdoored models are source-specific with the source class is 1 and the target class is 0. We set β = 0.1, i.e., injecting 500 trigger-carrying inputs into the training set, and kept other parameters as the default values, i.e., the trigger size is 3x3 and the transparency of trigger is 0 (meaning the trigger is clear and has not been blurred). Table 6 illustrates the accuracy of 6 detections in distinguishing 200 IMC backdoored models from 200 benign models, comparing with what has been obtained by TSA attacks. Input preprocessing. This kind of defenses introduce a preprocessing module before feeding the inputs into the target model that removes the trigger contained in inputs [44]. Accordingly, the modified triggers no longer match the hidden backdoor and therefore preventing the activation of the backdoor. Without knowing the details of the trigger, these methods perform preprocessing on both benign inputs and trigger-carrying inputs. Thus, actually, these methods disable backdoor based on a fundamental assumption that the trigger is sensitive to noise and the robustness of benign model fP and backdoor model fb for trigger-carrying inputs differ significantly. To study the robustness of backdoors with small backdoor distance, we investigate the difference between the predictions for the trigger-carrying inputs and the trigger-carrying inputs with small added noise δ, i.e., | fb (A(x) + δ)t − fb (A(x))t |. When the backdoor distance is small, not only A(x) is close to the classification boundary of the backdoor model fb but also close to the classification boundary of benign model fP . Intuitively, A(x) + δ is close to the classification boundary of both fb and fP when kδk is small. Thus, | fb (A(x) + δ)t − fb (A(x))t | should be small. One may argue that some δ would make arg max fb (A(x) + δ) j 6= Table 6: The backdoor detection accuracies (%) for IMC and TSA obtained by six detection methods. IMC TSA j t = arg max fb (A(x)) j even if | fb (A(x) + δ)t − fb (A(x))t | is j small, as small δ could flip the predicted label for A(x) that is close to the classification boundary in fb . However, for the 22 K-ARM 89.50 59.25 MNTD 98.75 51.25 ABS 74.25 51.00 TND 80.25 48.75 SCAn 91.00 63.25 AC 73.50 55.25
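As a complement to the discussion of input preprocessing in Appendix 12, the following PyTorch-style sketch probes how much a backdoored model's confidence in the target class changes when small noise is added to trigger-carrying inputs, i.e., it estimates | fb(A(x)+δ)t − fb(A(x))t |. The model and apply_trigger arguments are placeholders assumed to be supplied by the caller, and the model is assumed to output softmax probabilities; this is an illustrative probe, not the exact measurement procedure used in our experiments. A small returned value suggests that noise-based preprocessing defenses have little signal to exploit against backdoors with small backdoor distance.

```python
import torch

@torch.no_grad()
def trigger_noise_sensitivity(model, apply_trigger, inputs, target_class,
                              sigma=0.02, n_trials=8):
    """Estimate | f_b(A(x)+delta)_t - f_b(A(x))_t | under small Gaussian noise.

    `model` maps a batch of images to class probabilities and `apply_trigger`
    implements the trigger function A; both are assumed to be provided.
    """
    model.eval()
    triggered = apply_trigger(inputs)                    # A(x)
    base = model(triggered)[:, target_class]             # f_b(A(x))_t
    gaps = []
    for _ in range(n_trials):
        noise = sigma * torch.randn_like(triggered)      # small delta
        perturbed = (triggered + noise).clamp(0.0, 1.0)
        gaps.append((model(perturbed)[:, target_class] - base).abs())
    return torch.stack(gaps).mean().item()
```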