arXiv:2210.06509v1 [cs.CR] 12 Oct 2022
Understanding Impacts of Task Similarity on Backdoor Attack and Detection
Di Tang
Indiana University Bloomington
Rui Zhu
Indiana University Bloomington
XiaoFeng wang
Indiana University Bloomington
Haixu Tang
Indiana University Bloomington
Yi Chen
Indiana University Bloomington
has never been fully understood. Even though new attacks and
detections continue to show up, they are mostly responding
to some specific techniques, and therefore offer little insights
into the best the adversary could do and the most effective
strategies the detector could possibly deploy.
Such understanding is related to the similarity between
the primary task that a benign model is supposed to accomplish and the backdoor task that a backdoored model actually
performs, which is fundamental to distinguishing between a
backdoored model and its benign counterpart. Therefore, a
Task Similarity Analysis (TSA) between these two tasks can
help us calibrate the extent to which a backdoor is detectable
(differentiable from a benign model) by not only known but
also new detection techniques, inform us which characters of
a backdoor trigger contribute to the improvement of the similarity, thereby making the attack stealthy, and further guides
us to develop even stealthier backdoors so as to better understand what the adversary could possibly do and what the
limitation of detection could actually be.
Abstract
With extensive studies on backdoor attack and detection,
still fundamental questions are left unanswered regarding the
limits in the adversary’s capability to attack and the defender’s
capability to detect. We believe that answers to these questions can be found through an in-depth understanding of the
relations between the primary task that a benign model is supposed to accomplish and the backdoor task that a backdoored
model actually performs. For this purpose, we leverage similarity metrics in multi-task learning to formally define the
backdoor distance (similarity) between the primary task and
the backdoor task, and analyze existing stealthy backdoor attacks, revealing that most of them fail to effectively reduce
the backdoor distance and even for those that do, still much
room is left to further improve their stealthiness. So we further
design a new method, called TSA attack, to automatically generate a backdoor model under a given distance constraint, and
demonstrate that our new attack indeed outperforms existing
attacks, making a step closer to understanding the attacker’s
limits. Most importantly, we provide both theoretic results
and experimental evidence on various datasets for the positive correlation between the backdoor distance and backdoor
detectability, demonstrating that indeed our task similarity
analysis help us better understand backdoor risks and has the
potential to identify more effective mitigations.
1
Methodology and discoveries. This paper reports the first
TSA on backdoor attacks and detections. We formally model
the backdoor attack and define backdoor similarity based
upon the task similarity metrics utilized in multi-task learning
to measure the similarity between the backdoor task and its
related primary task. On top of the metric, we further define
the concept of α-backdoor to compare the backdoor similarity across different backdoors, and present a technique to
estimate the α for an attack in practice. With the concept of
α-backdoor, we analyze representative attacks proposed so
far to understand the stealthiness they intend to achieve, based
upon their effectiveness in increasing the backdoor similarity.
We find that current attacks only marginally increased the
overall similarity between the backdoor task and the primary
tasks, due to that they failed to simultaneously increase the
similarity of inputs and that of outputs between these two
tasks. Based on this finding, we develop a new attack/analysis
technique, called TSA attack, to automatically generate a backdoored model under a given similarity constraint. The new
technique is found to be much stealthier than existing attacks,
Introduction
A backdoor is a function hidden inside a machine learning
(ML) model, through which a special pattern on the model’s
input, called a trigger, can induce misclassification of the input. The backdoor attack is considered to be a serious threat
to trustworthy AI, allowing the adversary to control the operations of an ML model, a deep neural network (DNN) in
particular, for the purposes such as evading malware detection [67], gaming a facial-recognition system to gain unauthorized access [50], etc.
Task similarity analysis on backdoor. With continued effort on backdoor attack and detection, this emerging threat
1
not only in terms of backdoor similarity, but also in terms of
its effectiveness in evading existing detections, as observed in
our experiments. Further, we demonstrate that the backdoor
with high backdoor similarity is indeed hard to detect through
theoretic analysis as well as extensive experimental studies
on four datasets under six representative detections using our
TSA attack together with five representative attacks proposed
in prior researches.
and model inputs. This categorization has been used in our
research to analyze different detection approaches (Section 4).
More specifically, detection on model outputs captures
backdoored models through detecting the difference between the outputs of backdoored models and benign models
on some inputs. Such detection methods include NC [77],
K-ARM [68], MNTD [83], Spectre [27], TABOR [26],
MESA [58], STRIP [22], SentiNet [13], ABL [43], ULP [38],
etc. Detection of model weights finds a backdoored model
through distinguishing its model weights from those of benign models. Such detection approaches include ABS [48],
ANP [80], NeuronInspect [31], etc. Detection of model inputs
identifies a backdoored model through detecting difference
between inputs that let a backdoored model and a benign
model output similarly. Prominent detections in this category
include SCAn [72], AC [11], SS [74], etc.
Contributions. Our contributions are as follows:
• New direction on backdoor analysis. Our research has
brought a new aspect to the backdoor research, through the
lens of backdoor similarity. Our study reveals the great impacts backdoor similarity has on both backdoor attack and
detection, which can potentially help determine the limits of
the adversary’s capability in a backdoor attack and therefore
enables the development of the best possible response.
• New stealthy backdoor attack. Based upon our understanding on backdoor similarity, we developed a novel technique,
TSA attack, to generate a stealthy backdoor under a given
backdoor similarity constraint, helping us better understand
the adversary’s potential and more effectively calibrate the
capability of backdoor detections,
2
2.1
2.3
We focus on backdoors for image classification tasks, while
assuming a white-box attack scenario where the adversary can
access the training process. The attacker inject the backdoor
to accomplish the goal formally defined in Section 3.2 and
evade from backdoor detections.
The backdoor defender aim to distinguish backdoored models from benign models. She can white-box access those backdoored models and owns a small set of benign inputs. Besides,
the defender may obtain a set of mix inputs containing a large
number of benign inputs together with a few trigger-carrying
inputs, however which inputs carried the trigger in this set is
unknown to her.
Background
Neural Network
We model a neural network model f as a mapping function from the input space X to the output space Y , i.e.,
f : X 7→ Y . Further, the model f can be decomposed into
two sub-functions: f (x) = c(g(x)). Specifically, for a classification task with L classes where the output space Y =
{0, 1, ..., L − 1}, we define g : X 7→ [0, 1]L , c : [0, 1]L 7→ Y and
c(g(x)) = arg max j g(x) j where g(x) j is the j-th element of
g(x). According to the common understanding, after well
training, g(x) approximates the conditional probability of presenting y given x, i.e., g(x)y ≈ Pr(y|x), for y ∈ Y and x ∈ X .
2.2
Threat Model
3
TSA on Backdoor Attack
Not only does a backdoor attack aim at inducing misclassification of trigger-carrying inputs to a victim model, but it
is also meant to achieve high stealthiness against backdoor
detections. For this purpose, some attacks [17, 49] reduce
the L p -norm of the trigger, i.e., kA(x) − xk p , to make triggercarrying inputs be similar to benign inputs, while some others
construct the trigger using benign features [46, 66]. All these
tricks are designed to evade specific detection methods. Still
less clear is the stealthiness guarantee that those tricks can
provide against other detection methods. Understanding such
stealthiness guarantee requires to model the detectability of
backdoored models, which depends on measuring fundamental differences between backdoored and benign models that
was not studied before.
To fill in this gap, we analyze the difference between the
task a backdoored model intends to accomplish (called backdoor task) and that of its benign counterpart (called primary
task), which indicates the detectability of the backdoored
model, as demonstrated by our experimental study (see Section 4). Between these two tasks, we define the concept of
backdoor similarity – the similarity between the primary and
the backdoor task, by leveraging the task similarity metrics
Backdoor Attack & Detection
Backdoor attack. In our research, we focus on targeted backdoors that cause the backdoor infected model fb to map
trigger-carrying inputs A(x) to the target label t different from
the ground truth label of x [5, 59, 77, 82]:
fb (A(x)) = t 6= fP (x)
(1)
where fP is the benign model that outputs the ground truth
label for x and A is the trigger function that transfers a benign
input to its trigger-carrying counterpart. There are many attack
methods have been proposed to inject backdoors, e.g., [12,
14, 20, 47, 49, 50, 61, 72].
Backdoor detection. The backdoor detection has been extensively studied recently [21, 25, 35, 44, 78]. These proposed
approaches can be categorized based upon their focuses on
different model information: model outputs, model weights
2
used in multi-task learning studies, and further demonstrate
how to compute the similarity in practice. Applying the metric to existing backdoor attacks, we analyze their impacts
on the backdoor similarity, which consequently affects their
stealthiness against detection techniques (see Section 4). We
further present a new algorithm that automatically generates
a backdoored model under a desirable backdoor similarity,
which leads to a stealthier backdoor attack.
separate two distributions can be approximated with a neural
network to distinguish them.
Using the dH −W 1 distance, we can now quantify the similarity between tasks. In particular, dH −W 1 (DT 1 , DT 2 ) =
0 indicates that tasks T 1 and T 2 are identical, and
dH −W 1 (DT 1 , DT 2 ) = 1 indicates that these two tasks are totally different. Without further notice, we consider the task
similarity between T 1 and T 2 as 1 − dH −W 1 (DT 1 , DT 2 ).
3.1
3.2
Task Similarity
Backdoor detection, essentially, is a problem about how to
differentiate between a legitimate task (primary task) a model
is supposed to perform and the compromised task (backdoor
task), which involves backdoor activities, the backdoored
model actually runs. To this end, a detection mechanism needs
to figure out the difference between these two tasks. According to modern learning theory [54], a task can be fully characterized by the distribution on the graph of the function [8] – a
joint distribution on the input space X and the output space Y .
Formally, a task T is characterized by the joint distribution
DT : T := DT (X , Y ) = {PrDT (x, y) : (x, y) ∈ X × Y }. Note
that, for a well-trained model f = c ◦ g (defined in Section 2.2)
for task T , we have g(x)y ≈ PrDT (y|x) for all (x, y) ∈ X × Y .
With this task modeling, the mission of backdoor detection
becomes how to distinguish the distribution of a backdoor task
from that of its primary task. The Fisher’s discriminant theorem [52] tells us that two distributions become easier to distinguish when they are less similar in terms of some distance
metrics, indicating that the distinguishability (or separability)
of two tasks is positively correlated with their distance. This
motivates us to measure the distance between the distributions
of two tasks. For this purpose, we define the dH −W 1 distance,
which covers both Wasserstein-1 distance and H-divergence,
two most common distance metrics for distributions.
Following we first define primary task and backdoor task and
then utilize dH −W 1 to specify backdoor similarity, that is, the
similarity between the primary task and the backdoor task.
Backdoor attack. As mentioned earlier (Section 2.2), the
well-accepted definition of the backdoor attack is specified by
Eq. 1 [5, 13, 57, 59, 72, 77, 82]. According to the definition, the
attack aims to find a trigger function A(·) that maps benign
inputs to their trigger-carrying counterparts and also ensures
that these trigger-carrying inputs are misclassfied to the target
class t by the backdoor infected model fb . In particular, Eq. 1
requires the target class t to be different from the source class
of the benign inputs, i.e., t 6= fP (x). This definition, however,
is problematic, since there exists a trivial trigger function
satisfying Eq. 1, i.e., A(·) simply replaces a benign input
x with another benign input xt in the target class t. Under
this trigger function, even a completely clean model fP (·)
becomes “backdoored”, as it outputs the target label on any
“trigger-carrying” inputs xt = A(x).
Clearly, this trivial trigger function does not introduce any
meaningful backdoor to the victim model, even though it
satisfies Eq. 1. To address this issue, we adjust the objective
of the backdoor attack (Eq. 1) as follows:
fb (A(x)) = t, where fP (x) 6= t 6= fP (A(x)).
where H = {h : X × Y 7→ [0, 1]}.
(2)
Proposition 1.
(D , D ′ )
0 ≤ dH −W 1
≤ 1,
dW 1 (D , D ′ ) ≤ dH −W 1 (D , D ′ ) = 21 dH (D , D ′ ),
(4)
Here, the constraint fP (x) 6= t 6= fP (A(x)) requires that under
the benign model fP , not only the input x but also its triggercarrying version A(x) will not be mapped to the target class t,
thereby excluding the trivial attack mentioned above.
Generally speaking, the trigger function A(·) may not work
on a model’s whole input space. So we introduce the concept
of backdoor region:
Definition 1 (dH −W 1 distance). For two distributions D and
D ′ defined on X × Y , dH −W 1 (D , D ′ ) measures the distance
between them two as:
dH −W 1 (D , D ′ ) = sup [EPrD (x,y) h(x, y) − EPrD ′ (x,y) h(x, y)],
h∈H
Backdoor Similarity
Definition 2 (Backdoor region). The backdoor region B ⊂ X
of a backdoor with the trigger function A(·) is the set of inputs
on which the backdoored model fb satisfy Eq. 4, i.e.,
(
t, 6= fP (A(x)), 6= fP (x), ∀x ∈ B
fb (A(x)) =
fP (A(x)),
∀x ∈ X \ B .
(5)
Accordingly, we denote A(B) = {A(x) : x ∈ B} as the set of
trigger-carrying inputs.
(3)
where dW 1 (D , D ′ ) is the Wasserstein-1 distance [4] between
D and D ′ , and dH (D , D ′ ) is their H -divergence [7].
Proof. See Appendix 10.1
Proposition 1 shows that dH −W 1 is representative: it is the
upper-bound of the Wasserstein-1 distance and the half of
the H -divergence. More importantly, dH −W 1 can be easily
computed: the optimal function h in Eq. 2 that maximally
For example, the backdoor region of a source-agnostic backdoor, which maps the trigger-carrying input A(x) whose label
under the benign model is not t into t, is B = X \ (X fP (x)=t ∪
3
dH −W 1 (DP , DA,B ,t ) =
X fP (A(x))=t ), while the backdoor region for a source-specific
backdoor, which maps the trigger-carrying input A(x) with
the true label of the source class s (6= t) into t, is B =
X fP (x)=s \ X fP (A(x))=t . Here, we use XC to denote the subset of
all elements in X that satisfy the condition C: XC = {x|x ∈
X , C is True}, e.g., X fP (x)=t = {x|x ∈ X , fP (x) = t}.
Definition of the primary and backdoor tasks. Now we
can formally define the primary task and the backdoor task
for a backdoored model. Here we denote the prior probability
of input x (also the probability of presenting x on the primary
task) by Pr(x).
Definition 3 (Primary task & distribution). The primary task
of a backdoored model is TP , the task that its benign counterpart learns to accomplish. TP is characterized by the primary distribution DP , a joint distribution over the input space
X and the output space Y . Specifically, PrDP (x, y) is the
probability of presenting (x, y) in benign scenarios, and thus
PrDP (y|x) = PrDP (x, y)/Pr(x) is the conditional probability
that a benign model strives to approximate.
R
max(Prgain (x, y), 0),
(x,y)∈A(B )×Y
where
Prgain (x, y) = PrDA,B ,t (x, y) − PrDP (x, y).
Proof. See Appendix 10.2.
Theorem 2 shows that the calculation of backdoor distance
dH −W 1 (DP , DA,B ,t ) can be reduced to the calculation of the
probability gain of PrDA,B ,t (x, y) over PrDP (x, y) on those
trigger-carrying inputs A(B ), when ZA,B ,t ≥ 1. Notably, because ZA,B ,t = 1 − Pr(A(B )) + β Pr(B ), ZA,B ,t ≥ 1 is satisfied if Pr(A(B )) ≤ β Pr(B ). This implies that if those triggercarrying inputs show up more often on the backdoor distribution than on the primary distribution, we can use the
aforementioned method to compute the backdoor distance.
Parametrization of backdoor distance. The following
Lemma further reveals the impacts of two parameters β and
κ on the backdoor distance:
Lemma 3. When, ZA,B ,t ≥ 1 and Pr(B ) = κ Pr(A(B )),
R
]
dH −W 1 (DP , DA,B ,t ) = Pr(B )
max(Pr
gain (x, y), 0),
Definition 4 (Backdoor task & distribution). The backdoor task of a backdoored model is denoted by TA,B ,t ,
the task that the adversary intends to accomplish by
training a backdoored model. TA,B ,t is characterized by
the backdoor distribution DA,B ,t , a joint distribution over
X × Y . Specifically, the probability of presenting (x, y)
in DA,B ,t is PrDA,B ,t (x, y) = P(x, y)/ZA,B ,t , where ZA,B ,t =
R
(x,y)∈X ×Y P(x, y) = 1 − Pr(A(B )) + β Pr(B ) and
(
PrDA,B ,t (y|x) Pr(A−1 (x))β, x ∈ A(B )
P(x, y) =
PrDP (x, y),
x ∈ X \ A(B ).
(6)
Here, A−1 (x) = {z|A(z) = x} represents the inverse of the trigger function, PrDA,B ,t (y|x) is the conditional probability that
the adversary desires to train a backdoored model to approximate, β is a parameter selected by the adversary to amplify the
probability that the trigger-carrying inputs A(x) are presented
β
to the backdoor task. Actually, we consider 1+β
as the poisoning rate with the assumption that poisoned training data
is randomly drawn from the backdoor distribution. Finally, it
is worth noting that PrDA,B ,t (x, y) is proportional to PrDP (x, y)
except on those trigger-carrying inputs A(B ).
(x,y)∈A(B )×Y
]
where Pr
gain (x, y) equals to
PrD
(x)
β
A,B ,t
ZA,B ,t PrD
(A(B ))
A,B ,t
Pr(x)
PrDA,B ,t (y|x) − κ1 Pr(A(
B )) PrDP (y|x).
(7)
Proof. The derivation is straightforward, thus we omit it.
As demonstrated by Lemma 3, the two parameters β and κ
are important to the backdoor distance, where β is related to
the poisoning rate (Definition 4) and κ describes how close
is the probability of presenting trigger-carrying inputs to the
probability of showing their benign counterparts on the primary distribution (the bigger κ the farther away are these two
probabilities).
Let us first consider the range of β. Intuitive, a large β
causes the trigger-carrying inputs more likely to show up on
the backdoor distribution, and therefore could be easier detected. A reasonable backdoor attack should keep β smaller
than 1, which is equivalent to constraining the poisoning rate
β
( 1+β
) below 50%. On the other hand, a very small β will make
the backdoor task more difficult to learn by a model, which
eventually reduces the attack success rate (ASR). A reasonable backdoor attack should use a β greater than κ1 : that is,
the chance of seeing trigger-carrying inputs on the backdoor
distribution no lower than that on the primary distribution.
Therefore, we assume κ1 ≤ β ≤ 1. Next, we consider the range
of κ. A reasonable lower-bound of κ is 1; if κ < 1, triggercarrying inputs show up even more often than their benign
counterparts on the primary distribution, which eventually
lets the backdoored model outputs differently from benign
models on such large portion of inputs and make the backdoor
be easy detected. So, we assume κ ≥ 1.
With above assumptions on the range of β and κ, we get the
following theorem to describe the range of backdoor distance.
Formalization of backdoor similarity. Putting together the
definitions of the primary task, the backdoor task, and the
dH −W 1 distance between the two tasks (Eq. 2), we are ready
to define backdoor similarity as follows:
Definition 5 (Backdoor distance & similarity). We define dH −W 1 (DP , DA,B ,t ) as the backdoor distance between
the primary task TP and the backdoor task TA,B ,t and 1 −
dH −W 1 (DP , DA,B ,t ) as the backdoor similarity
Theorem 2 (Computing backdoor distance). When ZA,B ,t ≥
1, where ZA,B ,t is defined in Eq. 6, the backdoor distance
between DP and DA,B ,t is
4
Theorem 4 (Backdoor distance range). Supposing Pr(B ) =
κ Pr(A(B )), when κ ≥ 1 and κ1 ≤ β ≤ 1, we have ZA,B ,t ≥ 1,
( Z β − κ1 (1 − S)) Pr(B ) ≤ dH −W 1 (DP , DA,B ,t ) ≤
A,B ,t
R
where S =
β
ZA,B ,t
max{∆ prob , 0} and
(x,y)∈A(B )×Y
∆ prob =
PrD
PrD
A,B ,t
A,B ,t
(x)
(A(B ))
Pr(x)
PrDA,B ,t (y|x) − Pr(A(
B )) PrDP (y|x).
Also, PrDA,B ,t (y|x) and PrDP (y|x) can be approximated by
a well-trained backdoored model fb = cb ◦ gb and a welltrained benign model fP = cP ◦ gP , respectively, i.e., gb (x)y ≈
Pr(B ),
PrDA,B ,t (y|x) and gP (x)y ≈ PrDP (y|x). Supposing that we have
sampled m trigger-carrying inputs {A(x1 ), A(x2 ), ..., A(xm )},
α can be approximated by:
β
L−1
gb (A(xi ))y − κ1 gP (A(xi ))y , 0}.
α ≈ ∑m
i=1 ∑y=0 max{ Z
A,B ,t
(8)
(9)
In Eq 9, β is chosen by the adversary. Thus, we assume
that β is known, when using α to analyze different backdoor
attacks. Different from β, κ is determined by the trigger function A that distinguishes different backdoor attacks from each
other. Next, we demonstrate how to estimate κ.
Proof. See Appendix 10.3.
Corollary 5 (Effects of β). Supposing Pr(B ) = κ Pr(A(B )),
κ ≥ 1 and κ is fixed, when β varies in range [ κ1 , 1], we have
S Pr(B )
κ
≤ dH −W 1 (DP , DA,B ,t ) ≤
κ Pr(B )
κ+κ Pr(B )−Pr(B ) ,
where S is defined in Theorem 4. Specially, the lowerbound S Pr(κ B ) is achieved when β = κ1 , and the upper-bound
κ Pr(B )
κ+κ Pr(B )−Pr(B ) is achieved when β = 1.
transformations, we get that κ =
V (B ) and V (A(B )) are the volumes of set B and A(B ) respectively. Below, we demonstrate how to estimate Pr(x) and the
(B )
volume ratio κV = V V(A(
B )) separately.
To estimate the prior probability of an input x for the primary task, Pr(x), we employed a Generative Adversarial Network (GAN) [34] and the GAN inversion [81] algorithms.
Specifically, we aim to build a generative network G and a
discriminator network D using adversarial learning: the discriminator D attempts to distinguish the outputs of G and
the inputs (e.g., the training samples) x of the primary task,
while G takes as the input z randomly drawn from a Gaussian
distribution with the variance matrix I, i.e., z ∼ N(0, I) and
attempts to generate the outputs that cannot be distinguished
by D. When the adversarial learning converges, the output of
G approximately follows the prior probability distribution of
x, i.e., Pr(x) ≈ Pr(G(z) = x)). In addition, we incorporated
with a GAN inversion algorithm capable of recovering the
input z of G from a given x, s.t., G(z) = x. Combining the
GAN and the inversion algorithm, we can estimate Pr(x) for
a given x: we first compute z from x using the GAN inversion
algorithm, and then estimate Pr(x) using PrN(0,I) (z).
To estimate the volume ratio κV , we use a Monte Carlo algorithm similar to that proposed by the prior work [32]. Briefly
speaking, for estimating V (B ), we first randomly select an x
in the backdoor region B as the origin, and then uniformly
sample many directions from the origin and approximate
the extent (how long from the origin to the boundary of B )
along these directions, and finally, calculate the expectation
of the extents of these directions as Ext(B ). According to the
prior work [32], V (B ) is approximately equal to the product
of Ext(B ) and the volume of the n dimensional unit sphere,
Ext(B )
assuming B ⊂ Rn . Therefore, we estimate κV by Ext(A(
B )) .
Proof. See Appendix 10.4.
Corollary 6 (Effects of κ). Supposing Pr(B ) = κ Pr(A(B )),
β ≤ 1 and β is fixed, when κ varies in range [ β1 , ∞), we have
Sβ Pr(B ) ≤ dH −W 1 (DP , DA,B ,t ) ≤ β Pr(B ),
where S is defined in Theorem 4. Specially, the lower-bound
Sβ Pr(B ) and the upper-bound β Pr(B ) are achieved, respectively, when κ = β1 .
Proof. See Appendix 10.5.
3.3 α-Backdoor
Definition of α-backdoor. Through Lemma 3 and Theorem 4, we show that the backdoor distance and its boundaries
are proportional to Pr(B ), the probability of showing benign
inputs in the backdoor region B on the prior distribution of inputs. However, different backdoor attacks may have different
backdoor regions, which is a factor we intend to remove so as
to compare the backdoor similarities across different attacks.
For this purpose, here we define α-backdoor, based upon the
same backdoor region B for different attacks, as follows:
Definition 6 (α-backdoor). We define an α-backdoor as a
backdoor whose backdoor distribution is DA,B ,t , primary distribution is DP and the associated backdoor distance equals
to the product of α and Pr(B ), i.e.,
α · Pr(B ) = dH −W 1 (DP , DA,B ,t ).
Approximation of α. Lemma 3 actually provides an approach to approximate α in practice. Specifically, using the
]
symbol Pr
gain that has been defined in Eq. 7, we get a simR
]
max(Pr
ple formulation of α: α =
gain (x, y), 0).
(x,y)∈A(B )×Y
Note that
Pr(x)
Pr(A(B ))
= Pr(x|x ∈ A(B )) and
PrD
PrD
A,B ,t
A,B ,t
(x)
(A(B ))
Pr(B )
Pr(A(B )) . Through trivial
EPr(x|x∈B ) Pr(x)
V (B )
V (A(B )) EPr(x|x∈A(B )) Pr(x) , where
Estimation of κ. Recall that κ =
In general, we estimate κ as
=
PrDA,B ,t (x|x ∈ A(B )). This enables us to approximate α
through sampling only trigger-carrying inputs x ∈ A(B ).
EPr(x|x∈B ) Pr(G−1 (x))
Ext(B )
Ext(A(B )) EPr(x|x∈A(B )) Pr(G−1 (x)) ,
where G−1 (x) represents the output of a GAN inversion algorithm for a given x. We defer the details to Appendix 9.
5
BadNet
SIG
WB
CB
IAB
TSA
2.53
2.99
2.93
17.05
5.97
3.27
the difference in inputs and S characterizes the difference in
outputs between the primary and backdoor distributions.
Specifically, for each attack method, we generated sourcespecific backdoors (source class is 1 and target class is 0)
following the settings described in its original paper but changing β to adjust the poisoning rate. In particular, for β = 0.1,
we injected 500 poisoning samples into the source class (i.e.,
class 1) with a total of 5,000 samples in the training set.
For each backdoored model, we calculated its ASR at different β values as illustrated in Table 1 to demonstrate the
side effect of reducing β on ASR. As we see from the table,
for BadNet, the ASR is 99.97% when β is 0.1, which goes
down with the decrease of the β, until 38.85% when the β
drops to 0.005, rendering the backdoor attack less meaningful.
This also shows the rationale of keeping β ≥ κ1 , as required
in Theorem 4 (here β = 0.005 ≈ κ2 ).
As illustrated in Eq. 9, α is proportional to β, but has a more
complicated relation with κ and S as further demonstrated in
Theorem 4. To study this complicated relation between α and
the parameters other then β, we normalize α by dividing it
with β and present the results in Table 1. Next, we elaborate
our analysis about how existing backdoor attacks reduce α
through controlling these parameters.
Figure 1: Demonstration of trigger-carrying inputs generated
by different attacks. The first row shows attacks’ name, the
second row presents trigger-carrying inputs, the third row
shows triggers, the fourth row shows amplified triggers and
the fifth row illustrates the L2 -norm of triggers.
Table 1: Backdoor similarities of backdoor attacks. ASR
stands for attack success rate, L2 -norm stands for the average
of {kx − A(x)k2 : x ∈ B } after regularizing all x and A(x) into
[0, 1]n . Note that, when β < κ1 , the “α/β” columns represent
the S value (Eq. 8).
β
BadNet [24]
SIG [6]
WB [16]
CB [46]
IAB [55]
TSA (ours)
3.4
0.1
99.97
98.10
83.38
86.69
98.09
99.85
ASR (%)
0.05
0.01
99.18 69.27
81.62 34.57
72.29 32.97
78.18 55.09
92.58 54.37
99.21 92.83
0.005
38.85
9.88
7.69
45.92
20.13
79.07
0.1
0.98
0.99
0.67
1.00
1.00
0.37
0.05
0.98
0.98
0.49
1.00
1.00
0.34
α/β
0.01
0.97
0.96
0.23
1.00
1.00
0.32
0.005
0.95
0.94
0.18
1.00
1.00
0.25
ln(κ)
All
5.98
6.05
4.01
17.93
10.72
3.07
L2 -norm
All
2.37
2.72
2.86
17.37
5.96
3.13
Visually-unrecognizable backdoors (BadNet). This kind of
backdoor attacks generate trigger-carrying inputs visually
similar to their benign counterparts, in an attempt to evade the
human inspection for anomalous input patterns. Generally,
visually-unrecognizable backdoors constrain the L p -norm of
the trigger, i.e., kA(x) − xk p , to be smaller than a threshold.
Essentially, reducing kA(x) − xk p is to reduce | Pr(x) −
Pr(A(x))|, the difference between the probability of presenting a trigger-carrying input and the probability of presenting its benign counterpart. This is because | Pr(x) − Pr(x +
δ)| ∝ kδk p , when the perturbation δ is small and the prior
distribution of inputs is some kind of smooth. Recall that
κ = Pr(B )/ Pr(A(B )), thus reducing kA(x) − xk p can reduce
κ, as demonstrated in the last two columns of Table 1. However, making κ small alone cannot effectively reduce the α as
demonstrated by Corollary 6. Thus, visually-unrecognizable
backdoors only marginally reduce α and moderately increase
the backdoor similarity, as observed by our analysis on BadNet (Table 1), which only lowers down α/β (the normalized
α) by 0.05 to 0.95 when β = 0.005.
Analysis on Existing Backdoor Attacks
Existing stealthy backdoor attack methods can be summarized
into five categories: visually-unrecognizable backdoors, labelconsistent backdoors, latent-space backdoors, benign-feature
backdoors and sample-specific backdoors. In this section,
we report a backdoor similarity analysis on these backdoor
attacks, which is important to understanding their stealthiness,
given the positive correlation between backdoor distance and
detectability we discovered (Section 4).
We compare the backdoor distance of backdoored models
generated by 5 different attacks, each representing a different category, on CIFAR10 [39]. As mentioned earlier (Theorem 4), the backdoor distance described by α is related to β, κ
and S, where β is proportional to the poisoning rate (see Definition 4), that describes the adversary’s aspiration about how
likely those trigger-carrying inputs present in the backdoor
distribution in comparison with the probability of showing
their benign counterparts in the primary distribution, κ also
measures the difference between the probability of showing
those trigger-carrying inputs and their benign counterparts
however within the primary distribution, and S summarizes
the conditional probability gain of the outputs given those
trigger-carrying inputs obtained on the backdoor distribution
compared with such conditional probability on the primary
distribution. In simple words, β and κ together characterizes
Label-consistent backdoor (SIG). The label-consistent
backdoor attacks inject a backdoor into the victim model
with only label-consistent inputs generated by pasting the
trigger onto the vague (i.e., hard to be classified) inputs, in
an attempt to increase the stealthiness against human inspection. Specifically, prior research [75] proposes to use GAN
or adversarial examples to get hard-to-classify inputs, while
SIG [6] utilizes a more inconspicuous trigger (small waves).
However, we found that label-consistent backdoors do not
reduce α more effectively, than the naive label-flipped back6
doors (e.g., BadNet), because injecting a backdoor through
label-consistent way has changed neither κ nor S of this backdoor task away from that of injecting this backdoor through
label-flipped way, as observed in our experiments where similar α/β (the normalized α) exhibited by these two types of
backdoors (see the “SIG” and the “BadNet” rows in Table 1).
Specifically, the BadNet and SIG attacks accomplished their
backdoor tasks using similar triggers in terms of L2 -norm:
BadNet uses a trigger with the L2 -norm of 2.37 and SIG utilizes a trigger with L2 -norm of 2.72. Apparently, the α/β
values for the SIG and those for the BadNet are similar at all
β values we tested.
directly reduce S (defined in Eq. 8), but in the meantime,
increase κ (since the trigger-carrying inputs becomes less
likely to see from the primary distribution), and thus may
not reduce the backdoor distance eventually, which has been
shown by the “CB” row of Table 1. In addition, the benignfeature backdoors also increase the difficulty in learning the
backdoor task (only 83.38% ASR achieved when β = 0.1).
Sample-specific backdoors (IAB) The sample-specific backdoor attacks design the trigger specific to each input. As a
result, if an input is given an inappropriate trigger, it will
not trigger the backdoor. This kind of backdoors are designed to evade trigger inversion by increasing the difficulty
in reconstructing the true trigger. The Input Aware Backdoor
(IAB) [55] is a representative work in this category, which
uses a trigger generation network to produce a sample-specific
trigger. The attack methods proposed in the prior work [45]
and [65] also belong to this category.
A sample-specific backdoor requires that the trigger carries more information than the trigger of sample-agnostic
backdoors, so as to enable the backdoored model to learn
the complicated relations between triggers and the inputs.
Thus, the trigger of the sample-specific backdoors may come
with a large L2 -norm. As presented in Table 1, the L2 -norm
of the trigger used by the IAB backdoor is 5.96, more than
twice of the trigger for BadNet (2.37) in terms of L2 -norm.
Such a large trigger renders the trigger-carrying inputs less
likely to observe from the primary distribution, thereby reducing the similarity between the probability of seeing benign
inputs and the probability of seeing trigger-carrying inputs
on the primary distribution, and leading to the increase in κ
(ln(κ) = 10.72) and the α/β (the normalized α).
Latent-space backdoors (WB). The latent-space backdoor
attacks aim to make the backdoored model produce similar
latent features for trigger-carrying inputs and benign inputs.
Prior research [84] proposes to use this idea to generate a
student model that learns the backdoor injected in the teacher
model under the transfer learning scenario. Later, this idea
has been employed by the Wasserstein Backdoor (WB) [16]
to increase the backdoor stealthiness against the latent space
defense (e.g., AC [11]). Specifically, WB makes the distribution of the penultimate layer’s outputs (latent features) of
trigger-carrying inputs as close to those of benign inputs as
possible in terms of the sliced-Wasserstein distance [37].
Making latent features of trigger-carrying inputs and benign
inputs be close is essentially to reduce S (defined in Eq. 8),
the expectation of the conditional probability gain obtained
by the backdoored model on trigger-carrying inputs, which is
actually the lower-bound of the α when β = κ1 (Corollary 6).
In this way, WB effectively reduces α/β (the normalized α)
compared with other four types of backdoors as demonstrated
in the “WB” row of Table 1. However, this α/β reduction
achieved by WB comes with the cost of low ASRs, especially
when β is low (ASR is only 7.69% when β = 0.005), which
indicates that reducing S may make the trigger harder to learn.
3.5
New Attack
In Section 3.4 and Table 1, we illustrated that the existing
backdoor attacks did not effectively reduce the backdoor distance while keeping high attack success rate (ASR). Our
analysis revealed that it is mainly due to three points: 1) most
of these attacks did not reduce κ to a small value (e.g., in
BadNet, SIG, CB and IAB); 2) the complicated triggers used
by many attacks make the backdoor task hard to be learned
(e.g., WB, CB and IAB); and 3) some missed to reduce S (e.g.,
BadNet, SIG, IAB).
To address these issues, we aim to devise a new attack
method that can handle all these points at one time. To reduce
κ, the adversary should use a trigger function that maps a
benign input to its close neighbor in terms of not only their
L p -norm and but also their probabilities to be presented by
the primary task. Using the trigger-carrying inputs with small
L p -norm from the benign inputs may not unnecessarily lead
to small κ; in fact, as shown in Table 1, the trigger used by
BadNet lead to the trigger-carrying inputs with smaller L2 norms but higher κ compared to those by WB. On the other
hand, to reduce S, the adversary should enable the backdoored
model to generate similar conditional probabilities as the
Benign-feature backdoors (CB) Benign-feature backdoor
attacks aim to produce backdoored models that leverage features similar to those used by a benign model by constructing
a trigger with a composite of benign features, thereby increasing the stealthiness of the backdoor against the backdoor
detection techniques that distinguish the weights of backdoored models from those of benign models (e.g., ABS [48]).
A representative work in this category is Composite Backdoor (CB) [46], which mixes two benign inputs from specific
classes into one, and then trains the backdoored model to
predict the target labels on these mixed inputs. In another
example [51], the adversary constructs the trigger using the
reflection features hiding in the input images.
The training inputs with benign features from those in different classes could render the marginal backdoor distribution
on inputs significantly deviating from the distribution of benign inputs, making this backdoor even easier to detect. When
it comes to backdoor similarity, benign-feature backdoors in7
LC (A, ζ, ω) = LA,B ,t ( fP , −α∗ , β) + ω max{LA (C) − ζ, 0},
benign models of the outputs given those trigger-carrying
inputs. Finally, the adversary should use a trigger function
that can be easily learned; using a complex trigger function
as used by WB lead to the backdoored model with low ASR
when the poison rate is low (i.e., β is small).
At a first glance, it appears impossible to reduce κ and S
simultaneously, as a perfect benign model produces similar
outputs for similar inputs and, thus, it always produces different conditional probabilities from the backdoored model
of the outputs given those trigger-carrying inputs. In practice, however, the benign models may not be perfect (highly
robust), which may produce very different outputs even for
similar inputs, e.g., the adversarial samples [71], making it
possible to reduce κ and S simultaneously. Together with the
trick to make trigger be easy to learn, we the TSA attack
which details are illustrated in Algorithm 1.
(12)
which searches for the trigger function A that maps the benign inputs to the trigger-carrying inputs close to the classification boundary in fP while penalizing those functions A
with LA (C) > ζ by incorporating the penalty term with the
weight ω. Finally, in line-7, we use the refined trigger function
A to poison the training data, which is then used to train a
backdoored model fb by minimizing LA,B ,t ( fb , α∗ , β) and a
regularization term:
LA,B ,t ( f , α∗ , β) + kgb (xC ) − gP (xC )k2
(13)
where xC = arg max kgb (x) − gP (x)k2 .
x∈X /A(B )
Here, the regularization term is designed to seek fb that minimizes the maximum difference between the outputs of fb and
fP for the inputs without the trigger.
Empirically, we used a LeNet-5 [40] network as C, and
an UNet [62] as the trigger function A. Besides, we set
epochad j = 3, δ = 0.1, ζ = 0.1 and ω = 0.1. We used an
Adam [36] optimizer with the learning rate of 1e−3 to train
model weights. We implemented our method based on the PyTorch framework and integrated our code into TrojanZoo [57].
In our experiments, we used Algorithm 1 to generate backdoored models on the CIFAR10 dataset and demonstrate the
results in the last row of Table 1. We observe that the TSA
backdoor not only achieved much better ASR (79.07%) than
previous attacks (≤ 50%) even when β is as small as 0.005,
but also smaller backdoor distance then other attacks at the
meanwhile. This could be ascribed to several advantages of
our approach. First, the trigger function refinement (line-3
to -6) helps to derive a trigger function that is easy to learn.
Second, the L p -norm constraint lets the TSA backdoor has
small κ, which reduces the lower-bound of the backdoor distance (Corollary 6), and thus allows for the reduction of the
backdoor distance by manipulating S (Eq. 8). Furthermore,
the TSA backdoor attack manages to control the backdoor
distance through manipulating S (line-7) for a given α∗ , which
enables the TSA backdoor to achieve small backdoor distance
on all β values. In simple words, TSA backdoor maps the
benign inputs to the trigger-carrying inputs close to both the
classification boundary (controlled by α∗ ) and the original benign inputs (controlled by δ) through a easy-to-learn trigger.
Algorithm 1 TSA attack.
Input: Dtr , B , t, α∗ , β, epochad j , δ, ζ, ω
Output: A(·), fb
1: Train a benign model f P on Dtr
2: Train A with LA,B ,t ( f P , α∗ , β) (Eq. 10) and δ constraint
3: for _ in range(epochad j ) do
4:
Train C with LA (C) (Eq. 11)
5:
Update A with LC (A, ζ, ω) (Eq. 12)
6: end for
7: Train f b on Dtr to minimize Eq. 13
First, in line-1, we train a benign model fP on a given training set Dtr . In line-2, for the benign model fP , we optimize
trigger function A to minimize LA,B ,t ( fP , −α∗ , β) such that
kA(x) − xk2 ≤ δ, where
LA,B ,t ( f , α∗ , β) = E(x,y)∈X ×Y Lce ( f (x), y)−
∗
∗
1−α
βE(x,y)∈B ×Y ( 1+α
2 log(g(A(x))t ) + 2 log(g(A(x))y )).
(10)
Here, we assume f = c ◦ g as described in Section 2.1. The
loss function LA,B ,t is the sum of the loss for the primary task
of f on the clean inputs and the loss for the backdoor task
of f on the trigger-carrying inputs weighted by β. The initial
trigger function A is an optimized variable, which is trained to
minimize LA,B ,t ( fP , −α∗ , β) while satisfying the δ constraint,
such that this initial trigger function maps the benign inputs
to the trigger-carrying inputs in the region of the same class
label but close to the classification boundary in fP . Then in
line-3 to -6, we iteratively refine the trigger function and to
make the backdoor task more easily learned. Specifically, we
train a small classification network C to distinguish the triggercarrying inputs from their benign counterparts by minimizing
the loss function:
LA (C) = −Ex∈B log(C(A(x))) + log(1 −C(x)).
(11)
4
TSA on Backdoor Detection
In the last section, we show that current backdoor attacks
are designed for evading specific backdoor detection methods, and do not effectively reduce the backdoor distance that
measures how close the backdoor and the primary tasks of a
backdoored model are. We further proposed a new TSA attack
to strategically reduce the backdoor distance and create more
stealthy backdoors. In this section, we demonstrate that the
backdoor distance is closely related to the backdoor detectability: the backdoors with small backdoor distance are hard to
detect, through both theoretical and experimental analysis.
The poor performance of C (i.e., LA (C) > ζ) indicates that
the current trigger function A is hard to learn, and then A is
refined to minimize the loss function (line-5):
8
First, through theoretical analysis, we demonstrate how the
backdoor distance affects the evasiveness of a backdoor from
detection methods in each of the three classes (Section 2.2):
detection on model outputs, detection on model weights and
detection on model inputs, respectively. Next, we show that in
practice, by reducing the backdoor distance, the detectability
of a backdoor indeed becomes lower.
4.1
small, the difference of the outputs between a backdoored
model and a benign model becomes small as well. Considering the randomness involved in the training process, when
this difference is small, it is hard to distinguish backdoored
models from benign models. Therefore, these approaches of
detection on model outputs often suffer from false positives,
and thus achieve low detection accuracy on the backdoors
with small backdoor distance.
Detection on Model Outputs
4.2
The first class of backdoor detection methods, herein referred
to as the detection on model outputs, attempt to capture backdoored models by detecting the difference between the outputs of the backdoored models and the benign models on
some inputs. One kind of methods in this class are those
methods based on trigger reversion algorithm, which first reconstruct triggers and then check whether a model exhibits
backdoor behaviors in response to these triggers. In other
words, the objective of these methods is to identify some inputs on which the outputs of the backdoored models and of
the benign models are different. When the difference becomes
small, however, these methods often become less effective.
For example, K-ARM [77] failed to detect the TSA backdoor
(Section 4.4). Notably, the backdoored model generated by
the TSA attack produce the similar outputs as the benign models on the trigger-carrying inputs, and consequently, K-ARM
cannot distinguish these two types of models based on the
reconstructed trigger candidates, even though the L p -norm of
those triggers injected by the TSA attack is as small as desired by K-ARM (K-ARM is designed for detecting triggers
smaller than a given maximum size). MNTD [83] is another
method in this class, which searches for the inputs on which
the backdoored models and the benign models generate the
most different outputs.
Formally, we consider the goal of detections in this class
is to check whether a trigger function A(·) can be found that
maps the inputs to a region A(B ), where the outputs of the
backdoored model fb is most different from the outputs of a
benign model fP (e.g., gb (A(x))t ≫ gP (A(x))t for the target
label t). This goal becomes hard to achieve when the backdoor
distance is small, as demonstrated in the following lemma.
Detection on Model Weights
The second class of detection approaches, herein referred
to as the detection on model weights, attempt to detect a
backdoored model through distinguishing its model weights
from those of benign models. Formally, we consider the goal
of detection methods in this class as to verify whether the
minimum distance between the weights of a candidate backdoored model ωb and the weights of a benign model in a set
{ωP } exceeds a pre-determined threshold θω , i.e., whether
minω∈{ωP } kω − ωb k2 > θω .
To study the difference between the weights of two models, we formulate it as the weight evolution problem in continual learning [73]. Specifically, we consider two tasks, TP
and TA,B ,t , for which the benign model fP = f (· : ωP ) with
the weights ωP and the backdoored model fb = f (· : ωb )
with the weights ωb learn to accomplish, respectively. We
then analyze the change of ωP → ωb through the continual
learning process TP → TA,B ,t . Based on the Neural Tangent
Kernel (NTK) [33] theory, existing work [41] has showed
that, fb (x) = fP (x)+ < φ(x), ωb − ωP > where φ(x) is the
kernel function and φ(x) = ▽ω0 f (x; ω0 ), which is dependent
only on some weights ω0 . Furthermore, recent research [18]
has shown that kδTP →TA,B ,t (X)k22 = kφ(X)(ωb −ωP )k22 , where
δTP →TA,B ,t (X) is the so-called task drift from TP to TA,B ,t ,
kδTP →TA,B ,t (X)k22 := Σ k fb (x) − fP (x)k22 . Based on these rex∈X
sults, we connect the distance between ωP and ωb to the
backdoor distance through the following lemma.
Lemma 8. When A is fixed and β = κ1 , for a well-trained
backdoored model fb = f (· : ωb ), and a well-trained benign
model fP = f (· : ωP ), we have
√
κ mL
kφ(X)k2 α
Lemma 7. When A is fixed and β = κ1 , for a well-trained
backdoored model fb = cb ◦ gb and a well-trained benign
model fP = cP ◦ gP , s.t., gb (x)y ≈ PrA,B ,t (y|x) and gP (x)y ≈
Pr(y|x) for all (x, y) ∈ X × Y , we have
EPrA,B ,t (A(x)|x∈B ) gb (A(x))t − EPr(A(x)|x∈B ) gP (A(x))t ≤ ακ
≤ kωb − ωP k2 .
where X = {x1 , x2 , ..., xm } is a set of m inputs in L classes and
φ(·) is the kernel function.
Proof. See Appendix 10.6.
Proof. Using Corollary 5, one can derive the desired.
Specifically, Lemma 7 demonstrates that, when the adversary has chosen a trigger function A and set β = κ1 , which
minimizes the backdoor distance in reasonable settings, α is
proportional to the upper bound of the difference between the
expected outputs of a backdoored model and that of a benign
model about the probability of a trigger-carrying input in the
target class. In other words, when the backdoor distance is
Lemma 8 demonstrates that, when the adversary has chosen a trigger function A and set β = κ1 , which minimizes the
backdoor distance in reasonable settings, α is proportional
to the lower bound of the distance between the weights ωb
and ωP in term of L2 -norm. In other words, to ensure the
weights of the backdoor models ωb is close to the weights
of the benign models ωP , which lead to the backdoors more
difficult to be detected by the methods of detection on model
9
weights, the adversary should design a backdoor with small
backdoor distance.
4.3
TND [79], SCAn [72] and AC [11], to defend the backdoors
injected by 6 backdoor attack methods (Section 3.4) on 4
datasets: CIFAR10 [39], GTSRB [30], ImageNet [15] and
VGGFace2 [10]. On each dataset, we generated 200 benign
models as the control group. For each backdoor attack, we
used it to generate 200 backdoored models on every dataset.
Specifically, in each of the backdoored models, a backdoor
was injected with a randomly chosen source class and a randomly chosen target class (different from the source class).
We fixed the number of poisoning samples to be equal to 10%
of the total number of training samples in the source class,
i.e., β = 0.1. Under these settings, we trained the backdoored
model that achieved > 80% ASR for all 6 attack methods
on all 4 datasets. In total, we generated 800 benign models
and 4800 backdoored models on all 4 datasets. To evaluate
a backdoor detection method on each dataset, we ran it to
distinguish 200 benign models (trained on this dataset) from
200 backdoored models generated by each attack method.
Overall, we performed a total of 144 (= 4 × 6 × 6) evaluations on all 4 datasets for all 6 detections against all 6 attacks.
To train a model (benign or backdoored), we used the model
structure randomly selected from these four: ResNet [28],
VGG16 [69], ShuffleNet [87] and googlenet [70]. We used
the Adam [36] optimizer with the learning rate of 1e−2 until
the model converges (e.g., ∼ 50 epochs on CIFAR10).
Detection on Model Inputs
The third class of the detection methods, the detection on
model inputs, attempt to identify a backdoored model through
detecting the difference between inputs on that the backdoored model and benign models generate similar outputs.
An prominent example of this category is SCAn [72], which
checks whether the inputs predicted by a backdoored model
as belonging to the same class can be well separated into
two groups (modeled as two distinct distributions), while the
inputs predicted by a benign model as belonging to the same
class come from a single group (modeled as a single distribution). Similar idea was also exploited by AC [11].
We formulate this class of methods as a hypothesis test that
evaluates whether the two distributions, characterized by two
sets XP and Xb , respectively, on which the benign model fP
and the backdoored model fb share the same prediction, are
significantly different, where XP = { fP′ (xi ) : i = 1, 2, ..., nP }
and Xb = { fb′ (xi ) : i = 1, 2, ..., nb }, fP′ (x) and fb′ (x) are the
intermediate results of fP (x) and fb (x), respectively, for an
input x. For instance, fb′ (x) could be the j-th layer’s outputs
of fb in a multi-layer neural network.
Without loss of the generality, we adopt a two-sample
Hotelling’s T-square test [29] for this hypothesis test, which
tests whether the means of two distributions are significantly
different. Here, we consider the test statistic T 2 , which is calculated from the samples drawn from the two distributions,
and is then compared with a pre-selected threshold according
to a desirable confidence. The smaller T 2 , the less probable
these two distributions are different in terms of their means.
The following lemma demonstrates how the test statistic T 2
has an upper-bound related to the backdoor distance.
Detection on model outputs. We tested two representative
detection methods in this category: K-ARM and MNTD. KARM is one of the winning solutions in TrojAI Competition [2]. It could be viewed as an enhanced version of Neural
Cleanse (NC). It cooperates with a reinforcement learning
algorithm to efficiently explore many trigger candidates with
different norm and different shape using the trigger reversion
algorithm (as used in NC), and thus increases the chance to
identify the true trigger. As mentioned by the authors of KARM [68], it significantly outperforms NC. Hence, here, we
evaluated K-ARM instead of NC. MNTD is another representative method in this category. It has been taken as the
standard detection method in the Trojan Detection Challenge
(TDC) [3], a NeurIPS 2022 competition. Specifically, MNTD
detects the backdoored models by finding some inputs on
which the outputs of the backdoored model are most different
from the outputs of the benign models.
Table 2 illustrates that K-ARM works poorly on SIG, CB
and TSA, three backdoor attacks using widespread triggers
that may affect the whole inputs (even with small L2 -norm).
MNTD performs well on the backdoor attacks except on
TSA, indicating existing attack methods somehow make the
outputs of backdoored models are distant from the outputs
of the benign models on many inputs. On the other hand, the
outputs of the backdoored model generated by TSA are close
to the outputs of benign models. This also helps TSA perform
well on TDC competition.1
Lemma 9. When A is fixed and β = κ1 , for a well-trained
backdoored model fb and a well-trained benign model fP ,
if Xb ∼ N (mb , Σ) and XP ∼ N (mP , Σ) and nP and nb are
sufficiently large, we have
nb 2
α .
T 2 ≤ λmax nnPP+n
b
where λmax is the largest eigenvalue of Σ−1 .
Proof. See Appendix 10.7.
Lemma 9 demonstrates that, when the adversary has chosen a trigger function A and set β = κ1 , which minimizes the
backdoor distance in reasonable settings, α2 is proportional
to the upper bound of the test statistic T 2 . This implies, when
the backdoor distance is small, it is difficult to distinguish the
distribution of Xb from the distribution of XP , resulting in the
poor accuracy of detecting backdoor on model inputs.
4.4
Experiments: Detection vs. Attack
To investigate the performance of these three kinds of detection methods against backdoor attacks, we evaluated 6 backdoor detection methods: K-ARM [68], MNTD [83], ABS [48],
1 On
10
TDC, the TSA attack reduced the detection AUC of MNTD to
Table 2: The accuracies (%) of the detection-on-modeloutputs methods. C-rows stand for results on CIFAR10, Grows stand for results on GTSRB, I-rows stand for results on
ImageNet and V -rows stand for results on VGGFace2.
C
G
I
V
C
G
I
V
K-ARM
MNTD
BadNet
100
100
95.50
96.25
100
100
97.75
98.75
SIG
61.75
63.25
56.50
59.25
99.75
99.25
98.75
99.00
WB
79.50
82.25
75.00
76.50
86.00
85.50
84.25
85.25
CB
57.25
60.50
53.75
67.25
100
99.50
97.25
98.25
IAB
80.25
79.75
75.00
80.75
98.25
99.50
97.25
98.75
when defending against WB and CB, indicating the similarity
between the untargeted universal perturbation and targeted
per-input perturbations is a more general signal of the short
cut comparing to a single dominant neuron exploited by ABS.
Detection on model inputs. We tested two representative detection methods in this category: SCAn and AC. SCAn detects the backdoor by checking whether the representations (outputs of the penultimate layer) of inputs in a single class come from a mixture of two distributions, with the help of a so-called global variance matrix that captures how the representations of inputs in different classes vary. SCAn first computes the global variance matrix on a clean dataset, then computes a score for each class, and finally checks whether any class has an abnormally high score. If such a class exists, SCAn reports the model as backdoored and the abnormal class as the target class. Similarly, AC detects the backdoor by checking whether the representations of one class can be well separated into two groups. Specifically, for each class, AC first embeds the high-dimensional representations into 10-dimensional vectors and then computes the Silhouette score [63] to measure how well the 2-means algorithm can separate these vectors.
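As an illustration of the AC-style check described above, here is a minimal sketch under our own assumptions (not the official implementation), using scikit-learn to embed one class's representations into 10 dimensions, split them with 2-means, and score the split with the Silhouette score; a high score suggests two well-separated groups, i.e., a possibly poisoned class.

    import numpy as np
    from sklearn.decomposition import FastICA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def class_separation_score(reps: np.ndarray, seed: int = 0) -> float:
        """reps: (n_samples, n_features) representations of inputs of one class."""
        emb = FastICA(n_components=10, random_state=seed).fit_transform(reps)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(emb)
        return silhouette_score(emb, labels)

    # Flag the class with the highest score as a candidate target class, e.g.:
    # scores = {c: class_separation_score(reps_by_class[c]) for c in classes}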
Detection on model weights. We tested two representative detection methods in this category: ABS and TND. ABS observes that when a backdoor is injected into a model, it also introduces a shortcut, through which a trigger-carrying input will easily be predicted as belonging to the target class by the backdoored model. Specifically, this shortcut is characterized by some neurons that are intensively activated by the trigger on the input and then have a dramatic impact on the prediction. To detect this shortcut, ABS first labels those neurons whose activation results in an abnormally large change in the predicted label of the backdoored model for some inputs, and then, for each labeled neuron, seeks a trigger that can activate this neuron abnormally and consistently change the predicted label for a range of inputs. ABS raises an alarm for a backdoored model if such a neuron coupled with a trigger is found. TND explores another phenomenon related to the shortcut in a model. In particular, TND found that the untargeted universal perturbation is similar to the targeted per-input perturbations in backdoored models, while they are different in benign models. Hence, TND raises an alarm for a backdoored model if such similarity is significant.
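To illustrate the signal TND exploits, the following is a hedged sketch under our own assumptions (not TND's actual algorithm): craft an untargeted universal perturbation shared by a batch and targeted per-input perturbations toward a candidate target class with a few signed-gradient steps, then compare them by cosine similarity; in a backdoored model the two tend to align. The model is assumed to return logits; all step sizes and budgets are illustrative.

    import torch
    import torch.nn.functional as F

    def universal_perturbation(model, batch, steps=10, eps=0.05, lr=0.01):
        # One shared (untargeted) perturbation for the whole batch.
        preds = model(batch).argmax(dim=1)
        delta = torch.zeros_like(batch[:1], requires_grad=True)
        for _ in range(steps):
            loss = F.cross_entropy(model(batch + delta), preds)  # maximize
            loss.backward()
            with torch.no_grad():
                delta += lr * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
        return delta.detach()

    def targeted_perturbation(model, x, target, steps=10, eps=0.05, lr=0.01):
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(steps):
            loss = F.cross_entropy(model(x + delta), target)     # minimize
            loss.backward()
            with torch.no_grad():
                delta -= lr * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
        return delta.detach()

    def perturbation_similarity(model, batch, target_class):
        uni = universal_perturbation(model, batch).flatten()
        sims = []
        for x in batch:
            tgt = targeted_perturbation(model, x.unsqueeze(0),
                                        torch.tensor([target_class])).flatten()
            sims.append(F.cosine_similarity(uni, tgt, dim=0))
        return torch.stack(sims).mean()   # high mean similarity -> suspicious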
Table 3: The accuracies (%) of the detection-on-model-weights methods. C-rows stand for results on CIFAR10, G-rows for results on GTSRB, I-rows for results on ImageNet and V-rows for results on VGGFace2.

             BadNet   SIG      WB       CB       IAB      TSA
    ABS   C  100      95.50    59.75    62.75    58.75    51.00
          G  100      94.50    61.00    61.75    59.00    49.25
          I  94.75    89.75    56.25    58.00    55.50    51.75
          V  98.25    91.25    56.25    59.25    54.75    52.25
    TND   C  100      99.75    67.00    73.75    53.00    48.75
          G  100      99.25    64.25    72.50    52.50    50.25
          I  94.50    93.25    62.00    69.25    49.75    51.50
          V  96.00    92.75    63.50    71.00    50.25    51.75

Table 4: The accuracies (%) of the detection-on-model-inputs methods. C-rows stand for results on CIFAR10, G-rows for results on GTSRB, I-rows for results on ImageNet and V-rows for results on VGGFace2.

             BadNet   SIG      WB       CB       IAB      TSA
    SCAn  C  100      100      70.25    95.25    74.25    63.25
          G  100      100      69.00    97.00    74.75    61.75
          I  94.25    91.25    62.75    88.00    67.75    59.00
          V  95.75    92.00    66.25    89.50    69.50    60.25
    AC    C  98.00    99.00    59.75    90.00    65.75    55.25
          G  98.50    99.25    59.25    91.50    66.50    55.25
          I  91.75    95.50    55.75    86.25    59.75    52.75
          V  92.25    96.25    57.25    88.00    62.50    53.50
Table 3 illustrates that ABS has difficulty defending against WB, CB, IAB and TSA, perhaps because these attacks influence many neurons in the victim models, so that no single neuron changes the predicted label by itself. On the other hand, TND performs better than ABS when defending against WB and CB, indicating that the similarity between the untargeted universal perturbation and the targeted per-input perturbations is a more general signal of the shortcut than the single dominant neuron exploited by ABS.

Table 4 demonstrates that SCAn achieved better accuracy against all 6 attacks than AC. However, both SCAn and AC performed poorly on WB, IAB and TSA, the three attacks that attempt to mix the representations of trigger-carrying inputs with those of benign inputs.

Taking all these results together, we conclude that an attack exhibits different evasiveness against different detection methods. Even for TSA, although the detection accuracy of 4 out of the 6 detection methods is as low as about 52%, the two other methods (K-ARM and SCAn) retain about 60% accuracy against it. This illustrates the demand for a general measurement of how well a backdoor attack can evade different detection methods (including novel methods unknown to the adversary), since in practice the defender may adopt a cocktail approach that combines different methods to detect backdoors. We believe the backdoor distance is a promising candidate for such a measurement, as it accurately reflects the low detection accuracy on the TSA and WB backdoored models across all detection methods, consistent with their low backdoor distances (Table 1) compared to the other 4 attack methods. Below, we aim to further illustrate their connection.
4.5 Experiments: Detectability vs. Similarity

Our experiments in Section 4.4 indicate that the backdoor distance is a potentially good measurement of backdoor detectability (as defined below). Specifically, backdoor attacks that achieve a small backdoor distance are hard to detect, which is in line with our theoretical analysis (Sections 4.1, 4.2 and 4.3). In this section, we report experimental results showing that backdoors with a small backdoor distance indeed have low detectability, and thus that the backdoor distance is indeed a good indicator of backdoor detectability.

Definition 7 (Backdoor detectability). The detectability of the backdoor generated by a backdoor attack method is the maximum accuracy that backdoor detection methods can achieve in distinguishing the backdoored model from benign models. For convenience, we rescale the detectability to lie between 0 and 1, i.e., γ = |acc − 0.5| × 2, where γ is the detectability and acc is the maximum accuracy.

To evaluate the relationship between the backdoor detectability and the backdoor distance for each attack method, we took the maximum detection accuracy obtained among the 6 detection methods (Section 4.4) and calculated the detectability according to the above definition. We also approximated the backdoor distance of these 6 attacks on the 4 datasets using our approximation method (Section 3.3) with the help of StyleGAN2 models [34]. Specifically, we used the officially pretrained StyleGAN2 models for CIFAR10 and ImageNet, trained a StyleGAN2 model with its original code for GTSRB, and trained a StyleGAN2 model with code [1] available online for VGGFace2. Our results are illustrated in Table 5.

Table 5: Detectability for attacks. C-rows stand for results on CIFAR10, G-rows for results on GTSRB, I-rows for results on ImageNet and V-rows for results on VGGFace2. The "Det" columns report the backdoor detectability; the "α/β" columns report the backdoor distance (Corollary 5). We keep β = 0.1 for all cells.

        BadNet        SIG           WB            CB            IAB           Ours
        Det   α/β     Det   α/β     Det   α/β     Det   α/β     Det   α/β     Det   α/β
    C   1.00  0.98    1.00  0.99    0.72  0.67    1.00  1.00    0.97  1.00    0.27  0.37
    G   1.00  0.96    1.00  1.00    0.71  0.61    0.99  1.00    0.99  1.00    0.25  0.41
    I   0.96  0.92    0.98  0.98    0.69  0.66    0.96  0.99    1.00  0.99    0.18  0.38
    V   0.98  0.95    0.98  0.99    0.71  0.65    0.97  1.00    0.98  1.00    0.30  0.35

From Table 5, we observe that the backdoor detectability is roughly equal to the backdoor distance (depicted by α/β). Numerically, the Pearson correlation coefficient [42] between them is 0.9777, the mean absolute difference between them is 0.0450 and its standard deviation is 0.0498. These numbers demonstrate that the backdoor distance is highly correlated with, and a good indicator of, the backdoor detectability, whether α/β is close to 1, about 0.64 (WB) or around 0.37 (TSA).

To evaluate this relationship at a wider range of backdoor distances, we used the TSA backdoor attack method (Algorithm 1) with β = 0.1 to generate backdoors with different backdoor distances by adjusting the parameter α*. Specifically, we performed this experiment on CIFAR10 with 9 different α* values ranging from 0.1 to 0.9. For each α*, we generated 200 backdoored models and combined them with previously generated benign models for CIFAR10 to build a testing dataset of 400 models (200 backdoored and 200 benign). On each testing dataset, we applied the SCAn and K-ARM detection methods, the two comparably effective detection methods against the TSA backdoor attack (Section 4.4), to distinguish the TSA backdoored models from the benign models. Figure 2 shows these detection results and the backdoor distances we estimated for each α* value.

Figure 2: Backdoor distance and detectability of backdoors generated by using different α* values.

From Figure 2, we observe that the backdoor detectability (blue line) of TSA increases along with the increasing backdoor distance (red line, characterized by α/β). Notably, a small difference exists between these two lines, for two reasons: 1) imprecise estimation of the backdoor distance; and 2) the absence of an effective detection method against TSA backdoors. Specifically, when α* ≥ 0.8, the backdoor distance is lower than the backdoor detectability, illustrating the first reason. When α* ≤ 0.5, the backdoor distance is consistently higher than the backdoor detectability, demonstrating the room for better detection methods (the second reason). Based upon these results, we conclude that the backdoor distance is a good indicator of the backdoor detectability with small deviations, which is again illustrated by the following observation: the Pearson correlation coefficient between the backdoor distances and detectabilities shown in Figure 2 is 0.9795, while the mean absolute difference between them is 0.0800 with a standard deviation of 0.0453.
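For concreteness, the quantities above can be reproduced from the tables with a few lines. The sketch below (our own helper, using the CIFAR10 column of Table 5 as the worked example) rescales the best detection accuracy into the detectability γ = |acc − 0.5| × 2 and correlates it with α/β.

    import numpy as np
    from scipy.stats import pearsonr

    def detectability(accuracies):
        """accuracies: best detection accuracy per setting, in [0, 1]."""
        return np.abs(np.asarray(accuracies) - 0.5) * 2

    # CIFAR10: (best accuracy across the 6 detectors, alpha/beta) per attack,
    # in the order BadNet, SIG, WB, CB, IAB, TSA.
    best_acc = [1.00, 1.00, 0.86, 1.00, 0.9825, 0.6325]
    dist     = [0.98, 0.99, 0.67, 1.00, 1.00,   0.37]

    det = detectability(best_acc)      # approximately [1.0, 1.0, 0.72, 1.0, 0.97, 0.27]
    r, _ = pearsonr(det, dist)         # correlation for this single row
    print(det.round(2), round(r, 4))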
5 Mitigation
A simple defense against backdoors with a small backdoor distance could be to discard uncertain predictions while retaining only confident ones. However, doing this obviously decreases the model accuracy on benign inputs. For example, on the MNIST dataset, if we keep only the predictions with confidence higher than 0.8 and label the rest as "unknown", the accuracy of a benign model decreases from 99.35% to 98.61%, which is far below the accuracy a benign model could achieve², considering that the mean accuracy among 200 benign models is 99.25% with a standard deviation of 0.00057. When the primary task becomes harder (e.g., ImageNet), the accuracy reduction will be more serious if this simple defense is applied.

² 11 times the standard deviation below the mean accuracy.
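The thresholding step described above is straightforward to implement; a hedged sketch follows (threshold 0.8 as in the example, model assumed to return logits), with rejected samples counted as wrong when measuring benign accuracy.

    import torch

    @torch.no_grad()
    def predict_with_rejection(model, x, threshold=0.8, unknown=-1):
        probs = torch.softmax(model(x), dim=1)
        conf, label = probs.max(dim=1)
        label[conf < threshold] = unknown   # reject uncertain predictions
        return label

    # Counting "unknown" as incorrect is what drives the benign accuracy
    # drop discussed above (e.g., 99.35% -> 98.61% on MNIST).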
Besides, backdoor unlearning methods and backdoor disabling methods might have the potential to mitigate the threat from backdoors with a small backdoor distance. However, as demonstrated by our TSA on them (Appendices 11 & 12), they exhibit only minor efficacy against these backdoors.
Inspired by our backdoor distance theorems, a detection method that considers both the differences exhibited in the inputs and in the outputs between the backdoored model and benign models would effectively reduce the evasiveness of backdoors with a small backdoor distance, which is a promising direction for developing powerful detections in the future.
6 Related Works
We proposed theorems to study the detectability of backdoors. This, in general, has also been studied in previous work [23], which proposed an approach to plant undetectable backdoors into a random feature network [60], a kind of neural network that learns only the weights on random features. Compared with classical deep neural networks, random feature networks have limited capability [85]: they cannot be used to learn even a single ReLU neuron unless the network size is exponentially larger than the dimension of the inputs. In theory, work [23] reduced the problem of detecting their backdoor to solving a Continuous Learning With Errors problem [9], which is as hard as finding approximately short vectors on arbitrary integer lattices; thus detecting their backdoor is computationally infeasible in practice. Compared with work [23], our work establishes theorems about the detectability of backdoors injected into a classical deep neural network, and demonstrates that, in this case, backdoor detectability is characterized by the backdoor distance, which is further controlled by three parameters: κ, β and S.
Based upon our theorems, we proposed an attack, the TSA backdoor attack, to inject stealthy backdoors. Compared to existing stealthy backdoor attacks [24, 6, 16, 46, 55], the TSA backdoor attack achieves lower backdoor detectability under current backdoor detections [68, 83, 48, 79, 72, 11] (demonstrated in Section 4.4) and has a theoretical guarantee under unknown detections (illustrated in Sections 4.1, 4.2 and 4.3).
Our TSA backdoor attack exploits adversarial perturbations as the trigger, which has also been exploited by the IMC backdoor attack [56]. There are three main differences between the TSA and IMC backdoor attacks: 1) TSA has theoretical guarantees on backdoor detectability, which IMC does not; 2) TSA reduces S (defined in Eq. 8), which IMC does not consider; 3) TSA reduces the difference between the outputs of the backdoored model and the benign model over the whole input space (Eq. 13), whereas IMC only reduces the difference between outputs on benign inputs (i.e., it maintains the accuracy of the backdoored model on benign inputs). In addition, there are two minor differences between them: a) TSA makes the trigger easy to learn while IMC does not; b) TSA only slightly changes the classification boundary, whereas IMC iteratively pushes the classification boundary away from its original position to seek a small trigger. We also conducted experiments comparing TSA with IMC on the CIFAR10 dataset (see Appendix 13 for details), in which TSA exhibited lower detectability than IMC.
7 Limitations and Future Works

In this work, we only studied backdoor tasks where β ≥ 1/κ, i.e., the adversary has not reduced the probability of drawing a trigger-carrying input from the backdoor distribution below the probability of drawing it from the primary distribution. However, as demonstrated in Table 1, when β < 1/κ, TSA still achieved an acceptable ASR (ASR = 79.07% when β = 0.005 < 0.046 = 1/κ), illustrating the need to extend our theorems to the β < 1/κ scenario. Using methods similar to those applied in the β ≥ 1/κ scenario, one can show that the minimal backdoor distance is still attained at β = 1/κ even when β < 1/κ, which is in line with the conclusions drawn for the β ≥ 1/κ scenario (Corollaries 5 & 6) and does not conflict with the results shown in Table 1.

In Section 5, we have only taken the first step in using our backdoor distance theorems to understand backdoor unlearning and backdoor disabling methods. Comprehensive studies are needed in the future.

Our Theorem 2 reveals that the fundamental difference between a backdoored model and a benign model comes from the difference between their joint probabilities over trigger-carrying inputs and the outputs (i.e., A(B) × Y). This implies that a good backdoor detection method should simultaneously consider the differences in the outputs and in the inputs between backdoored models and benign models, rather than considering only one of these two differences as current detection methods do. This points out a potential direction for future studies on backdoor detection.

8 Conclusion
We established theorems about the backdoor distance (similarity) and used them to investigate the stealthiness of current backdoors, revealing that they take only some of the factors affecting the backdoor distance into consideration. We therefore proposed a new approach, the TSA attack, which simultaneously optimizes those factors under a given constraint on the backdoor distance. Through theoretical analysis and extensive experiments, we demonstrated that backdoors with a smaller backdoor distance are in general harder to detect by existing backdoor defense methods. Furthermore, compared with existing backdoor attacks, the TSA attack generates backdoors that exhibit smaller backdoor distances, and thus lower detectability under current backdoor detections.
References
[1] Stylegan2-based face frontalization model.
https://github.com/ch10tang/stylegan2-b
ased-face-frontalization.
[2] Trojai competition. https://pages.nist.gov/tro
jai/.
[12] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and
Dawn Song. Targeted backdoor attacks on deep learning
systems using data poisoning. CoRR, abs/1712.05526,
2017.
[3] Trojan detection challenge. https://trojandetect
ion.ai/.
[4] Martín Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In Doina
Precup and Yee Whye Teh, editors, Proceedings of the
34th International Conference on Machine Learning,
ICML 2017, Sydney, NSW, Australia, 6-11 August
2017, volume 70 of Proceedings of Machine Learning
Research, pages 214–223. PMLR, 2017.
[13] Edward Chou, Florian Tramèr, and Giancarlo Pellegrino.
Sentinet: Detecting localized universal attacks against
deep learning systems. In 2020 IEEE Security and
Privacy Workshops, SP Workshops, San Francisco, CA,
USA, May 21, 2020, pages 48–54. IEEE, 2020.
[14] Joseph Clements and Yingjie Lao. Backdoor attacks
on neural network operations. In 2018 IEEE Global
Conference on Signal and Information Processing,
GlobalSIP 2018, Anaheim, CA, USA, November 26-29,
2018, pages 1154–1158. IEEE, 2018.
[5] Eugene Bagdasaryan and Vitaly Shmatikov. Blind backdoors in deep learning models. In Michael Bailey
and Rachel Greenstadt, editors, 30th USENIX Security
Symposium, USENIX Security 2021, August 11-13,
2021, pages 1505–1521. USENIX Association, 2021.
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pages 248–255, 2009.
[6] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A
new backdoor attack in CNNS by training set corruption
without label poisoning. In 2019 IEEE International
Conference on Image Processing, ICIP 2019, Taipei,
Taiwan, September 22-25, 2019, pages 101–105. IEEE,
2019.
[16] Khoa Doan, Yingjie Lao, and Ping Li. Backdoor attack with imperceptible input and latent modification.
Advances in Neural Information Processing Systems,
34, 2021.
[7] Shai Ben-David, John Blitzer, Koby Crammer, Alex
Kulesza, Fernando Pereira, and Jennifer Wortman
Vaughan. A theory of learning from different domains.
Mach. Learn., 79(1-2):151–175, 2010.
[17] Khoa Doan, Yingjie Lao, Weijie Zhao, and Ping Li.
Lira: Learnable, imperceptible and robust backdoor attacks. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 11966–11976,
2021.
[8] Douglas S Bridges et al. Foundations of real and
abstract analysis. Number 146. Springer Science &
Business Media, 1998.
[18] Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A
theoretical analysis of catastrophic forgetting through
the NTK overlap matrix.
In Arindam Banerjee
and Kenji Fukumizu, editors, The 24th International
Conference on Artificial Intelligence and Statistics,
AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research,
pages 1072–1080. PMLR, 2021.
[9] Joan Bruna, Oded Regev, Min Jae Song, and Yi Tang.
Continuous lwe. In Proceedings of the 53rd Annual
ACM SIGACT Symposium on Theory of Computing,
pages 694–707, 2021.
[10] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and
Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE
international conference on automatic face & gesture
recognition (FG 2018), pages 67–74. IEEE, 2018.
[19] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
[11] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian M. Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In Huáscar Espinoza, Seán Ó hÉigeartaigh, Xiaowei Huang, José Hernández-Orallo, and Mauricio Castillo-Effen, editors, Workshop on Artificial Intelligence Safety 2019 co-located with the Thirty-Third AAAI Conference on Artificial Intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, volume 2301 of CEUR Workshop Proceedings. CEUR-WS.org, 2019.
[20] Jacob Dumford and Walter J. Scheirer. Backdooring convolutional neural networks via targeted weight perturbations. In 2020 IEEE International Joint Conference on
Biometrics, IJCB 2020, Houston, TX, USA, September
28 - October 1, 2020, pages 1–9. IEEE, 2020.
[31] Xijie Huang, Moustafa Alzantot, and Mani B. Srivastava.
Neuroninspect: Detecting backdoors in neural networks
via output explanations. CoRR, abs/1911.07399, 2019.
[21] Yansong Gao, Bao Gia Doan, Zhi Zhang, Siqi Ma,
Jiliang Zhang, Anmin Fu, Surya Nepal, and Hyoungshick Kim. Backdoor attacks and countermeasures
on deep learning: A comprehensive review. CoRR,
abs/2007.10760, 2020.
[32] Arun I. and Murugesan Venkatapathi. An algorithm for
estimating volumes and other integrals in n dimensions.
CoRR, abs/2007.06808, 2020.
[33] Arthur Jacot, Clément Hongler, and Franck Gabriel.
Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M.
Wallach, Hugo Larochelle, Kristen Grauman, Nicolò
Cesa-Bianchi, and Roman Garnett, editors, Advances
in Neural Information Processing Systems 31: Annual
Conference on Neural Information Processing Systems
2018, NeurIPS 2018, December 3-8, 2018, Montréal,
Canada, pages 8580–8589, 2018.
[22] Yansong Gao, Change Xu, Derui Wang, Shiping
Chen, Damith Chinthana Ranasinghe, and Surya
Nepal. STRIP: a defence against trojan attacks on
deep neural networks. In David Balenson, editor,
Proceedings of the 35th Annual Computer Security
Applications Conference, ACSAC 2019, San Juan, PR,
USA, December 09-13, 2019, pages 113–125. ACM,
2019.
[23] Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. arXiv preprint
arXiv:2204.06974, 2022.
[34] Tero Karras, Miika Aittala, Janne Hellsten, Samuli
Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Hugo
Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, MariaFlorina Balcan, and Hsuan-Tien Lin, editors, Advances
in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems
2020, NeurIPS 2020, December 6-12, 2020, virtual,
2020.
[24] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg.
Badnets: Identifying vulnerabilities in the machine
learning model supply chain. CoRR, abs/1708.06733,
2017.
[25] Shangwei Guo, Chunlong Xie, Jiwei Li, Lingjuan Lyu,
and Tianwei Zhang. Threats to pre-trained language
models: Survey and taxonomy. CoRR, abs/2202.06862,
2022.
[35] Sara Kaviani and Insoo Sohn. Defense against neural
trojan attacks: A survey. Neurocomputing, 423:651–
667, 2021.
[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, and Dawn
Song. TABOR: A highly accurate approach to inspecting and restoring trojan backdoors in AI systems. CoRR,
abs/1908.01763, 2019.
[37] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland
Badeau, and Gustavo Rohde.
Generalized sliced
wasserstein distances. Advances in neural information
processing systems, 32, 2019.
[27] Jonathan Hayase, Weihao Kong, Raghav Somani, and
Sewoong Oh.
Spectre: Defending against backdoor attacks using robust statistics. arXiv preprint
arXiv:2104.11315, 2021.
[38] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash,
and Heiko Hoffmann.
Universal litmus patterns:
Revealing backdoor attacks in cnns.
In 2020
IEEE/CVF Conference on Computer Vision and
Pattern Recognition, CVPR 2020, Seattle, WA, USA,
June 13-19, 2020, pages 298–307. Computer Vision
Foundation / IEEE, 2020.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778, 2016.
[29] Harold Hotelling. The generalization of student’s ratio.
In Breakthroughs in statistics, pages 54–65. Springer,
1992.
[39] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[30] Sebastian Houben, Johannes Stallkamp, Jan Salmen,
Marc Schlipsing, and Christian Igel. Detection of traffic
signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference
on Neural Networks, number 1288, 2013.
[40] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick
Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[41] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz,
Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein,
and Jeffrey Pennington. Wide neural networks of any
depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32,
2019.
[50] Yunfei Liu, Xingjun Ma, James Bailey, and Feng
Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In Andrea Vedaldi,
Horst Bischof, Thomas Brox, and Jan-Michael Frahm,
editors, Computer Vision - ECCV 2020 - 16th
European Conference, Glasgow, UK, August 23-28,
2020, Proceedings, Part X, volume 12355 of Lecture
Notes in Computer Science, pages 182–199. Springer,
2020.
[42] Joseph Lee Rodgers and W Alan Nicewander. Thirteen ways to look at the correlation coefficient. The
American Statistician, 42(1):59–66, 1988.
[51] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu.
Reflection backdoor: A natural backdoor attack on deep
neural networks. In European Conference on Computer
Vision, pages 182–199. Springer, 2020.
[43] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu,
Bo Li, and Xingjun Ma. Anti-backdoor learning: Training clean models on poisoned data. In NeurIPS, 2021.
[52] Geoffrey J McLachlan. Discriminant analysis and
statistical pattern recognition. John Wiley & Sons,
2005.
[44] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia.
Backdoor learning: A survey. IEEE Transactions on
Neural Networks and Learning Systems, 2022.
[53] Geoffrey J. McLachlan. Mahalanobis distance. Resonance, 4(6):20–26, 1999.
[45] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li,
Ran He, and Siwei Lyu. Invisible backdoor attack
with sample-specific triggers.
In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 16463–16472, 2021.
[54] Kevin P Murphy. Machine learning: a probabilistic
perspective. MIT press, 2012.
[55] Tuan Anh Nguyen and Anh Tran. Input-aware dynamic
backdoor attack. Advances in Neural Information
Processing Systems, 33:3454–3464, 2020.
[46] Junyu Lin, Lei Xu, Yingqi Liu, and Xiangyu Zhang.
Composite backdoor attack for deep neural network
by mixing existing benign features. In Jay Ligatti,
Xinming Ou, Jonathan Katz, and Giovanni Vigna, editors, CCS ’20: 2020 ACM SIGSAC Conference on
Computer and Communications Security, Virtual Event,
USA, November 9-13, 2020, pages 113–131. ACM,
2020.
[56] Ren Pang, Hua Shen, Xinyang Zhang, Shouling Ji,
Yevgeniy Vorobeychik, Xiapu Luo, Alex Liu, and
Ting Wang. A tale of evil twins: Adversarial inputs versus poisoned models. In Proceedings of the
2020 ACM SIGSAC Conference on Computer and
Communications Security, pages 85–99, 2020.
[47] Tao Liu, Wujie Wen, and Yier Jin. Sin 2: Stealth infection on neural network—a low-cost agile neural trojan attack methodology. In 2018 IEEE International
Symposium on Hardware Oriented Security and Trust
(HOST), pages 227–230. IEEE, 2018.
[57] Ren Pang, Zheng Zhang, Xiangshan Gao, Zhaohan Xi,
Shouling Ji, Peng Cheng, and Ting Wang. TROJANZOO: everything you ever wanted to know about neural
backdoors (but were afraid to ask). In IEEE European
Symposium on Security and Privacy, EuroS&P 2022,
Genoa, June 6-10, 2022. IEEE, 2022.
[48] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing
Ma, Yousra Aafer, and Xiangyu Zhang. ABS: scanning neural networks for back-doors by artificial brain
stimulation. In Lorenzo Cavallaro, Johannes Kinder, XiaoFeng Wang, and Jonathan Katz, editors, Proceedings
of the 2019 ACM SIGSAC Conference on Computer
and Communications Security, CCS 2019, London, UK,
November 11-15, 2019, pages 1265–1282. ACM, 2019.
[58] Ximing Qiao, Yukun Yang, and Hai Li. Defending neural backdoors via generative distribution modeling. In
Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information
Processing Systems 32: Annual Conference on Neural
Information Processing Systems 2019, NeurIPS 2019,
December 8-14, 2019, Vancouver, BC, Canada, pages
14004–14013, 2019.
[49] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan
Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang.
Trojaning attack on neural networks. In 25th Annual
Network and Distributed System Security Symposium,
NDSS 2018, San Diego, California, USA, February
18-21, 2018. The Internet Society, 2018.
[59] Erwin Quiring and Konrad Rieck. Backdooring and
poisoning neural networks with image-scaling attacks.
In 2020 IEEE Security and Privacy Workshops, SP
Workshops, San Francisco, CA, USA, May 21, 2020,
pages 41–47. IEEE, 2020.
[60] Ali Rahimi and Benjamin Recht. Random features
for large-scale kernel machines. Advances in neural
information processing systems, 20, 2007.
conference on computer vision and pattern recognition,
pages 1–9, 2015.
[71] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever,
Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob
Fergus. Intriguing properties of neural networks. arXiv
preprint arXiv:1312.6199, 2013.
[61] Adnan Siraj Rakin, Zhezhi He, and Deliang Fan. TBT:
targeted neural network attack with bit trojan. In
2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, CVPR 2020, Seattle, WA, USA,
June 13-19, 2020, pages 13195–13204. Computer Vision Foundation / IEEE, 2020.
[72] Di Tang, XiaoFeng Wang, Haixu Tang, and Kehuan
Zhang. Demon in the variant: Statistical analysis of dnns
for robust backdoor contamination detection. In Michael
Bailey and Rachel Greenstadt, editors, 30th USENIX
Security Symposium, USENIX Security 2021, August
11-13, 2021, pages 1541–1558. USENIX Association,
2021.
[62] Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
U-net: Convolutional networks for biomedical image
segmentation. In International Conference on Medical
image computing and computer-assisted intervention,
pages 234–241. Springer, 2015.
[73] Sebastian Thrun. A lifelong learning perspective for
mobile robot control. In Intelligent robots and systems,
pages 201–214. Elsevier, 1995.
[63] Peter J Rousseeuw. Silhouettes: a graphical aid to the
interpretation and validation of cluster analysis. Journal
of computational and applied mathematics, 20:53–65,
1987.
[74] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen
Grauman, Nicolò Cesa-Bianchi, and Roman Garnett,
editors, Advances in Neural Information Processing
Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December
3-8, 2018, Montréal, Canada, pages 8011–8021, 2018.
[64] Jonathan J Ruel and Matthew P Ayres. Jensen’s inequality predicts effects of environmental variation. Trends
in Ecology & Evolution, 14(9):361–366, 1999.
[65] Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma,
and Yang Zhang. Dynamic backdoor attacks against
machine learning models. In 2022 IEEE 7th European
Symposium on Security and Privacy (EuroS&P), pages
703–718. IEEE, 2022.
[75] Alexander Turner, Dimitris Tsipras, and Aleksander
Madry. Label-consistent backdoor attacks. CoRR,
abs/1912.02771, 2019.
[66] Esha Sarkar, Hadjer Benkraouda, and Michail Maniatakos. Facehack: Triggering backdoored facial recognition systems using facial characteristics. CoRR,
abs/2006.11623, 2020.
[76] Cédric Villani. Optimal transport: old and new, volume
338. Springer, 2009.
[67] Giorgio Severi, Jim Meyer, Scott E. Coull, and Alina
Oprea. Explanation-guided backdoor poisoning attacks against malware classifiers. In Michael Bailey
and Rachel Greenstadt, editors, 30th USENIX Security
Symposium, USENIX Security 2021, August 11-13,
2021, pages 1487–1504. USENIX Association, 2021.
[77] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li,
Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium
on Security and Privacy, SP 2019, San Francisco, CA,
USA, May 19-23, 2019, pages 707–723. IEEE, 2019.
[68] Guangyu Shen, Yingqi Liu, Guanhong Tao, Shengwei
An, Qiuling Xu, Siyuan Cheng, Shiqing Ma, and Xiangyu Zhang. Backdoor scanning for deep neural
networks through k-arm optimization. arXiv preprint
arXiv:2102.05123, 2021.
[78] Jie Wang, Ghulam Mubashar Hassan, and Naveed
Akhtar. A survey of neural trojan attacks and defenses
in deep learning. CoRR, abs/2202.07183, 2022.
[79] Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. Practical detection of trojan
neural networks: Data-limited and data-free cases. In
Proceedings of the European Conference on Computer
Vision (ECCV), 2020.
[69] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
[70] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Proceedings of the IEEE
[80] Dongxian Wu and Yisen Wang. Adversarial neuron
pruning purifies backdoored deep models. Advances in
Neural Information Processing Systems, 34, 2021.
[81] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN inversion: A survey. CoRR, abs/2101.05278, 2021.

[82] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, and Bo Li. Detecting AI trojans using meta neural analysis. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pages 103–120. IEEE, 2021.

[83] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, and Bo Li. Detecting AI trojans using meta neural analysis. In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021, pages 103–120. IEEE, 2021.

[84] Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y. Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 2041–2055, 2019.

[85] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems, 32, 2019.

[86] Kota Yoshida and Takeshi Fujino. Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks. In Jay Ligatti and Xinming Ou, editors, AISec@CCS 2020: Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, Virtual Event, USA, 13 November 2020, pages 117–127. ACM, 2020.

[87] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

9 Appendix of Details of Estimation of κ

Estimation of κ_Pr. In Section 3.3, we described how to calculate Pr(x) for a given input x. Specifically, our implementation is based on the GAN inversion tools in the official repository of [34]. The original code of [34] can only recover an input's style parameters, which are actually projections of z through a transformation network. Thus, we modified the original code to directly recover z.

After computing Pr(x), it is still not trivial to get the expectations E_{Pr(x|x∈B)} Pr(G⁻¹(x)) and E_{Pr(x|x∈A(B))} Pr(G⁻¹(x)), due to the poor precision of computing tiny numbers (e.g., 1e−100). Thus, we instead compute the logarithm of the ratio, i.e.,

    ln(κ_Pr) = ln( E_{Pr(x|x∈B)} Pr(G⁻¹(x)) / E_{Pr(x|x∈A(B))} Pr(G⁻¹(x)) ).

Furthermore, we observed that z_x = G⁻¹(x) follows a Gaussian distribution for both x ∈ B and x ∈ A(B). Combining this with the fact that Pr(z_x) ∝ ‖z_x‖₂², we get that

    ln(κ_Pr) = −½ ( µ_B² / (σ_B² + 1) − µ_{A(B)}² / (σ_{A(B)}² + 1) ),

where we assume z_x ∼ N(µ_B, σ_B) for x ∈ B and z_x ∼ N(µ_{A(B)}, σ_{A(B)}) for x ∈ A(B). Empirically, we sampled 100 points x in B and 100 points x in A(B) to estimate the mean and the variance of the corresponding z_x.

Estimation of κ_V. In Section 3.3, we described κ_V = Ext(B)/Ext(A(B)) and briefly introduced how to calculate Ext(B). Specifically, for a randomly selected origin x ∈ B, we sampled a set of 256 other inputs {x₁, x₂, ..., x₂₅₆}. We then generated a set of 256 random directions from x as {(x₁ − x)/‖x₁ − x‖₂, (x₂ − x)/‖x₂ − x‖₂, ..., (x₂₅₆ − x)/‖x₂₅₆ − x‖₂}, and along each direction we used a binary search algorithm to find the extent (i.e., how far the origin x is from the boundary). Take a source-specific backdoor (with source class 1 and target class 0) as an example. We used a benign model f_P and a backdoored model f_b to detect the boundary: f_P(x) = 1 and f_b(x) = 1 ⇔ x ∈ B; f_P(x) = 1 and f_b(x) = 0 ⇔ x ∈ A(B). Finally, for computing Ext(B), we randomly selected 32 different origins and set Ext(B) to the average extent among those computed from these origins. The same method was also used to compute Ext(A(B)).
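The extent estimation above is easy to restate procedurally. Below is a minimal sketch under our own assumptions (not the authors' released code): inputs are flattened to vectors, in_region stands in for the membership test built from f_P and f_b described above, samples holds the available inputs of the region, and t_max is an assumed upper bound on the distance to the boundary.

    import numpy as np

    def extent_from_origin(origin, others, in_region, n_dirs=256, tol=1e-3, t_max=10.0):
        # Directions from the origin toward other sampled inputs, as in the text.
        idx = np.random.choice(len(others), n_dirs, replace=False)
        dirs = others[idx] - origin
        norms = np.linalg.norm(dirs, axis=1, keepdims=True)
        keep = norms[:, 0] > 0
        dirs = dirs[keep] / norms[keep]
        extents = []
        for d in dirs:
            lo, hi = 0.0, t_max              # assumes the boundary lies within t_max
            while hi - lo > tol:             # binary search for the boundary
                mid = (lo + hi) / 2
                if in_region(origin + mid * d):
                    lo = mid
                else:
                    hi = mid
            extents.append(lo)
        return float(np.mean(extents))

    def estimate_ext(samples, in_region, n_origins=32):
        origins = samples[np.random.choice(len(samples), n_origins, replace=False)]
        return float(np.mean([extent_from_origin(o, samples, in_region) for o in origins]))

    # kappa_V = estimate_ext(samples_B, in_B) / estimate_ext(samples_AB, in_AB)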
10 Appendix of Proofs

10.1 Proof of Proposition 1
The inequality 0 ≤ d_{H−W1}(D, D′) ≤ 1 is obvious, and thus we omit the proof. Next, we focus on proving d_{W1}(D, D′) ≤ d_{H−W1}(D, D′) and d_{H−W1}(D, D′) = ½ d_H(D, D′).

Let us recall the definitions of the Wasserstein-1 distance and the H-divergence:

• Wasserstein-1 distance: Assuming Π(D, D′) is the set of joint distributions γ(x, x′) whose marginals are D and D′, respectively, the Wasserstein-1 distance between D and D′ is

    d_{W1}(D, D′) = inf_{γ∈Π(D,D′)} E_{(x,x′)∼γ} ‖x − x′‖₂.    (14)
• H-divergence: Given two probability distributions D and D′ over the same domain X, we consider a hypothetical binary classification on X: H = {h : X ↦ {0, 1}}, and denote by I(h) the set for which h ∈ H is the characteristic function, i.e., x ∈ I(h) ⇔ h(x) = 1. The H-divergence between D and D′ is

    d_H(D, D′) = 2 sup_{h∈H} |Pr_D(I(h)) − Pr_{D′}(I(h))|.    (15)
Proof. Let’s consider two Lemmas first.
Lemma 10. Supposing there are two sets of n numbers:
n
{u1 , u2 , ..., un } and {v1 , v2 , ..., vn }, if ∀ai ≥ 0, ∀bi ≥ 0, ∑ ui =
1 and ∑ vi = 1, for a number K ≥ 1, we have
i=1
n
From the definition of H -divergence (Eq. 15), one can
directly obtain that dH −W 1 (D , D ′ ) = 21 dH (D , D ′ ), using the
fact that max({0, 1}) = max([0, 1]) = 1 and min({0, 1}) =
min([0, 1]) = 0.
Referring to the prior work [4], we apply the KantorovichRubinstein duality [76] to transform Eq. 14 into its dual form:
dˆW 1 (D, D′ ) = max [EPrD h(x) − EPrD ′ h(x)],
i=1
A,B ,t
Proof. First, we have following equations.
A,B ,t
Now, we use above two Lemmas to prove Theorem 4. First,
we denote u(x, y) as
=
=
(x,y)∈A(B )×Y
R
(x,y)∈C−
R
PrDA,B ,t (x, y) −
R
(x,y)∈C−
(x,y)∈C−
(x,y)∈A(B )×Y
A,B ,t
PrDA,B ,t (y|x) and v(x, y) as
(x,y)∈A(B )×Y
(x,y)∈A(B )×Y
According to Lemma 3, we have
=
dH −W 1 (DP , DA,B ,t )
R
max{ Z β u(x, y) − κ1 v(x, y), 0}
Pr(B )
=
1
κ
PrDP (x, y)
β
ZA,B ,t
get
PrDA,B ,t (x, y))
(x, y) − PrDP (x, y), 0)
≥
1
κ
A,B ,t
(x,y)∈A(B )×Y
Pr(B )
R
max{Ku(x, y) − v(x, y), 0},
(x,y)∈A(B )×Y
where we set K =
(PrDA,B ,t (x, y) − PrDP (x, y))
max(PrD ′
(x)
(A(B ))
PrDP (y|x). Apperently, u(x, y) ≥ 0 and v(x, y) ≥ 0. BeR
R
sides,
u(x, y) = 1 and
v(x, y) = 1.
We split A(B ) × Y = C+ ∪ C− where C+ = {(x, y) :
PrDP (x, y) ≥ PrDA,B ,t (x, y)} and C− = {(x, y) : PrDP (x, y) <
PrDA,B ,t (x, y)}.
R
A,B ,t
A,B ,t
h(x,y) (x,y)∈A(B )×Y
dH −W 1 (DP , DA,B ,t )
= (1 − Z 1 )(1 − Pr(A(B ))) + Pr(A(B )) −
PrD
PrD
Pr(x)
Pr(A(B ))
dH −W 1 (DP , DA,B ,t )
= (1 − Z 1 )(1 − Pr(A(B )))
A,B ,t R
+ max (
h(x, y)(PrDP (x, y) − PrDA,B ,t (x, y))).
(
(18)
Since κ1 ≤ β, we have κβ ≥ 1. Besides, since κ ≥ 1, we have
κ − Pr(B ) > 0. Putting them together, we have κβ ≥ ZA,B ,t ,
or Z β ≥ κ1 as desired.
Proof. Supposing ZA,B ,t = (x,y) P(x, y) where P(x, y) is defined in Eq. 6, A(·) is the trigger function, B is the backdoor
region and t is the target label, we have:
−
κβ − ZA,B ,t
1 2
κ (κ β − κβ Pr(B ) − κ + Pr(B ))
1
κ (κ − Pr(B )(κβ − 1)
=
=
R
A,B ,t
(17)
Lemma 11. Supposing κ ≥ 1, κ1 ≤ β ≤ 1 and ZA,B ,t = 1 −
β
1
≥ κ1 .
κ Pr(B) + β Pr(B), we have Z
Proof of Theorem 2
R
i=1
Proof. One can easily get the desired through max{Kui −
vi , 0} ≥ max{ui − vi , 0} + (K − 1)ui and max{Kui − vi , 0} ≤
Kui
where khkL ≤1 represents all 1-Lipschitz functions h : X 7→ R.
Notice that, without loss of generality, we assume X = [0, 1]n
where n is the dimension of the input. Thereby, we can further
assume that h(x) ∈ [0, 1] which will not change the maximum
value of Eq. 16. Under the above assumptions, comparing with
dH −W 1 (Eq. 2), dW 1 (Eq. 16) is additionally constrained by
that h should be a 1-Lipschitz function. Hence, dW 1 (D , D ′ ) ≤
dH −W 1 (D , D ′ ) as we desired.
10.2
n
K ≥ ∑ max{Kui − vi , 0} ≥ (K − 1) + ∑ max{ui − vi , 0}.
(16)
khkL ≤1
i=1
n
(15)
h∈H
Proof of Theorem 4
β
1
ZA,B ,t / κ .
According to Lemma 11, we have
and, thus, K ≥ 1. Applying Lemma 10, we further
≥
dH −W 1 (DP , DA,B ,t )
1
κ Pr(B )((K − 1) +
=
1
κ
R
(x,y)∈A(B )×Y
Pr(B )((K − 1) + S).
max{u(x, y) − v(x, y), 0})
β
1
ZA,B ,t / κ
Taking K =
β
(Z
A,B ,t
−
Apparently, r1 <
into the last equation, we have
1
κ (1 − S)) Pr(B )
≤ dH −W 1 (DP , DA,B ,t )
A,B ,t
0 when κ = β1 . In this case, dH −W 1 (DP , DA,B ,t ) reaches the
minimum value βS Pr(B ) as desired.
A,B ,t
10.6
Proof of Corollary 5
Proof. After calculation, we get that the derivative of
w.r.t. β is
1
(1− κ1
2
ZA,
B ,t
Proof of Lemma 8
Proof. Supposing fP = cP ◦ gP and fb = cb ◦ gb , we have
β
ZA,B ,t
β
α2 = ( mL
∑ ∑ max(gP (x)y − gb (x)y , 0))2
Pr(B )). Since κ ≥ 1, we have κ1 Pr(B ) ≤
x∈X y∈Y
1. Thus, the derivative is non-negative, which indicates that
β
increases along with the increasing β. Considering that
Z
≤
A,B ,t
1
κ
≤
κ Pr(B )
when β = κ1 , and achieves the upper-bound κ+κ Pr(
B )−Pr(B )
when β = 1. Taking these results into Theorem 4, we get
what’s desired.
≤
β ∈ [ κ1 , 1], we obtain that
10.5
β
ZA,B ,t
achieves the lower-bound
=
≤
Proof. Let’s
first consider the
upper-bound of
β
dH −W 1 (DP , DA,B ,t ), that is Pr(B ) Z
according to
A,B ,t
Theorem 4. After calculation, we get that the derivative of
β
< 0. Thus, Z β increases monotonously
w.r.t. κ is Z−β
2
Z
A,B ,t
A,B ,t
10.7
along with the decreasing of κ. Considering κ1 ≤ β, we obtain
that Z β reaches its upper-bound β when κ = β1 , and thus
A,B ,t
dH −W 1 (DP , DA,B ,t ) ≤ β Pr(B )
Let’s now consider the lower-bound of dH −W 1 (DP , DA,B ,t ),
that is Pr(B )( Z β − κ1 (1−S)) according to Theorem 4. How-
ZA,B ,t
After calculation, we get the derivative of
κ is
β
ZA,B ,t
∑ kgP (x) − gb (x)k22
x∈X
β2
2
mL kφ(X)(ωb − ωP )k2
β2
2
2
mL kφ(X)k2 kωb − ωP k2 .
Proof of Lemma 9
=
≤
nP ntg
2
nP +ntg dM (mP , mb )
nP ntg
2
nP +ntg λmax kmP − mb k2 ,
p
where dM (mP , mb ) = (mP − mb )T Σ−1 (mP − mb ) is the Mahalanobis distance [53] and λmax is the largest eigenvalue of
Σ−1 .
Next, we demonstrate that kmP − mb k2 ≤ dH −W 1 (NP , Nb )
when NP = N (mP , σ) and Nb = N (mb , σ). Actually, if
kmP − mb k2 = dW 1 (NP , Nb ), we can easily prove the inequality according to Proposition 1 that illustrates dW 1 (NP , Nb ) ≤
dH −W 1 (NP , Nb ). Next, we strictly prove kmP − mb k2 =
dW 1 (NP , Nb ).
According to the Jensen’s inequality [64], Ekx − x′ k2 ≥
kE(x − x′ )k2 = kmP − mb k2 . Thus dW 1 (NP , Nb ) ≥ kmP −
mb k2 . Again by Jensen’s inequality, (Ekx − x′ k2 )2 ≤ Ekx −
x′ k22 . Thus, dW 1 (NP , Nb ) ≤ dW 2 (NP , Nb ) where dW 2 (·, ·) is
the Wasserstein-2 distance. As proved in paper [19], the
Wasserstein-2 distance between two normal distribution can
be calculated by:
A,B ,t
− κ1 = 0 when κ = β1 .
β2
mL
T2
A,B ,t
β
∑ ∑ |gP (x) − gb (x)|)2
x∈X y∈Y
( mβ ∑ √1L kgP (x)y − gb (x)y k2 )2
x∈X
Proof. Specifically, a test statistic T 2 is calculated as
ever, κ1 S is o( κ1 ) when dH −W 1 (DP , DA,B ,t ) becomes close to
its lower-bound, since S → 0 in this case. Thus, we only need
to consider the relation between Z β − κ1 and κ. Particularly,
we have that
β
( mL
The inequality of arithmetic and geometric means are used
to obtain the third and forth transformations. The cauchyschwarz inequality are used to obtain the last transformation.
After simple math, the lower-bound of kωb − ωP k2 will be
derived as what’s desired.
Proof of Corollary 6
A,B ,t
and r2 < β1 . Considering the coefficient
of the quadratic term is positive, we obtain that Z β − κ1
A,B ,t
increases monotonously along with the increasing κ when
κ ≥ β1 . This indicates that Z β − κ1 reaches its lower-bound
as desired. Similarly, we get dH −W 1 (DP , DA,B ,t ) ≤
β
Pr(B ). This completes this proof.
Z
10.4
1
β
− κ1 w.r.t.
κ2 (1+β2 Pr(B )2 +β Pr(B ))−2κ(Pr(B )+2β Pr(B )2 )+Pr(B )2
.
2
2
ZA,
B ,t κ
The denominator is strictly positive and numerator is a
quadratic function of κ. After calculation, we get its two roots
r1 and r2:
√
1− β3 Pr(B )3
r1 = β1 − β(1+β2 Pr(B )2 +β Pr(B ))
√
1+ β3 Pr(B )3
r2 = β1 − β(1+β2 Pr(B )2 +β Pr(B ))
    d²_{W2}(N_P, N_b) = ‖m_P − m_b‖²₂ + tr(Σ_P + Σ_b − 2(Σ_P Σ_b)^{1/2}).

Because Σ_P = Σ_b = Σ, as we assumed, d²_{W2}(N_P, N_b) = ‖m_P − m_b‖²₂. Thus, putting the above together, we get ‖m_P − m_b‖₂ ≤ d_{W1}(N_P, N_b) ≤ ‖m_P − m_b‖₂, which indicates ‖m_P − m_b‖₂ = d_{W1}(N_P, N_b), as desired.
Due to the fact that d_{H−W1}(D_P, D_{A,B,t}) is the maximum value among all possible separation functions, we have

    d_{H−W1}(D_P, D_{A,B,t}) ≥ d_{H−W1}(X_P, X_b) = d_{H−W1}(N_P, N_b).

Thus α ≥ ‖m_P − m_b‖₂. Taking this into our original inequality, we get T² ≤ (n_P n_tg)/(n_P + n_tg) · λ_max · α², as desired.
11 Appendix of TSA on Backdoor Unlearning
In addition to detection, the defender may also want to remove
the backdoor from an infected model, either after detecting the
model or through “blindly” unlearning the backdoor should it
indeed be present in the model.
We classify unlearning methods for backdoor removal into
two categories: targeted unlearning (for removing detected
backdoors) and “blind” unlearning.
"Blind" unlearning. Such unlearning methods can be further classified into two sub-categories: fine-tuning and robustness enhancement. The former fine-tunes a given model on benign inputs, through which Catastrophic Forgetting (CF) would be induced, so an infected model's capability to recognize the trigger may be forgotten. To study the relationship between CF and the backdoor similarity, we identify a lower bound of the task drift based on Lemma 8:

    ‖δ_{T_P → T_{A,B,t}}(X)‖₂ ≥ α √(mL) / β.    (19)
Eq 19 shows that small task drift, the measurement of CF,
requires small backdoor distance (depicted by α) between the
primary task TP and the backdoor task TA,B ,t , implying that
“blind" unlearning through fine-tuning becomes less effective
(i.e., the backdoor may not be completely forgotten) when the
backdoor distance is small.
The robustness enhancement methods aim to enlarge the robustness radius of a backdoored model f_b, within which the model prediction remains the same. Specifically, the robustness radius △(X, s) for the source label s on a set of benign inputs X can be formulated as

    △(X, s) := min_{x∈X, f_b(x)=s} △(x),  where △(x) = inf_{f_b(x+δ)≠s} ‖δ‖.

We denote by R(X, s) the set of inputs x′ within the robustness radius △(X, s):

    R(X, s) = { x′ : inf_{x∈X, f_b(x)=s} ‖x′ − x‖ < △(X, s) }.

Clearly, when △(X, s) increases, R(X, s) becomes larger. However, increasing △(X, s) is less effective for removing backdoors with small backdoor distances d_{H−W1}(D_P, D_{A,B,t}), for the following reasons. 1) When R(X, s) ∩ A(B) = ∅, apparently, the predicted labels of trigger-carrying inputs do not change. 2) When R(X, s) ∩ A(B) ≠ ∅ and A(B) \ R(X, s) ≠ ∅, a small d_{H−W1}(D_P, D_{A,B,t}) will lead to a large A(B) \ R(X, s): the more x ∈ A(B) lie close to the decision boundary, the more x ∈ A(B) fall outside R(X, s), indicating that the backdoor remains largely un-removed. This is because, during robustness enhancement, f_b is trained to push x ∈ X away from the boundary as much as possible, which amounts to overfitting by the neural network. 3) When A(B) \ R(X, s) = ∅, R(X, s) covers many inputs within the robust radius whose true label is not s, i.e., f*(x′) ≠ s, and thus the robustness enhancement will result in false predictions on these inputs, which is not desired. This is due to the irregular classification boundary of f_b, which makes the precise removal of the backdoor impossible without knowing the trigger function A. Besides, increasing △(X, s) will decrease △(X, t) for t ≠ s, which eventually results in a model making false predictions on inputs with the true label t.

Targeted unlearning. The targeted unlearning methods are guided by the triggers reconstructed by backdoor detection methods. As we demonstrated in Section 4, backdoor detection itself becomes hard when the backdoor distance is small. Therefore, the targeted unlearning methods also become less effective for backdoors with smaller backdoor distances.
12 Appendix of Backdoor Disabling

Even though a backdoor with a small backdoor distance is hard to detect and to unlearn from the target model, the defender could suppress the backdoor behaviour through backdoor disabling methods. Backdoor disabling aims to remove the backdoor behaviour of an infected model without affecting the model's predictions on benign inputs. There are mainly two kinds of methods: knowledge distillation and input preprocessing.

Knowledge distillation. In knowledge distillation, there is usually a teacher model and a student model. The knowledge distillation defense uses the distillation process to suppress the student model from learning the backdoor behaviour from the teacher model through temperature control. Specifically, following the notation used in paper [86], for the temperature T = 1 we have f_b(x)_j = softmax_{T=1}(µ_j), where

    softmax_T(µ_j) = exp(µ_j / T) / Σ_{j′∈Y} exp(µ_j′ / T).

The bigger T is, the softer the prediction result f_b(x) becomes. A high temperature (e.g., T = 20 as used by [86]) could prevent the student model from learning typical backdoors that drive the model
to generate highly confident predictions (of the target label) on trigger-carrying inputs. However, backdoors with a low backdoor distance drive the model to generate only moderate predictions for trigger-carrying inputs, which are close to the classification boundary, and these may still be learned by the student model through high-temperature knowledge distillation. Besides, the higher the temperature, the smaller the amount of knowledge that can be learned by the student model, which results in a relatively low accuracy of the student model. On the other hand, a low temperature will sharpen the classification boundary, which allows the student model to learn confident predictions from the teacher model. However, in this case, the predictions of the teacher model for trigger-carrying inputs become confident, i.e., f_b(A(x))_t is high. As a result, the student model may easily learn the backdoor, in a similar manner as learning it from a contaminated training dataset in a traditional backdoor attack when the backdoor has a large backdoor distance. Therefore, the choice of the temperature reflects a trade-off between the performance (i.e., the effectiveness of knowledge distillation) and the security (i.e., the effectiveness of backdoor disabling) of the student model. Finally, knowledge distillation can be viewed as a continual learning process from the backdoor task T_{A,B,t} to the primary task T_P. As shown in Eq. 19, it is expected that the student model will give predictions similar to the teacher model's on both clean and trigger-carrying inputs when the backdoor has a small backdoor distance.
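A small sketch of the temperature-controlled softmax defined above, showing how a high temperature (e.g., T = 20 in [86]) softens the teacher's predictions that the student then imitates; the example logits are arbitrary.

    import torch

    def softmax_T(logits: torch.Tensor, T: float) -> torch.Tensor:
        return torch.softmax(logits / T, dim=-1)

    logits = torch.tensor([4.0, 1.0, 0.5])
    print(softmax_T(logits, T=1.0))    # sharp, confident prediction
    print(softmax_T(logits, T=20.0))   # much softer distribution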
Input preprocessing. This kind of defense introduces a preprocessing module that removes the trigger contained in inputs before feeding them into the target model [44]. Accordingly, the modified triggers no longer match the hidden backdoor, thereby preventing its activation. Without knowing the details of the trigger, these methods perform preprocessing on both benign inputs and trigger-carrying inputs. Thus, these methods actually disable the backdoor based on a fundamental assumption: the trigger is sensitive to noise, and the robustness of the benign model f_P and of the backdoored model f_b on trigger-carrying inputs differs significantly. To study the robustness of backdoors with a small backdoor distance, we investigate the difference between the predictions for trigger-carrying inputs and for trigger-carrying inputs with a small added noise δ, i.e., |f_b(A(x)+δ)_t − f_b(A(x))_t|. When the backdoor distance is small, A(x) is close not only to the classification boundary of the backdoored model f_b but also to that of the benign model f_P. Intuitively, A(x)+δ is close to the classification boundary of both f_b and f_P when ‖δ‖ is small. Thus, |f_b(A(x)+δ)_t − f_b(A(x))_t| should be small. One may argue that some δ would make argmax_j f_b(A(x)+δ)_j ≠ t = argmax_j f_b(A(x))_j even if |f_b(A(x)+δ)_t − f_b(A(x))_t| is small, as a small δ could flip the predicted label of A(x), which is close to the classification boundary of f_b. However, for the same δ, the benign model would also flip the predicted label of A(x)+δ, i.e., argmax_j f_P(A(x)+δ)_j ≠ argmax_j f_P(A(x))_j, because A(x) is also close to the classification boundary of f_P. There is no reason to block an input that is predicted by the backdoored model f_b with the same label as the one predicted by a benign model f_P. If a large noise δ were added to the inputs, the performance of the deep neural networks (both f_P and f_b) would decrease. Consequently, even though the backdoor is suppressed, the benign model for the primary task would become worse, indicating a trade-off between utility and security, which will be discussed in the next section. In general, there is no significant difference between the robustness of the benign model f_P and the backdoored model f_b on trigger-carrying inputs when the backdoor distance is small. Thus, the input preprocessing methods may only moderately suppress backdoors with small backdoor distances.
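A minimal sketch of this comparison, under our own assumptions about the interfaces (f_b and f_P return logits; A_x is a batch of trigger-carrying inputs; the noise scale and trial count are illustrative), is given below. Similar, small shifts for both models indicate that input preprocessing gains little leverage against the backdoor.

    import torch

    @torch.no_grad()
    def confidence_shift(model, triggered_x, target, sigma=0.02, n_trials=8):
        base = torch.softmax(model(triggered_x), dim=1)[:, target]
        shifts = []
        for _ in range(n_trials):
            noisy = triggered_x + sigma * torch.randn_like(triggered_x)
            noisy_conf = torch.softmax(model(noisy), dim=1)[:, target]
            shifts.append((noisy_conf - base).abs().mean())
        return torch.stack(shifts).mean()   # E |f(A(x)+delta)_t - f(A(x))_t|

    # shift_b = confidence_shift(f_b, A_x, target_class)
    # shift_P = confidence_shift(f_P, A_x, target_class)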
13 Appendix of IMC Experiments
We exploited IMC to generate 200 backdoored models on CIFAR10 with its official code, which has been integrated into the TrojanZoo framework. In this experiment, the backdoors carried by these backdoored models are source-specific, with source class 1 and target class 0. We set β = 0.1, i.e., injecting 500 trigger-carrying inputs into the training set, and kept the other parameters at their default values, i.e., the trigger size is 3x3 and the transparency of the trigger is 0 (meaning the trigger is clear and has not been blurred). Table 6 reports the accuracy of the 6 detection methods in distinguishing the 200 IMC backdoored models from 200 benign models, compared with what has been obtained by the TSA attack.
Table 6: The backdoor detection accuracies (%) for IMC and TSA obtained by the six detection methods.

           K-ARM    MNTD     ABS      TND      SCAn     AC
    IMC    89.50    98.75    74.25    80.25    91.00    73.50
    TSA    59.25    51.25    51.00    48.75    63.25    55.25