1 Introduction
Anomaly detection (AD) is ubiquitous across many domains. Its applications include: intrusion detection, which refers to detecting malicious activity in a computer-related system; fraud detection, such as credit card, insurance, or tax fraud; medical anomaly detection, where anomalies are detected in medical images or clinical electroencephalography (EEG) records to diagnose or prevent diseases; anomalies in social networks, which refers to irregular and often unlawful behavior patterns of individuals in social networks, such as scammers, sexual predators, and online fraudsters; industrial AD, which refers to detecting anomalies in industrial processes that are carried out countless times; and AD in autonomous vehicles, to prevent attackers from taking over the vehicle [3, 27].
For real-world applications, labeling can be expensive and time-consuming; for this reason, AD is generally cast as an unsupervised learning problem. In this setup, learning is done directly from patterns naturally occurring in the data. A follow-up question then arises: out of a set of algorithms, how can one determine which AD algorithm is better, and in what way, without having labels? As it turns out, evaluating an algorithm’s performance in the unsupervised learning setting is not a trivial task. Nonetheless, it is of paramount importance to be able to compare AD algorithms beyond the supervised setting on toy datasets (i.e., simulated datasets that include labels), because labeling real-world datasets is not a practical or scalable solution for most domains.
An anomaly is defined as a rare observation or an observation that deviates from the norm [
3]. Mathematically, it can be described as a low-likelihood data point with respect to some unknown underlying likelihood function
\(f:\mathcal {X}\rightarrow \mathbb {R}^+\), where
\(\mathcal {X}\) is the space where the data lives. Furthermore, the set of anomalies is given by
\(a=\lbrace x:f(x)\lt \epsilon \rbrace\) for some small
\(\epsilon\). The goal of AD is to approximate
f or some form of it, especially in the regions of low likelihood. Note that if
\(g=T\circ f\) with
T being an increasing transformation, then
\(a=\lbrace x:g(x)\lt \delta \rbrace\) using an appropriate
\(\delta\) yields exactly the same
a as when using
f, which is why approximating a “form” of
f suffices.
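For illustration, the following minimal sketch (our own example, using a kernel density estimate as a stand-in for f) verifies numerically that a strictly increasing transformation of the scores, together with the correspondingly transformed threshold, flags exactly the same set of anomalies.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 0.5, 5)])  # bulk + a few anomalies

# f: a proxy likelihood, here a Gaussian kernel density estimate evaluated on the data
f = gaussian_kde(x)(x)

# g = T o f with T increasing (here T = log); the threshold is transformed accordingly
eps = np.quantile(f, 0.02)            # epsilon for f
delta = np.log(eps)                   # corresponding delta for g = log(f)
a_f = set(np.where(f < eps)[0])
a_g = set(np.where(np.log(f) < delta)[0])
assert a_f == a_g                     # same anomaly set under the monotone transform
```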
An AD algorithm (sometimes called scoring function) is a mapping
\(r:\mathcal {X}\rightarrow \mathbb {R}^+\). Suppose
r and
s are two different AD algorithms that can detect anomalies
\(a_r=\lbrace x:r(x)\lt \delta _r\rbrace\) and
\(a_s=\lbrace x:s(x)\lt \delta _s\rbrace\), then the challenge is to
meaningfully compare
r and
s in the context of AD. This meaningful comparison entails special considerations that will be discussed later on. It is important to recall that even though AD can be described mathematically, its motivation is practical, and usually there is interest in detecting only a particular subset of
a (the subset relevant to the application in question); Foorthuis et al. [
13] provide a qualitative description of different types of anomalies of interest. As a consequence, whereas mathematically one can compute a sensible distance between
f and
r, and
f and
s to determine which one is closer to
f, this is only truly meaningful if there is a high overlap between
\(a_r\) and
\(a_s\). This reasoning is further demonstrated in Figure
1.
We have distilled the problem of comparing AD algorithms in the unsupervised setting into two tasks: (1) determining to what extent the algorithms detect the same anomalies and, (2) if they do, choosing a criterion to determine which one is “better”; we use “better” to emphasize that there is no intrinsically better algorithm; instead, one chooses a criterion with suitable properties, and that criterion is used to evaluate the different AD algorithms. There has been pivotal work on task (2), such as the work by Goix et al. [
9,
15,
16], who proposed criteria to evaluate the quality of unsupervised AD algorithms based on the MV (Mass–Volume) and EM (Excess–Mass) curves. In a similar fashion, Marques et al. [
22,
23] proposed a different criterion that considers the degree of separability of each point with respect to the rest, weighted by the anomaly score. We will discuss these approaches and others in more depth in the Related Work section.
Our work focuses on task (1), which consists of defining a notion of equivalence between r and s (two AD algorithms) that is meaningful in the context of AD. Being in the context of AD is crucial because it requires the notion of proximity to have certain properties. If this were not the case, a viable notion of proximity would be the Kullback–Leibler (KL) divergence, assuming that r and s are converted into probability distributions. Therefore, task (1) consists of designing a notion of distance or equivalence that exhibits certain properties. One of them is that it is more important for two AD algorithms to agree over sets of low likelihood than over sets of high likelihood (for the KL divergence, the opposite holds, because sets of high likelihood have higher mass and hence higher weight in the calculation).
Our work is complementary to that by Goix et al. and Marques et al. [
15,
22] in the quest to evaluate unsupervised AD algorithms. More specifically, any two algorithms could first pass through our algorithm to determine whether they detect the same kind of anomalies; if so, then [
15] or [
22] provide criteria that can be used to indicate which one is “better”; if not, then further comparisons to determine which one is “better” are not meaningful. This procedure is illustrated in Figure
2, where we also include a deployment block that uses an additional model to produce an explanation of the prediction. Moreover, this line of work is worth pursuing in its own right, as it helps elucidate what each AD algorithm is doing, which is particularly useful since modern AD algorithms are often constructed as black boxes [
26].
The rest of the article is organized as follows: Section
2 presents a literature review and also lists our specific contributions in comparison with those of other works; Section
3 provides the background and the technical motivation for our algorithm; Section
4 describes the algorithm in detail; Section
5 presents and discusses the properties of our algorithm; Section
6 shows empirical evidence of the results our equivalence criterion can achieve on simulated data and on a real-world dataset of the daily energy consumption registered by electrical meters; finally, Section
7 provides the conclusions.
2 Related Work
The literature on AD is vast, as there are countless AD algorithms [
11]. In contrast, the literature on how to compare different AD algorithms is scant. Traditionally, there have been three approaches to compare AD algorithms: the first one is supervised learning on publicly available datasets that are labeled. Among these, there is the work by Hasain et al. [
17], where they deal with detecting anomalies in streaming data. Another example is [
28], where the application is cybersecurity and network intrusion in emerging technologies such as the Internet of Things (IoT), or the work in [
10], which deals with a similar problem. In this approach, comparing AD algorithms is trivial because the
receiver operating characteristic (
ROC) curve can be calculated, as well as the area under the ROC curve (AUC), which also serves as a comparison criterion. The drawback, though, is that this approach can only be utilized for applications that have publicly available labeled datasets. Even in those cases, it is necessary to assess whether the data in question has a distribution similar to that of the public dataset; e.g., fraudsters are adaptive [
7], so an outdated dataset becomes obsolete for training fraud detection algorithms.
The second approach to comparing two AD algorithms is supervised learning on simulated data, where labels can be generated. Some of the works using this approach are: Anton et al. [
2], where they deal with cybersecurity and intrusion detection; Flach et al. [
12], where they want to detect anomalies in Earth observations; or Meire et al. [
24], where they want to detect anomalies in acoustic sounds. Comparing algorithms is trivial for the same reason as in the first approach. The drawback is that this approach requires an accurate model to simulate non-anomalous and anomalous points, and for many real-world applications even state-of-the-art simulators still fail to close the gap between real-world and simulated data [
31].
Finally, the third approach is unsupervised learning, which is the case when there are no labels. This is the most useful case because it does not require any human labeling and can be used in any application; the challenge is that it is not trivial to compare different AD algorithms. In fact, despite its usefulness, this approach has been vastly understudied, as pointed out by Ma et al. [
21]. More specifically, Ma et al. found only three techniques (that we also found independently) that address the problem of algorithm comparison and evaluation in the context of AD in the unsupervised setting. Those three approaches are [
15,
22], which we mentioned previously, and [
25].
The criterion provided by Goix et al. [
15], namely the MV curve, is defined by the following equation
\(MV_s(\alpha)=\inf _{u\ge 0}\lbrace \lambda (s\ge u) \, | \, \mathbb {P}(s(X)\ge u)\ge \alpha \rbrace\), where
\(\lambda\) is the Lebesgue measure,
s is the AD algorithm, and
\(\alpha \in [0,1]\). For a given
\(\alpha\),
\(MV_s(\alpha)\) consists of finding the infimum over u of the volume \(\lambda (s\ge u)\) of the super-level set (which amounts to making u as large as possible), subject to the probability of \(s(X)\) being greater than or equal to u being at least \(\alpha\); then \(MV_s(\alpha)\) corresponds to that volume \(\lambda (s\ge u)\). If
\(MV_r(\alpha)\le MV_s(\alpha)\), then AD algorithm
r is better than
s. The EM curve has a similar construction. Intuitively, the idea is that the level sets have minimum volume, and hence a minimal MV curve, when the generating function f itself is used to generate the MV curve; therefore, as an anomaly scoring function r gets closer to f, its MV curve will decrease (we recognize this discussion is rather condensed, but an in-depth treatment is beyond the scope of this article; interested readers may check [
9]). We indirectly borrow an idea from this work, namely, that one does not have to compare the generating function f directly with the scoring function r; instead, one can compare their level sets or even their rankings.
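To make the construction concrete, the following sketch (our own simplified estimator for 1-D data, not the implementation of [15]) approximates the MV curve of a scoring function by estimating \(\mathbb{P}(s(X)\ge u)\) empirically and \(\lambda(s\ge u)\) by Monte Carlo sampling over a bounding interval; note that in this construction a higher value of s corresponds to a more normal point, matching the definition above.

```python
import numpy as np

def mv_curve(score_fn, X, alphas, n_mc=20000, seed=0):
    """Empirical Mass-Volume curve of a scoring function s on a 1-D dataset X,
    following MV_s(alpha) = inf_u { lambda(s >= u) : P(s(X) >= u) >= alpha }."""
    rng = np.random.default_rng(seed)
    s_data = score_fn(X)                             # used to estimate P(s(X) >= u)
    lo, hi = X.min(), X.max()
    s_unif = score_fn(rng.uniform(lo, hi, n_mc))     # uniform samples to estimate volumes
    mv = []
    for alpha in alphas:
        u = np.quantile(s_data, 1 - alpha)           # largest u with P(s(X) >= u) >= alpha
        mv.append((hi - lo) * np.mean(s_unif >= u))  # Monte Carlo estimate of lambda(s >= u)
    return np.array(mv)
```

Two scoring functions can then be compared through their MV curves: by the criterion above, the one with the lower curve is the better one.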
The criterion proposed by Marques et al. [
22] to compare AD algorithms is named IREOS (Internal, Relative Evaluation of Outlier Solutions). It estimates how separable each point is from all the other points by using a maximum-margin classifier, which in their case is a nonlinear SVM; this separability is then weighted by the anomaly score of the corresponding point. If the points with the largest margins have the highest anomaly scores, then IREOS will be the highest possible. The higher the IREOS, the better the scoring function. Nguyen et al. [
25] propose different indexes whose underlying idea is that normal data should form one big cluster whereas anomalies will form a smaller cluster. They focus on two concepts: compactness and separability. Therefore, according to them, a good AD algorithm should separate the data into compact clusters that are far from each other. One of their indexes, the root-mean-square standard deviation (RMSSTD), is given by the formula \(RMSSTD=\Big (\big (\sum _{i=1}^{NC}\sum _{x\in C_i} ||x-c_i||^2\big)\big /\big (P\sum _{i=1}^{NC}(n_i-1)\big)\Big)^{1/2}\), where
NC is the number of clusters,
\(C_i\) is the
ith cluster,
\(c_i\) is the center of
\(C_i\),
\(n_i\) is the number of objects in
\(C_i\),
P is the number of attributes in the dataset, and
x is the datapoint. The numerator measures how compact each cluster is by summing the squared distances \(||x-c_i||^2\) over the points in each cluster and then over clusters; the denominator is a normalization constant. Of those three approaches, Ma et al. [
21] suggest that none of them is useful in practice, as they select models only comparable to a state-of-the-art detector with random hyperparameters. While we find that claim arguable, because the AD models were not tested as to whether they detect the same anomalies or not, it is clear that significant improvement is needed in this area.
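As a concrete illustration, the following minimal sketch computes the RMSSTD formula above for a hypothetical clustering (the data and labels are made up for the example):

```python
import numpy as np

def rmsstd(X, labels):
    """Root-mean-square standard deviation of a clustering.
    X:      (N, P) data matrix with P attributes.
    labels: cluster assignment for each of the N points."""
    P = X.shape[1]
    num, denom = 0.0, 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        center = cluster.mean(axis=0)                              # c_i
        num += np.sum(np.linalg.norm(cluster - center, axis=1) ** 2)
        denom += P * (len(cluster) - 1)                            # P * (n_i - 1)
    return np.sqrt(num / denom)

# Hypothetical usage: one big "normal" cluster and a small "anomalous" cluster
X = np.vstack([np.random.normal(0, 1, (200, 2)), np.random.normal(6, 0.3, (10, 2))])
labels = np.array([0] * 200 + [1] * 10)
print(rmsstd(X, labels))
```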
A research area, relevant to AD, that has received significant attention in the last few years is
Explainable Artificial Intelligence (
XAI) [
6]. XAI seeks to equip opaque models with explanations or interpretations that are understood and accepted by a human audience [
6,
8]; this process is known as post-hoc explainability [
6]. The definition given by Arrieta et al. [
6] reads as follows: “post-hoc explainability techniques aim at communicating understandable information about how an already developed model produces its predictions for any given input”. XAI is important for AD because, many times, flagging a data point as an anomaly is not enough; an explanation of why the point was flagged is also needed.
An example of this is the work by Kim et al. [
20]. Their work is on AD for maritime engines. For this application, knowing that there is an anomaly is not sufficient; instead, one needs to know which sensor is triggering the anomaly. In their work, they use traditional AD algorithms and equip them with XAI models that generate an explanation. Another example is the work by Gnoss et al. [
14], which is on the detection of erroneous or fraudulent business transactions and the corresponding journal entries; in this case too, when an anomaly is detected, an explanation is needed. Furthermore, there is work such as that by Barbado et al. [
4], whose application is AD for fuel consumption in fleet vehicles. Their work not only generates an explanation of why a point was labeled an anomaly, but also provides a counterfactual recommendation, that is, it indicates what could have been done differently with that vehicle to turn it into an inlier. All the works presented in this paragraph employ post-hoc explainability techniques. There are many more examples of AD applications that have incorporated XAI; a more comprehensive list can be found in the review by Yepmo et al. [
30].
Another approach to having AD algorithms with explanations is to make the AD algorithms transparent, that is, explainable by themselves; a typical example of a transparent algorithm is linear regression [
6]. In this category, Alvarez-Melis et al. [
1] propose self-explaining neural networks. A self-explaining neural network is a neural network that behaves smoothly; as a result, it can be approximated by a linear function at any point. Alvarez-Melis et al. enforce the smoothness by adding the term
\(\mathcal {L}_\theta =||\nabla _x f(x)-\theta (x)^TJ_x^h(x)||\) to the loss function, where
f is the neural network function,
x is the datapoint,
h is a function that maps from input space to feature space,
\(J_x^h(x)\) is the Jacobian of
h with respect to
x, and
\(\theta (x)\) refers to the parameters of the linear approximation at point
x. If
\(\mathcal {L}_\theta =0\), then the model can be locally approximated by a linear function, thus making the model explainable. A different approach with the same underlying idea was taken by Barbado et al. [
5]. Their approach consists of taking a trained one-class SVM and extracting rules from it. The rules are obtained by partitioning the input space into hypercubes that separate anomalous points from non-anomalous points; in other words, the rules are of the form: if
x (the datapoint) is inside a hypercube that contains inliers, then
x is an inlier. Then, these rules are used for prediction and also serve as explanations.
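As an illustration of how the regularization term \(\mathcal {L}_\theta\) described above could be computed in practice, consider the following minimal sketch (our own, not the authors’ implementation); here f, h, and theta are assumed to be differentiable PyTorch callables returning, for a single input, the scalar prediction, the feature vector, and the local linear coefficients, respectively.

```python
import torch

def senn_robustness_term(f, h, theta, x):
    """||grad_x f(x) - theta(x)^T J_x^h(x)|| for a single input x (1-D tensor)."""
    x = x.detach().clone().requires_grad_(True)
    grad_f = torch.autograd.grad(f(x).squeeze(), x, create_graph=True)[0]   # shape (d,)
    J_h = torch.autograd.functional.jacobian(h, x, create_graph=True)       # shape (k, d)
    return torch.norm(grad_f - theta(x) @ J_h)   # add this term to the training loss
```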
Above, we presented some examples of AD algorithms equipped with post-hoc XAI techniques, as well as transparent AD algorithms. Our work and post-hoc techniques operate in a complementary fashion. To see this, consider the definition of post-hoc techniques presented above. From that definition, we can notice two differences between our work and post-hoc techniques. First, our work does not say anything about
how the models produce predictions, rather our work deals with
what predictions one model produces in contrast to another model; second, our work does not require a human audience to interpret explanations. Note that the first and second differences come together, because if there is no explanation, then there is no need for an explainee. Another advantage of not using explanations to analyze a model is that one does not have to compare different explanations for the same model to determine which explanation is better. In fact, this is one weakness of XAI, namely, there is no consensus on, nor a rigorous mathematical definition of, what constitutes a good explanation [
6,
8]. Although there has been progress, such as the work by Hoffman et al. [
18] that proposes metrics to quantify the quality of explanations, Arrieta et al. still conclude that more quantifiable, general XAI metrics are needed to support the existing measurement procedures and tools proposed by the community [
6].
To sum up, our work and XAI try to achieve the same general goal, namely, to obtain insights from black-box models, but they do so in different ways. XAI does so by generating explanations that have to be interpreted by a human, whereas our work directly answers the question of whether two AD algorithms are equivalent without generating explanations. XAI and our work can operate jointly, as illustrated in Figure
2. In fact, if two algorithms are equivalent as given by our work, then those two algorithms should have the same explanations; the converse should also hold, that is, if two algorithms, for every data point, have the same explanation, then they should be equivalent.
In this work, we propose, to the best of the authors’ knowledge, the first equivalence criterion for AD algorithms. More concretely, our work aims to determine to what degree two AD algorithms (scoring functions) are detecting the same kind of anomalies, which is different from designing a criterion to determine which AD algorithm is better. An equivalence measure is crucial because it only makes sense to determine “better” algorithms within a set of AD algorithms that detect the same kind of anomalies; otherwise, they are simply different, and the search for a “better” AD algorithm becomes superfluous.
3 Background
3.1 Problem Setup
The goal of this work is to develop a criterion of equivalence between two anomaly scores that matches intuition. While it is not clear a priori what this criterion should be, a set of desirable properties can be established and then used as a heuristic in its design.
First, let us introduce some notation: \(\mathcal {C}(\mathbf {r},\mathbf {s})\) is shorthand notation for \(\mathcal {C}(r(\mathcal {X}),s(\mathcal {X}))\), where \(\mathcal {X}\) is a dataset with n data points, and \(\mathcal {C}\) is the equivalence criterion; in our notation r refers to the scoring function itself and \(\mathbf {r}\) (in boldface) refers to the scoring function evaluated at a dataset, thus resulting in a set with n elements that we will call anomaly score. We are interested in measuring a distance between \(\mathbf {r}\) and \(\mathbf {s}\), that is, the anomaly scores, rather than r and s, that is, the scoring functions. Consequently, \(\mathcal {C}(\mathbf {r},\mathbf {s})\) is only meaningful if the dataset contains enough points, some of which have to be anomalies.
In our case, we consider an equivalence criterion that can be interpreted as a correlation. Specifically, it is defined as follows: let \(\mathcal {C}:\mathcal {A}\times \mathcal {A}\rightarrow [-1,1]\), where \(\mathcal {A}\) is the space of anomaly scores and \(\mathcal {C}\) is the equivalence criterion. By definition if \(\mathcal {C}(\mathbf {r},\mathbf {s})=1\), then \(\mathbf {r}\) and \(\mathbf {s}\) are equivalent; if \(\mathcal {C}(\mathbf {r},\mathbf {s})=0\), then \(\mathbf {r}\) and \(\mathbf {s}\) are uncorrelated, i.e., \(\mathbf {r}\) looks completely random with respect to \(\mathbf {s}\); if \(\mathcal {C}(\mathbf {r},\mathbf {s})=-1\), then \(\mathbf {r}\) and \(\mathbf {s}\) are inversely correlated, e.g., \(\mathbf {r}(x)=x\) and \(\mathbf {s}(x)=-x\). We will refer to \(\mathcal {C}(\mathbf {r},\mathbf {s})\) as the equivalence criterion value, which is the numerical value obtained by evaluating \(\mathcal {C}\) at \(\mathbf {r}\) and \(\mathbf {s}\).
Some of the fundamental properties any equivalence criterion should have are: (a)
\(\mathcal {C}(\mathbf {r},\mathbf {r})=1\); (b)
\(\mathcal {C}(\mathbf {r},\mathbf {s})=\mathcal {C}(\mathbf {s},\mathbf {r})\); (c)
\(\mathcal {C}(\mathbf {r},-\mathbf {r})=-1\). We will discuss more properties in Section
3.4.
3.2 Simple Equivalence Criterion (SEC)
Note that if we let
\(g=q-f\) for some constant
q to guarantee
g is nonnegative, then the set of anomalies is identical to before, but it is now given by
\(a=\lbrace x:g(x)\gt \delta \rbrace\) with \(\delta =q-\epsilon\). From now on, we assume that the higher the score, the more anomalous the point is, and we consider sets of the form
\(\lbrace x:g(x)\gt \delta \rbrace\), also known as super-level sets. From this definition, a natural starting point for an equivalence criterion between
\(\mathbf {r}\) and
\(\mathbf {s}\) is
\(\hat{\sigma }(\mathbf {r},\mathbf {s})=\sum _{i=1}^{n}\sum _{j=1}^{n}\left|\lbrace x:r(x)\gt \delta _i^{(r)}\rbrace \cap \lbrace x:s(x)\gt \delta _j^{(s)}\rbrace \right|,\qquad (1)\)
where
\(\delta _i\gt \delta _j\) if
\(i\gt j\) and
\(r_i\ne r_j\) for
\(i\ne j\), and
\(|\cdot |\) represents the order of a set, that is, the number of elements in the set.
Intuitively, \(\hat{\sigma }\) compares how similar all the super-level sets of \(\mathbf {r}\) are to all the super-level sets of \(\mathbf {s}\), which is really a comparison between the anomalies detected by \(\mathbf {r}\) and \(\mathbf {s}\). Comparing all super-level sets, as opposed to only some, is advantageous because it produces a robust equivalence criterion: even if \(\mathbf {r}\) and \(\mathbf {s}\) differ slightly, there will be super-level sets that account for that difference. If \(\mathbf {r}=\mathbf {s}\), then \(\hat{\sigma }(\mathbf {r},\mathbf {s})\) will be maximal; in contrast, if \(\mathbf {r}=-\mathbf {s}\), then \(\hat{\sigma }(\mathbf {r},\mathbf {s})\) will be minimal, although not zero.
If a data point has rank
k in
\(\mathbf {r}\) and
l in
\(\mathbf {s}\), then that data point will contribute
\(k\cdot l\) times to the sum, namely, to the super-level sets at thresholds
\(\delta _1^{(r)},\delta _2^{(r)},\ldots ,\delta _k^{(r)}\) crossed with
\(\delta _1^{(s)},\delta _2^{(s)},\ldots ,\delta _l^{(s)}\). Moreover, since we want a correlation-like equivalence criterion, we need to include a normalization factor to set the maximum equal to 1, which all together yields Equation (
2):
\(\sigma _n(\mathbf {r},\mathbf {s})=\dfrac{\hat{\sigma }(\mathbf {r},\mathbf {s})}{\sum _{i=1}^{n}i^{2}}=\dfrac{\sum _{i=1}^{n}\hat{r}_i\hat{s}_i}{\sum _{i=1}^{n}i^{2}},\qquad (2)\)
where
\(\hat{\mathbf {r}}\) is the ranking of data points given by
\(\mathbf {r}\). Recall that the data point ranked one has the lowest score, hence it is the least anomalous (according to
\(\mathbf {r}\)); conversely, the data point ranked last has the highest score and hence is the most anomalous (again, according to
\(\mathbf {r}\)). If we consider
\(\sigma _n\) as a random variable with the rankings,
\(\hat{\mathbf {r}}\) and
\(\hat{\mathbf {s}}\), being uniformly distributed across all permutations, then
\(\mathbb {E}[\sigma _n]=3/4\) (The proof is in the supplementary material
A.1).
Then, since the range of
\(\sigma _n\) is
\([0.5,1]\), to make it have range
\([-1,1]\) we simply apply
\(\sigma = 4\sigma _n-3\). We will analyze this equivalence criterion further and compare it to our proposed measure in the
Theoretical Analysis section. We also present an example of the calculation of SEC in the supplementary material, part D.
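As a concrete illustration (a minimal sketch of the SEC construction just described), the computation reduces to a normalized sum of rank products followed by the rescaling to \([-1,1]\):

```python
import numpy as np
from scipy.stats import rankdata

def sec(r_scores, s_scores):
    """Simple Equivalence Criterion between two anomaly scores (higher = more anomalous)."""
    r_rank = rankdata(r_scores)          # rank 1 = least anomalous
    s_rank = rankdata(s_scores)
    n = len(r_scores)
    sigma_n = np.sum(r_rank * s_rank) / np.sum(np.arange(1, n + 1) ** 2)
    return 4.0 * sigma_n - 3.0           # rescale [0.5, 1] -> [-1, 1]

# Identical scores give 1; reversed scores approach -1 for large n
scores = np.random.rand(1000)
print(sec(scores, scores))        # 1.0
print(sec(scores, -scores))       # close to -1
```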
3.3 Toolbox for Proposed Criterion
Let
\(\mathbf {r}\) and
\(\mathbf {s}\) be the anomaly scores of two AD algorithms on a dataset
\(\mathcal {D}\). Let
\(\mathbf {m}=\mathcal {R}(\mathbf {r})\), \(\mathbf {n}=\mathcal {R}(\mathbf {s})\) be the rankings generated from
\(\mathbf {r}\) and
\(\mathbf {s}\). Then,
\(\mathbf {n}\) can be seen as a permutation of
\(\mathbf {m}\) (or vice versa). One can calculate a distance between
\(\mathbf {m}\) and
\(\mathbf {n}\); in our case, we use as distance the minimum number of moves that it would take to convert
\(\mathbf {n}\) into
\(\mathbf {m}\) divided by a normalization factor, which is known as the Kendall Tau correlation coefficient [
19], and it is calculated as follows:
\(\tau (\mathbf {m},\mathbf {n})=\dfrac{\langle \mathbf {m},\mathbf {n}\rangle }{||\mathbf {m}||\,||\mathbf {n}||},\qquad (3)\)
where
\(\langle \mathbf {m},\mathbf {n}\rangle =\sum _{j\lt i}\text{sgn}(m_i-m_j)\text{sgn}(n_i-n_j)\), and
\(||\mathbf {m}||=\sqrt {\langle \mathbf {m},\mathbf {m} \rangle }\). A different perspective on the Kendall Tau coefficient is
\(\tau (\mathbf {m},\mathbf {n})=\dfrac{|\text{concordant pairs}|-|\text{discordant pairs}|}{n(n-1)/2},\)
where concordant pairs refers to the set of pairs about which
\(\mathbf {m}\) and
\(\mathbf {n}\) agree as to which element of the pair has a higher ranking (is greater), and discordant pairs is the set of pairs with disagreement between
\(\mathbf {m}\) and
\(\mathbf {n}\). An example of the calculation of the Kendall Tau correlation coefficient is included in the appendix Section
E.
In subsequent work, Vigna et al. [
29] showed that adding weights to the pairs still preserves many of the properties of the original Kendall Tau coefficient (e.g.,
\(\langle \mathbf {m},\mathbf {n} \rangle\) forms an inner product, and hence one gets Cauchy–Schwarz-like inequalities such as
\(|\langle \mathbf {m} ,\mathbf {n} \rangle |\le ||\mathbf {m}||\,||\mathbf {n}||\)), while also adding new ones. Their formulation of the weighted Kendall Tau coefficient uses the weighted inner product
\(\langle \mathbf {m},\mathbf {n}\rangle _\omega =\sum _{j\lt i}\omega (i,j)\,\text{sgn}(m_i-m_j)\,\text{sgn}(n_i-n_j)\), with \(||\mathbf {m}||_\omega =\sqrt {\langle \mathbf {m},\mathbf {m}\rangle _\omega }\), where \(\omega (i,j)\) is the weight assigned to the pair \((i,j)\).
The weighted Kendall Tau coefficient becomes \(\tau (\mathbf {m},\mathbf {n})_\omega =\tfrac{\langle \mathbf {m},\mathbf {n}\rangle _\omega }{||\mathbf {m}||_\omega \cdot ||\mathbf {n}||_\omega }.\)
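For reference, both the plain coefficient of Equation (3) and a weighted variant in the spirit of [29] are available in SciPy; this is meant only as an orientation, since the weights used by our criterion are defined later.

```python
import numpy as np
from scipy.stats import kendalltau, weightedtau

r = np.array([0.10, 0.20, 0.30, 2.50, 3.00])   # anomaly scores from algorithm r
s = np.array([0.15, 0.10, 0.40, 2.00, 3.50])   # anomaly scores from algorithm s

tau, _ = kendalltau(r, s)            # unweighted Kendall Tau, Equation (3)
wtau, _ = weightedtau(r, s)          # weighted Kendall Tau; by default, disagreements
                                     # among the highest-scored elements weigh more
print(tau, wtau)
```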
This weighted formulation gives leeway to enforce additional desirable properties for our proposed equivalence criterion.
3.4 Desirable Properties of an Equivalence Criterion
These properties are considered in addition to the properties mentioned in Section
3.1. With the properties mentioned here and in Section
3.1, we formalize our intuition about the equivalence criterion between anomaly scores. They are, however, by no means exhaustive and only reflect what we consider important when formulating an equivalence criterion.
The properties are described in terms of viewing one anomaly score as a permutation of another anomaly score. More specifically, a permutation move refers to the number of moves that it would take to move data point
j from the ranking
\(s_j\) so that it has the same ranking given by
\(r_j\), e.g., if
\(\mathbf {s}\) ranks a point
uth, and
\(\mathbf {r}\) ranks the same point
vth, then the permutation move for that point would be
\(|u-v|\) (the sign is not relevant, as one can always fix
\(\mathbf {s}\) and see
\(\mathbf {r}\) as a permutation or vice versa to obtain positive permutation moves). We will consider two types of permutation moves: large permutation moves, which refer to cases where the difference in the rankings of
\(\mathbf {s}\) and
\(\mathbf {r}\) in absolute value is large (this corresponds to element
a in Figure
3); and small or local permutation moves, which refer to cases where the difference in the rankings of
\(\mathbf {s}\) and
\(\mathbf {r}\) in absolute value is small (this corresponds to element
b in Figure
3). Then, the question is how to assign different weights to different permutation moves; this is addressed by our properties, which are:
(1)
A k-move permutation move should reduce the equivalence criterion value more than k 1-move permutation moves.
(2)
Locally moving points that have high anomaly scores should reduce the equivalence criterion value more than locally moving points that have low anomaly scores.
The rationale behind these two properties is as follows. With respect to property (1), we first have an immediate corollary: l k-move permutation moves should reduce the equivalence criterion value more than k l-move permutation moves as long as \(l\lt k\). We will refer to l-move permutations as small disagreements, and to k-move permutations as major disagreements. Intuitively, property (1) is desirable because small disagreements indicate that the anomaly scores differ locally, hence those local differences should not reduce the equivalence criterion value significantly; on the contrary, major disagreements are signs that the anomaly scores are fundamentally different, hence they should reduce the equivalence criterion value more dramatically. An example of what property (1) implies is: if \(\mathbf {r}\) is an anomaly score, and \(\mathbf {s}=\lbrace r_i+\epsilon _i\rbrace _{i=1}^n\) for small \(\epsilon _i\), then \(\mathcal {C}(\mathbf {r},\mathbf {s})\) should be close to 1. With respect to property (2): high-density regions of the anomaly score are more susceptible to having discordant pairs by pure randomness, as points are close together; these high-density regions correspond precisely to the data points that have low anomaly scores, because that is where all non-anomalous data points reside (recall that the majority of the points are not anomalous). Therefore, discordant pairs in these high-density regions should have lower weights, which is equivalent to saying that non-high-density regions should have larger weights; in other words, regions where the anomalous points reside should have higher weights.
In the Theoretical Analysis section, we mathematically define these two properties and show that our proposed equivalence criterion satisfies both.
4 Gaussian Equivalence Criterion (GEC)
The analysis so far uses rankings, instead of anomaly scores, because the work done in [
19,
29] was formulated in terms of rankings. However, Equation (
3) and its weighted version can be calculated with anomaly scores because the ranking map
\(\mathcal {R}:\mathbb {R}\rightarrow \lbrace 1,2,\ldots ,n\rbrace \subset \mathbb {N}\) is order preserving. Moreover, converting
\(\mathbf {s}\) into
\(\mathbf {r}\) takes a permutation and a re-scaling map
g that matches the values of
\(\mathbf {r}\) with those of the permuted version of
\(\mathbf {s}\). While our equivalence criterion does not concern itself with
g, it explicitly uses the anomaly scores,
\(\mathbf {r}\) and
\(\mathbf {s}\), in the weight calculation.
4.1 Algorithm for GEC
Sort and then divide \(\mathbf {r}\) and \(\mathbf {s}\) into k subsets that correspond to different classes within the anomaly scores; e.g., the normal (non-anomalous) set can go from the 0th to the 90th percentile, the 90th to the 98th percentile can be the grey-area set, and the 98th to the 100th percentile can be the anomalous set, which would give \(k=3\). Then, from the points corresponding to each subset, calculate the means \(\mu =\lbrace \mu _1,\mu _2,\ldots ,\mu _k\rbrace\) and the standard deviations \(\sigma =\lbrace \sigma _1,\sigma _2,\ldots ,\sigma _k\rbrace\).
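A minimal sketch of this first step (the percentile boundaries below are the example values given above and can be changed):

```python
import numpy as np

def class_stats(scores, boundaries=(0.90, 0.98)):
    """Split an anomaly score into k = len(boundaries) + 1 percentile bands
    (e.g., normal / grey area / anomalous) and return their means and stds."""
    s = np.sort(scores)
    cuts = [int(b * len(s)) for b in boundaries]
    subsets = np.split(s, cuts)                        # k contiguous percentile bands
    mu = np.array([band.mean() for band in subsets])
    sigma = np.array([band.std() for band in subsets])
    return mu, sigma

mu_r, sigma_r = class_stats(np.random.rand(1000))      # statistics for r; repeat for s
```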
Then, GEC of
\(\mathbf {r}\) and
\(\mathbf {s}\),
\(\phi (\mathbf {r}, \mathbf {s})\), is calculated as a weighted Kendall Tau coefficient (Section 3.3) whose weights are derived from the class statistics \(\mu\) and \(\sigma\) through the function \(F_{\mu ,\sigma }\) discussed in Section 4.2.
We propose that two anomaly scores are equivalent if their GEC is greater than 0.8; this, however, is an empirical observation whose analysis is discussed in the Results section. Depending on the application, one might use a threshold different from 0.8. GEC is clearly symmetric, i.e.,
\(\phi (\mathbf {r},\mathbf {s})=\phi (\mathbf {s},\mathbf {r})\), and it exhibits a resemblance to the transitive property, that is, if
\(\mathbf {r}\) is equivalent to
\(\mathbf {s}\) and
\(\mathbf {s}\) is equivalent to
\(\mathbf {p}\), then
\(\mathbf {r}\) is equivalent to
\(\mathbf {p}\). Therefore, if there are
n different anomaly scores, there is no need to run the equivalence algorithm
\(\mathcal {O}(n^2)\) times; instead,
\(\mathcal {O}(n\log n)\) times suffices to determine the equivalence relations among all of them. The pseudocode for the GEC is provided in the supplementary material Section
E.
4.2 Intuition about GEC
In this section, we will discuss the intuition and motivation of each of the steps of GEC.
Anomaly scoring functions associate levels of anomaly with datapoints, i.e., the higher the score, the more anomalous the point. Dividing the anomaly scores into subsets the way we propose allows us to distinguish between subsets of points that are highly anomalous and subsets of points that are not, which is beneficial because different subsets can then be treated differently. This distinction, while helpful, creates an issue, namely, some points might be arbitrarily close but be assigned to different subsets. The way the weights are defined resolves this issue, as points that are arbitrarily close will have the same \(F_{\mu ,\sigma }\) value and hence the same weight, i.e., \(F_{\mu ,\sigma }(x_1,x_2)\approx F_{\mu ,\sigma }(x_1,x_2+\epsilon)\) for some small \(\epsilon\); and more generally, \(\omega _r(i,j)\) will be approximately equal to \(\omega _r(i,j+m)\) for a small m, that is, the weight function is smooth. In addition, this weighting scheme has the property that all weights are bounded by k.
To sum up, the partition of the anomaly scores into subsets introduces the notion of classes (e.g., anomalous, normal, grey area), and the weighting scheme gives a sensible notion of distance between anomaly score points that is smooth and bounded, thus resulting in an equivalence criterion that allows flexibility while also being robust.
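Purely as an illustration of a weight construction with these properties (the specific Gaussian-CDF form below is our own assumption, not the exact definition used by GEC), one smooth function bounded by k could be built as follows:

```python
import numpy as np
from scipy.stats import norm

def make_F(mu, sigma):
    """Illustrative F_{mu,sigma}: sum over the k classes of absolute Gaussian-CDF
    differences. Nearby scores get nearly identical values (smoothness) and the
    result can never exceed k (boundedness)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    def F(x1, x2):
        return float(np.sum(np.abs(norm.cdf(x1, mu, sigma) - norm.cdf(x2, mu, sigma))))
    return F

# Hypothetical class statistics for k = 3 classes (normal / grey area / anomalous)
F = make_F(mu=[0.2, 0.7, 0.95], sigma=[0.10, 0.05, 0.02])
print(F(0.30, 0.31))   # close scores   -> small value
print(F(0.30, 0.99))   # distant scores -> larger value, but always below k = 3
```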