Within a large database 𝒢 containing graphs with labeled nodes and directed multi-edges, how can we detect the anomalous graphs? Most existing work is designed for plain (unlabeled) and/or simple (unweighted) graphs. We introduce CODEtect, the first approach that addresses the anomaly detection task for graph databases of such complex nature. To this end, it identifies a small representative set 𝒮 of structural patterns (i.e., node-labeled network motifs) that losslessly compress database 𝒢 as concisely as possible. Graphs that do not compress well are flagged as anomalous. CODEtect exhibits two novel building blocks: (i) a motif-based lossless graph encoding scheme, and (ii) fast, memory-efficient search algorithms for 𝒮. We show the effectiveness of CODEtect on transaction graph databases from three different corporations and on statistically similar synthetic datasets, where existing baselines adjusted for the task fall behind significantly across different types of anomalies and performance metrics.
1 Introduction
Given hundreds of thousands of annual transaction records of a corporation, how can we identify the abnormal ones, which may indicate entry errors or employee misconduct? How can we spot anomalous daily e-mail/call interactions or software programs with bugs?
We introduce a novel anomaly detection technique called CODEtect for node-Labeled, Directed, Multi-graph (LDM) databases, which emerge in many applications. Our motivating domain is accounting, where each transaction record is represented by a graph in which the nodes are accounts and directed edges reflect transactions. Account types (revenue, expense, etc.) are depicted by (discrete) labels, and separate transactions between the same pair of accounts create edge multiplicities. The problem is then to identify the anomalous graphs within LDM graph databases. In these abstract terms, CODEtect applies more broadly to other domains exhibiting graph data of such complex nature, e.g., detecting anomalous employee e-mail graphs with job titles as labels, call graphs with geo-tags as labels, control-flow graphs with function calls as labels, and so on. What is more, CODEtect can handle simpler settings with any subset of the LDM properties.
Graph anomaly detection has been studied under various non-LDM settings. Most of these works focus on detecting anomalies within a single graph, which is either plain (i.e., unlabeled), attributed (nodes exhibiting an array of, often continuous, features), or dynamic (the graph changes over time) [1, 2, 14, 20, 28, 38, 39, 46] (see Table 1 for an overview). None of these applies to our setting, as we aim to detect graph-level anomalies within a graph database. There exists related work for node-labeled graph databases [34], which, however, does not handle multi-edges and, as we show in the experiments (Section 6), cannot tackle the problem well.
Table 1. Comparison with Popular Approaches to Graph Anomaly Detection, in Terms of Distinguishing Properties
Recently, general-purpose embedding/representation learning techniques have achieved state-of-the-art results in graph classification tasks [15, 16, 19, 31, 33, 37]. However, they do not tackle the anomaly detection problem per se: the embeddings need to be fed into an off-the-shelf vector outlier detector. Moreover, most embedding methods [15, 16, 19] produce node embeddings, and how to use those for graph-level anomalies is unclear. Trivially aggregating node representations to obtain an entire-graph representation, e.g., by mean or max pooling, provides suboptimal results [31]. Graph embedding techniques [31, 37] as well as graph kernels [45, 54] (paired with a state-of-the-art detector) yield poor performance, as we show through experiments (Section 6), possibly because embeddings capture general patterns and leave out rare structures, which are critical for anomaly detection.
Our main contributions are summarized in the following:
–
Problem Formulation: Motivated by applications in business accounting, we consider the anomaly detection problem in LDM databases and propose CODEtect, (to our knowledge) the first method to detect anomalous graphs of such complex nature (Section 2). CODEtect also applies more generally to simpler, non-LDM settings. The main idea is to identify a few representative network motifs that are used to encode the database in a lossless fashion as succinctly as possible. CODEtect then flags those graphs that do not compress well under this encoding as anomalous (Section 3).
–
New Encoding & Search Algorithms: The graph encoding problem is two-fold: how to encode and which motifs to encode with. To this end, we introduce (1) new lossless motif and graph encoding schemes (Section 4), and (2) efficient search algorithms for identifying key motifs with a goal to minimize the total encoding cost (Section 5).
–
Real-world Application: In collaboration with industry, we apply our proposed techniques to annual transaction records from three different corporations, from small- to large-scale. We show the superior performance of CODEtect over existing baselines in detecting injected anomalies that mimic certain known malicious schemes in accounting. Case studies on those as well as the public Enron e-mail database further show the effectiveness of CODEtect in spotting noteworthy instances (Section 6). To facilitate reproducibility, we also confirm our performance advantages on statistically similar datasets resembling our real-world databases.
2 Related Work
Graph Anomaly Detection: Graph anomaly detection has been studied under various settings for plain/attributed, static/dynamic, etc. graphs, including the most recent deep learning based approaches [1, 11, 14, 20, 38, 39, 46, 57] (see [2] and [27] for surveys). These works focus on detecting node/edge/subgraph anomalies within a single graph, none of which applies to our setting, as we aim to detect anomalous graphs (or graph-level anomalies) within a graph database.
On anomalous graph detection in graph databases, Gbad [12] has been applied to flag graphs as anomalous if they experience low compression via the substructures discovered over its iterations. Further, it has been used to identify graphs that contain substructures \(S^{\prime }\) with small differences (a few modifications, insertions, or deletions) from the best one S, which are attributed to malicious behavior [12]. Gbad also has very high time complexity due to the nested searches for substructures to compress the graphs over many iterations (it failed to complete on multiple cases in our experiments; see Section 6). Our work is along the same lines as these works in principle; however, our encoding scheme is lossless. Moreover, these works cannot handle graphs with weighted/multi-edges. There exist other graph anomaly detection approaches [14, 28]; however, none of them simultaneously handles node-labeled graphs with multi-edges. SnapSketch [36] was recently introduced as an unsupervised graph representation approach for intrusion detection in a graph stream and showed better detection than previous works [14, 28]; however, SnapSketch was originally designed for undirected graphs. Note also that these works [14, 28, 36] focus on graph streams, i.e., time-ordered graphs, and may not work well in our setting of unordered graph databases. We present a qualitative comparison of related work to CODEtect in Table 1.
Graph Embedding for Anomaly Detection: Recent graph embedding methods [11, 15, 16, 19, 31, 33, 37, 57] and graph kernels [45, 54] find a latent representation of a node, a subgraph, or the entire graph and have been shown to perform well on classification and link prediction tasks. However, graph embedding approaches like [15, 16, 19] learn node representations, which are difficult to use directly for detecting anomalous graphs. Peng et al. [37] propose a graph convolutional neural network via motif-based attention; however, this is a supervised method and, thus, not suitable for anomaly detection. Our experimental results show that other recent graph embedding [31] and graph kernel [54] methods that produce a direct graph representation, when combined with a state-of-the-art anomaly detector, have low performance and are far less accurate than CODEtect. Concurrent to our work, graph neural networks for anomalous graph detection are studied in [26, 56], which examine end-to-end graph anomaly detection. Furthermore, [6] investigates outlier-resistant architectures for graph embedding. A key challenge, in general, for deep learning-based models for unsupervised anomaly detection is their sensitivity to many hyper-parameter settings (including those for regularization, such as weight decay and drop-out rate; optimization, such as learning rate; and architecture, such as depth and width), which are not straightforward to set in the absence of any ground-truth labels. Distinctly, our work leverages the Minimum Description Length principle and does not exhibit any hyper-parameters.
Graph Motifs: Network motifs have proven useful in understanding the functional units and organization of complex systems [7, 30, 51]. Motifs have also been used as features for network classification [29], community detection [5, 55], and in graph kernels for graph comparison [45]. On the algorithmic side, several works have designed fast techniques for identifying significant motifs [8, 9, 21, 22], where a sub-graph is regarded as a motif only if its frequency is higher than expected under a network null model.
Prior works on network motifs mainly focus on 3- or 4-node motifs in undirected unlabeled/plain graphs [4, 13, 42, 50], either using subgraph frequencies in the analysis of complex networks or, most often, developing fast algorithms for counting (e.g., triangles) (see [43] for a recent survey). Others have also studied directed [7] and temporal motifs [24, 35]. Most relatedly, there is recent work on node-labeled subgraphs referred to as heterogeneous network motifs [41], where again the focus is on scalable counting. Our work differs in using heterogeneous motifs as building blocks of a graph encoding scheme, toward the goal of anomaly detection.
Data Compression via MDL-Encoding: The Minimum Description Length (MDL) principle by Rissanen [40] states that the best theory to describe data is the one that minimizes the sum of the size of the theory and the size of the description of the data using the theory. The use of MDL has a long history in itemset mining [48, 52] for transaction (tabular) data, also applied to anomaly detection [3, 47].
MDL has also been used for graph compression. Given a pre-specified list of well-defined structures (star, clique, etc.), it is employed to find a succinct description of a graph in those “vocabulary” terms [23]. This vocabulary is later extended for dynamic graphs [44]. A graph is also compressed hierarchically, by sequentially aggregating sets of nodes into super-nodes, where the best summary and associated corrections are found with the help of MDL [32].
There exists some work on attributed graph compression [49], but the goal is to find super-nodes that represent sets of nodes that are homogeneous in some (user-specified) attributes. Subdue [34] is one of the earliest works to employ MDL for substructure discovery in node-labeled graphs. The aim is to extract the “best” substructure S such that its encoding plus the encoding of the graph after replacing each instance of S with a (super-)node is as small as possible.
3 Preliminaries & The Problem
As input, a large set of J graphs \(\mathcal {G}= \lbrace G_1, \ldots , G_J\rbrace\) is given. Each graph \(G_j=(V_j, E_j, \tau)\) is a directed, node-labeled, and multi-graph, which may contain multiple edges that have the same end nodes. \(\tau : V_j \rightarrow \mathcal {T}\) is a function that assigns labels from an alphabet \(\mathcal {T}\) to nodes in each graph. The number of realizations of an edge \((u,v)\in E_j\) is called its multiplicity, denoted \(m(u,v)\). (See Figure 1(a), for example.)
Fig. 1.
Our motivating domain is business accounting, in which each \(G_j\) corresponds to a graph representation of what-is-called a “journal entry”: a detailed transaction record. Nodes capture the unique accounts associated with the record, directed edges the transactions between these accounts, and node labels the financial statement (FS) account types (e.g., assets, liabilities, revenue). Bookkeeping data are kept as a chronological listing (called General Ledger) of each separate business transaction, where multiple transactions involving same account-pairs generate multi-edges between two nodes.
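To make the data model concrete, the following minimal Python sketch stores one journal-entry graph with node labels \(\tau\) and edge multiplicities \(m(u,v)\). The account names, labels, and transactions are invented for illustration and do not come from any of the paper's datasets.

```python
from collections import Counter

# A minimal sketch of one LDM graph G_j = (V_j, E_j, tau): node labels drawn
# from an alphabet T, and directed multi-edges stored as multiplicities m(u, v).
class LDMGraph:
    """Node-labeled, directed multi-graph: labels tau(v) and multiplicities m(u, v)."""

    def __init__(self):
        self.tau = {}        # node -> label in alphabet T
        self.m = Counter()   # (u, v) -> edge multiplicity

    def add_node(self, v, label):
        self.tau[v] = label

    def add_edge(self, u, v):
        self.m[(u, v)] += 1  # repeated transactions raise multiplicity

# One "journal entry" graph: accounts as nodes, FS account types as labels.
g = LDMGraph()
g.add_node("cash", "assets")
g.add_node("sales", "revenue")
g.add_node("rent", "expense")
g.add_edge("sales", "cash")
g.add_edge("sales", "cash")          # second transaction between same pair
g.add_edge("cash", "rent")

print(g.m[("sales", "cash")])        # multiplicity m(u, v) = 2
```

A database \(\mathcal{G}\) is then simply a collection of such graphs, one per journal entry.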
Our high-level idea for finding anomalous graphs in database \(\mathcal {G}\) is to identify key characteristic patterns of the data that “explain” or compress the data well, and flag those graphs that do not exhibit such patterns as expected—simply put, graphs that do not compress well are anomalous. More specifically, graph patterns are substructures or subgraphs, called motifs, which occur frequently within the input graphs. “Explaining” the data means encoding each graph using the frequent motifs it contains. The more frequent motifs we use for encoding, the more we can compress the data, simply by encoding the existence of each such motif with a short code.
The goal is to find a (small) set of motifs that compresses the data the best. Building on the MDL principle [17], we aim at finding a model, namely, a motif table (denoted \({MT}\)) that contains a carefully selected subset of graph motifs, such that the total code length of (1) the model itself plus (2) the encoding of the data using the model is as small as possible. In other words, we are after a small model that compresses the data the most. The two-part objective of minimizing the total code length is given as follows:
\[ {MT}^{*} = \mathop{\arg \min }_{{MT}\,\in\, \mathcal {MT}} \; L({MT}) + L(\mathcal {G} \mid {MT}), \qquad (1) \]
where \(\mathcal {MT}\) denotes the set of all possible candidate motif tables. The first term can be seen as a model regularizer that penalizes using an unnecessarily large set of motifs to explain the data. The second term is the compression length of the data with the (selected) motifs and decomposes as \(L(\mathcal {G} | {MT}) = \sum _j L({G_j} | {MT})\), since individual journals are independent. The encoding length \(L({G_j} | {MT})\) is also the anomaly score for the jth graph: the larger, the more anomalous.
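Once a motif table is fixed, scoring reduces to ranking graphs by their encoding lengths. A minimal sketch with made-up per-graph lengths:

```python
# Sketch: per-graph encoding lengths L(G_j | MT) double as anomaly scores,
# so ranking graphs by descending length surfaces the most anomalous ones.
# The length values below are illustrative placeholders, not computed ones.
lengths = {"G1": 120.5, "G2": 98.0, "G3": 310.2, "G4": 101.7}
ranking = sorted(lengths, key=lengths.get, reverse=True)
print(ranking)   # most anomalous graph first
```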
As such, we have a combinatorial subset selection problem toward optimizing Equation (1). To this end, we address two subproblems outlined below.
4 Encoding Schemes
MDL-encoding of a dataset with a model can be thought of as involving a Sender and a Receiver communicating over a channel, where the Sender—who has sole knowledge of the data—generates a bitstring based on which the Receiver can reconstruct the original data on their end losslessly. To this end, the Sender first sends over the model, in our case the motif table \({MT}\), which can be thought of as a “code-book” that establishes a certain “language” between the Sender and the Receiver. The Sender then encodes the data instances using the code-words in the “code-book”; in our case, the graphs \(G_j\) using the motifs in the \({MT}\).
It is important to note that existing works [9, 10] presented different graph/substructure encoding schemes; however, each has its own limitations and does not work in our setting. The encoding in [10] is lossy and may not reflect accurate code lengths, while the work in [9] is restricted to simple graphs with non-overlapping nodes across motif occurrences. Our encoding algorithm is both lossless and applicable to multi-graphs with node-overlapping occurrences of motifs.
4.1 Encoding the Motif Table
The motif table \({MT}\) is simply a two-column translation table that has motifs in the first column, and a unique code-word corresponding to each motif in the second column, as illustrated in Figure 1(b). We use \(\mathcal {M}\) to refer to the set of motifs in \({MT}\). A motif, denoted by (lower-case) g, is a connected, directed, node-labeled, simple graph, with possible self-loops on the nodes. For \(g\in \mathcal {M}\), \(code_{{MT}}(g)\) (or c for short) denotes its code-word.1
To encode the motif table, the Sender encodes each individual motif \(g_i\) in \({MT}\) and also sends over the code-word \(c_i\) that corresponds to \(g_i\). Afterwards, for encoding a graph with the \({MT}\), every motif that finds an occurrence in the graph is simply communicated through its unique code-word only.
The specific materialization of the code-words (i.e., the bitstrings themselves) is not as important to us as their lengths, which affect the total graph encoding length. Each code length \(L(c_i)\) depends on the number of times that \(g_i\) is used in the encoding of the graphs in \(\mathcal {G}\), denoted \(usage_{\mathcal {G}}(g_i)\)—intuitively, the more frequently a motif is used, the shorter its code-word, so as to achieve compression (analogous to compressing text by assigning frequent words a short code-word). Formally, the optimal prefix code length for \(g_i\) can be calculated through the Shannon entropy [40]:
\[ L(c_i) = -\log _2 \left(\frac{usage_{\mathcal {G}}(g_i)}{\sum _{g\in \mathcal {M}} usage_{\mathcal {G}}(g)}\right). \]
We provide the details of how the motif usages are calculated in the next section, when we introduce graph encoding.
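The relation between usage and code length can be sketched in a few lines; the usage counts here are made up for illustration:

```python
import math

# Shannon-optimal code-word lengths from motif usages: a motif's code length
# is the negative log of its relative usage over the whole database.
usage = {"g1": 8, "g2": 4, "g3": 2, "g4": 2}   # usage_G(g_i), illustrative

total = sum(usage.values())
code_len = {g: -math.log2(u / total) for g, u in usage.items()}

print(code_len["g1"])   # -log2(8/16) = 1.0 bit (most used, shortest code)
print(code_len["g3"])   # -log2(2/16) = 3.0 bits
```

More frequently used motifs thus receive shorter code-words, which is exactly what drives the compression.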
Next, we present how a motif \(g_i\) is encoded. Let \(n_i\) denote the number of nodes it contains (e.g., \(g_1\) in Figure 1(b) contains three nodes). The encoding proceeds recursively in a Depth First Search (DFS)-like fashion, as given in Algorithm 1. As noted, the encoding lengths summed over the course of the algorithm provide \(L(g_i)\), which can be explicitly written as follows:
where we first encode the number of unique node labels, followed by the entries (motifs and codes) in the motif table.
4.2 Encoding a Graph Given the Motif Table
To encode a given graph \(G_j\) based on a motif table \({MT}\), we “cover” its edges by a set of motifs in the \({MT}\). To formally define coverage, we first introduce a few definitions.
Given a motif occurrence \(g_{ij}\), we say that \(g_{ij}\) covers the edge set \(E_{ij}\subseteq E_j\) of \(G_j\). The task of encoding a graph \(G_j\) is then to cover all of its edges \(E_j\) using the motifs in \({MT}\).
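A quick way to see the covering requirement: the edge sets of the chosen occurrences must together account for every edge of \(G_j\) without overlapping. A toy check in Python (the edge sets are invented, and edge multiplicities are ignored for brevity):

```python
# Sketch: verify that a set of motif-occurrence edge sets covers E_j exactly,
# i.e., every edge is covered and no two occurrences share an edge.
E_j = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
occ_edges = [{("a", "b"), ("b", "c"), ("c", "a")}, {("c", "d")}]

covered = set().union(*occ_edges)
disjoint = sum(len(s) for s in occ_edges) == len(covered)
print(covered == E_j and disjoint)   # True: a valid, non-overlapping cover
```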
5 Search Algorithm
Our aim is to compress as large a portion of the input graphs as possible using motifs. This goal can be restated as finding a large set of non-overlapping motif occurrences that cover these graphs. We set up this problem as an instance of the Maximum Independent Set (MIS) problem on what we call the occurrence graph \({G_\mathcal {O}}\). In \({G_\mathcal {O}}\), the nodes represent motif occurrences, and edges connect two occurrences that share a common edge. MIS ensures that occurrences in the solution are non-overlapping (thanks to independence, no two are incident to the same edge). Moreover, it helps us identify motifs that have large usages, i.e., numbers of non-overlapping occurrences (thanks to maximality), which are associated with shorter code lengths and hence better compression.
In this section, first we describe how we set up and solve the MIS problems, which provides us with a set of candidate motifs that can go into the \({MT}\) as well as their (non-overlapping) occurrences in input graphs. We then present a search procedure for selecting a subset of motifs among those candidates to minimize the total encoding length in Equation (1).
5.1 Step 1: Identifying Candidate Motifs & Their Occurrences
As a first attempt, we explicitly construct a \({G_\mathcal {O}}\) per input graph and solve MIS on it. Later, we present an efficient way for solving MIS without constructing any \({G_\mathcal {O}}\)’s, which cuts down memory requirements drastically.
5.1.1 First Attempt: Constructing G𝒪’s Explicitly.
Occurrence Graph Construction. For each \(G_j\in \mathcal {G},\) we construct a \({G_\mathcal {O}}=({V_\mathcal {O}},{E_\mathcal {O}})\) as follows. For \(k=3,\ldots ,10\), we enlist all connected induced k-node subgraphs of \(G_j\), each of which corresponds to an occurrence of some k-node motif in \(G_j\). All those define the node set \({V_\mathcal {O}}\) of \({G_\mathcal {O}}\). If any two enlisted occurrences share at least one common edge in \(G_j\), we connect their corresponding nodes in \({G_\mathcal {O}}\) with an edge.
Notice that we do not explicitly enumerate all possible k-node labeled motifs and then identify their occurrences in \(G_j\), if any, which would be expensive due to subgraph enumeration (especially with many node labels) and numerous graph isomorphism tests. The above procedure yields occurrences of all existing motifs in \(G_j\) implicitly.
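The enumeration step can be sketched as follows for a single k, on a tiny invented edge set; a real implementation would iterate over k = 3, …, 10 and account for edge multiplicities:

```python
from itertools import combinations

# Sketch: enlist all connected induced k-node subgraphs of a directed graph;
# each one is an occurrence of some (implicit) k-node motif.
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
nodes = {u for e in edges for u in e}

def connected(subset):
    """Weak-connectivity check of the induced subgraph on `subset`."""
    sub = [(u, v) for (u, v) in edges if u in subset and v in subset]
    seen, frontier = {subset[0]}, [subset[0]]
    while frontier:
        x = frontier.pop()
        for u, v in sub:
            for y in ((v,) if u == x else (u,) if v == x else ()):
                if y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen == set(subset)

k = 3
occurrences = [s for s in combinations(sorted(nodes), k) if connected(s)]
print(occurrences)   # connected induced 3-node subgraphs
```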
Greedy \({MIS}\) Solution. For the \({MIS}\) problem, we employ a greedy algorithm (Algorithm 3) that sequentially selects the node with the minimum degree to include in the solution set \(\mathcal {O}\), then removes it along with its neighbors from the graph, until \({G_\mathcal {O}}\) is empty. Let \(deg_{{G_\mathcal {O}}}(v)\) denote the degree of node v in \({G_\mathcal {O}}\) (the initial one, or the one obtained after removing nodes over the course of the algorithm), and \(\mathcal {N}(v) = \lbrace u \in {V_\mathcal {O}}| (u,v) \in {E_\mathcal {O}}\rbrace\) be the set of v’s neighbors in \({G_\mathcal {O}}\).
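A compact version of this greedy heuristic, on a small invented occurrence graph given as an adjacency map:

```python
# Sketch of greedy minimum-degree MIS: repeatedly take a minimum-degree node
# into the solution and delete it together with its neighbors, until empty.
adj = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}

def greedy_mis(adj):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # mutable local copy
    solution = set()
    while adj:
        v = min(adj, key=lambda x: len(adj[x]))       # minimum-degree node
        solution.add(v)
        for u in adj[v] | {v}:                        # delete v and N(v)
            for w in adj.pop(u):
                adj.get(w, set()).discard(u)
    return solution

print(greedy_mis(adj))
```

Maintaining node degrees in a priority heap (with lazy updates) gives the \(O((|V_\mathcal{O}| + |E_\mathcal{O}|)\log |V_\mathcal{O}|)\) variant discussed in the complexity analysis.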
Approximation ratio: The greedy algorithm for MIS provides a \(\Delta\)-approximation [18] where \(\Delta = \max _{v \in {V_\mathcal {O}}} deg_{{G_\mathcal {O}}}(v)\). In our case, \(\Delta\) could be fairly large, e.g., in 1,000s, as many occurrences overlap due to edge multiplicities. Here, we strengthen this approximation ratio to \(\min \lbrace \Delta , \Gamma \rbrace\) as follows, where \(\Gamma\) is around 10 in our data.
Proof. Let \(MIS(G_{\mathcal {O}})\) denote the number of nodes in the optimal solution set for MIS on \(G_{\mathcal {O}}\). We prove that
Let \(G_{\mathcal {O}}(v)\) be the vertex-induced subgraph on \(G_{\mathcal {O}}\) by the vertex set \(\mathcal {N}_v = \lbrace v\rbrace \cup \mathcal {N}(v)\) when node v is selected by the greedy algorithm to include in \(\mathcal {O}\). \(G_{\mathcal {O}}(v)\) has \(\lbrace (u,w) \in E_{\mathcal {O}} | u,w \in \mathcal {N}_v\rbrace\) as the set of edges. In our greedy algorithm, when a node v is selected to the solution set \(\mathcal {O}\), v and all of its current neighbors are removed from \(G_{\mathcal {O}}\). Hence,
\[\begin{eqnarray*} \forall u, v \in \mathcal {O}, \mathcal {N}_u \cap \mathcal {N}_v = \emptyset . \nonumber \nonumber \end{eqnarray*}\]
In other words, all the subgraphs \(G_{\mathcal {O}}(v), v \in \mathcal {O}\) are independent (they do not share any nodes or edges). Moreover, since the greedy algorithm runs until \(V_{\mathcal {O}} = \emptyset\), i.e., until all the nodes are removed from \(G_{\mathcal {O}}\), we also have
This upper-bound on \(MIS(G_{\mathcal {O}})\) can easily be seen from the fact that the optimal solution set in \(G_{\mathcal {O}}\) can be decomposed into subsets, where each subset is an independent set for a subgraph \(G_{\mathcal {O}}(v)\); hence, it is only a feasible solution of the \(MIS\) problem on that subgraph, whose own optimal solution can only be larger.
Next, we will find an upper-bound for \(MIS(G_{\mathcal {O}}(v))\) and then plug it back in Equation (11) to derive the overall bound. There are two different upper-bounds for \(MIS(G_{\mathcal {O}}(v))\) as follows:
Upper-bound 1: Assume that \(E_{\mathcal {O}} \ne \emptyset\); otherwise, the greedy algorithm finds the optimal solution, which is \(V_{\mathcal {O}}\), and the theorem automatically holds. We establish the first bound:
Note that \(\Delta\) is the maximum degree in the initial graph, which is always larger or equal to the maximum degree at any point after removing nodes and edges.
Upper-bound 2: Let us consider the case that the optimal solution for MIS on \(G_{\mathcal {O}}(v)\) is a subset of \(\mathcal {N}(v)\). Note that each node \(v \in V_{\mathcal {O}}\) is associated with an occurrence, which contains a set of edges in \(G_j\), denoted by \(E_j(v) = \lbrace e_{v1}, e_{v2}, \dots \rbrace\), and u and v are neighbors if \(E_j(u) \cap E_j(v) \ne \emptyset\). Moreover, any pair of nodes \(u, w\) in the optimal independent set solution of \(G_{\mathcal {O}}(v)\) satisfies \(E_j(u) \cap E_j(w) = \emptyset\). Thus, the largest possible independent set is \(\lbrace u_1, \dots , u_{|E_j(v)|}\rbrace\) such that \(E_j(u_l) \cap E_j(v) = \lbrace e_{vl}\rbrace\) (\(u_l\) must be a minimal neighbor, sharing a single edge with v’s occurrence) and \(E_j(u_l) \cap E_j(u_k) = \emptyset , \forall l \ne k\) (every node is independent of the others). Therefore, we derive the second upper-bound:
Complexity analysis: Algorithm 3 requires finding the node with minimum degree (line 3) and removing nodes and incident edges (line 5); hence, a naïve implementation has \(O(|{V_\mathcal {O}}|^2 + |{E_\mathcal {O}}|)\) time complexity: \(O(|{V_\mathcal {O}}|^2)\) for searching for the node with minimum degree at most \(|{V_\mathcal {O}}|\) times (line 3) and \(O(|{E_\mathcal {O}}|)\) for updating node degrees (line 5). If we use a priority heap to maintain node degrees for a quick minimum search, the time complexity becomes \(O((|{V_\mathcal {O}}| + |{E_\mathcal {O}}|) \log |{V_\mathcal {O}}|)\): \(|{V_\mathcal {O}}| \log |{V_\mathcal {O}}|\) for constructing the heap initially and \(|{E_\mathcal {O}}| \log |{V_\mathcal {O}}|\) for updating the heap every time a node and its neighbors are removed (this includes deleting the degrees of the removed nodes and updating those of all of \(\mathcal {N}(v)\)’s neighbors).
Algorithm 3 requires \(O(|{V_\mathcal {O}}| + |{E_\mathcal {O}}|)\) space to store the occurrence graph.
5.1.2 Memory-Efficient Solution: MIS w/out Explicit G𝒪.
Notice that the size of each \({G_\mathcal {O}}\) can be very large due to the combinatorial number of induced k-node subgraphs, which demands huge memory and time. Here, we present an efficient version of the greedy algorithm that drastically cuts down the input size to MIS. Our new algorithm leverages a property of degree-equivalent occurrences (Property 1), which we build up to with the following definition.
Let us introduce a new notion called a simple occurrence. A simple occurrence \(sg_{ij} = (V_{ij}, E_{ij})\) is a simple subgraph of \(G_j\), without edge multiplicities, that is isomorphic to a motif. Let \(\lbrace sg_{1j}, sg_{2j}, \dots , sg_{tj}\rbrace\) be the set of all simple occurrences in \(G_j\). Note that two simple occurrences may correspond to the same motif.
Recall that the greedy algorithm only requires the node degrees in \({G_\mathcal {O}}\). Since all the nodes corresponding to occurrences that “spring out” of a simple occurrence have the same degree (Property 1), we simply use the simple occurrence as a “compound node” in place of all those degree-equivalent occurrences.2 As such, the nodes in \({G_\mathcal {O}}\) now correspond to simple occurrences only. The degree of each node (say, \(sg_{ij}\)) is calculated as follows:
where \(m(u,v)\) is the multiplicity of edge \((u,v)\) in \(G_j\). The first line of Equation (15) depicts the “internal degree” among the degree-equivalent occurrences that originate from \(sg_{ij}\). The rest captures the “external degree” to other occurrences that have an overlapping edge.
Memory-Efficient Greedy \({MIS}\) Solution. The detailed steps of our memory-efficient greedy \({MIS}\) algorithm are given in Algorithm 4. We first calculate the degrees of all simple occurrences (line 2) and then sequentially select the one with the minimum degree, denoted \(sg_{i^*j}\), and include it in the solution list \(\mathcal {S}\) (lines 5, 6). To account for this selection, we decrease the multiplicities of all its edges \((u,v)\in E_{i^{*}j}\) by 1 (line 7) and recalculate the degrees of the simple occurrences that overlap with \(sg_{i^{*}j}\) (lines 8–12). If one of those simple occurrences with an edge set intersecting that of \(sg_{i^{*}j}\) contains at least one edge with multiplicity equal to 0 (due to decreasing edge multiplicities in line 7), a special value of \(deg_{\max }+1\) (line 3) is assigned as its degree. This signifies that this compound node contains no more occurrences and is not to be considered in subsequent iterations (line 4).
Notice that we need not even construct an occurrence graph in Algorithm 4, which directly operates on the set of simple occurrences in \(G_j\), computing and updating degrees based on Equation (15). Note that the same simple occurrence could be picked more than once by the algorithm. The number of times a simple occurrence appears in \(\mathcal {S}\) is exactly the number of non-overlapping occurrences that spring out of it and get selected by Algorithm 3 on \({G_\mathcal {O}}\). As such \(\mathcal {O}\) and \(\mathcal {S}\) have the same cardinality and each motif has the same number of occurrences and simple occurrences in \(\mathcal {O}\) and \(\mathcal {S}\), respectively. As we need the number of times each motif is used in the cover set of a graph (i.e., its usage), both solutions are equivalent.
Complexity analysis: Calculating \(deg_{{G_\mathcal {O}}}(sg_{ij})\) of a simple occurrence takes \(O(t \cdot \Gamma)\), as Equation (15) requires all t intersecting simple occurrences \(\lbrace sg_{lj} \;\vert \; E_{lj} \cap E_{ij} \ne \emptyset \rbrace\), where each intersection can be computed in \(O(\Gamma)\), \(\Gamma\) being the maximum number of edges in a motif. Thus, Algorithm 4 requires \(O(t^2 \cdot \Gamma)\) for line 2 to calculate all degrees. Within the while loop (lines 4–12), the total number of degree recalculations (line 10) is bounded by \(O(t \cdot \Gamma \cdot \max _{(u,v) \in E_j} m(u,v))\), since each simple occurrence gets recalculated at most \(\Gamma \cdot \max _{(u,v) \in E_j} m(u,v)\) times. Finding the intersecting simple occurrences (line 8) takes \(O(t \cdot \Gamma)\). Overall, the time complexity of Algorithm 4 is \(O(t^2 \cdot \Gamma ^2 \cdot \max _{(u,v) \in E_j} m(u,v))\).
The space complexity of Algorithm 4 is \(O(t \cdot \Gamma + |E_j|)\), for t simple occurrences with at most \(\Gamma\) edges each, plus input edge multiplicities. Note that this is significantly smaller than the space complexity of Algorithm 3, i.e., \(O(|{V_\mathcal {O}}| + |{E_\mathcal {O}}|)\), since \(t \ll |{V_\mathcal {O}}|\) and \(|E_j| \ll |{E_\mathcal {O}}|\). Our empirical measurements on our SH dataset (see Table 2 in Section 6 for details) in Figure 2 show that \(|{V_\mathcal {O}}|\) is up to 9 orders of magnitude larger than t.
Fig. 2.
Table 2. Summary Statistics of Graph Datasets Used in Experiments

Name          #Graphs   #Node labels   #Nodes    #Multi-edges
SH             38,122        11        [2, 25]   [1, 782]
SH_Synthetic   38,122        11        [2, 25]   [1, 782]
HW             90,274        11        [2, 25]   [1, 897]
KD            152,105        10        [2, 91]   [1, 1,774]
Enron           1,139        16        [2, 87]   [1, 1,356]
5.1.3 Weighted Maximum Independent Set (WMIS).
A motivation leading to our \({MIS}\) formulation is to cover as much of each input graph as possible using motifs. The amount of coverage by an occurrence of a motif can be translated into the number of edges it contains. This suggests a weighted version of the maximum independent set problem, denoted \({WMIS}\), where we set the weight of an occurrence, i.e., node in \({G_\mathcal {O}}\), to be the number of edges in the motif it corresponds to. Hence, the goal is to maximize the total weight of the non-overlapping occurrences in the solution set.
Greedy \({WMIS}\) Solution. For the weighted version of \({MIS}\), we also have a greedy algorithm with the same approximation ratio as the unweighted one. The only difference from Algorithm 3 is the selection of node \(v_{\min }\) to remove (line 3). Let \(w_v\) denote the weight of node v in \({G_\mathcal {O}}\). Then,
Intuitively, Equation (16) prefers selecting large-weight nodes (to maximize total weight), but those that do not have too many large-weight neighbors in the \({G_\mathcal {O}}\), which cannot be selected.
Memory-efficient Greedy \({WMIS}\) Solution. Similar to the unweighted case, we can derive a memory-efficient greedy algorithm for \({WMIS}\), since Property 1 holds for both the degree \(deg_{{G_\mathcal {O}}}(v)\) and the weight \(w_v\) (the two core components of the node selection rule in Equation (16)): all occurrences that spring out of the same simple occurrence have the same number of edges.
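A sketch of the weighted variant is below. Note that the selection rule used here (minimize total neighbor weight relative to the node's own weight) is an assumed stand-in consistent with the stated intuition, not necessarily the exact form of Equation (16); the weights are invented.

```python
# Sketch of greedy WMIS. ASSUMPTION: the selection key below (neighbor weight
# sum divided by own weight) is an illustrative rule matching the intuition
# "large-weight nodes without too many large-weight neighbors"; the paper's
# exact selection rule is Equation (16).
adj = {1: {2}, 2: {1, 3}, 3: {2}}
w = {1: 3.0, 2: 4.0, 3: 3.0}

def greedy_wmis(adj, w):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # mutable local copy
    solution = set()
    while adj:
        v = min(adj, key=lambda x: sum(w[u] for u in adj[x]) / w[x])
        solution.add(v)
        for u in adj[v] | {v}:                        # delete v and N(v)
            for z in adj.pop(u):
                adj.get(z, set()).discard(u)
    return solution

print(greedy_wmis(adj, w))   # {1, 3}: total weight 6.0 beats picking node 2
```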
5.2 Step 2: Building the Motif Table
The \((W)MIS\) solutions, \(\mathcal {S}_j\)’s, provide us with non-overlapping occurrences of k-node motifs (\(k\ge 3\)) in each \(G_j\). The next task is to identify the subset of those motifs to include in our motif table \({MT}\) so as to minimize the total encoding length in Equation (1). We first define the set \(\mathcal {C}\) of candidate motifs:
We start with encoding the graphs in \(\mathcal {G}\) using the simplest code table that contains only the 2-node motifs. This code table, with optimal code lengths for database \(\mathcal {G}\), is called the Standard Code Table, denoted by \({SMT}\). It provides the optimal encoding of \(\mathcal {G}\) when nothing more is known than the frequencies of labeled edges (equal to usages of the corresponding 2-node motifs), which are assumed to be fully independent. As such, \({SMT}\) does not yield a good compression of the data but provides a practical bound.
To find a better code table, we use the best-first greedy strategy: Starting with \({MT}:={SMT}\), we try adding each of the candidate motifs in \(\mathcal {C}\) one at a time. Then, we pick the “best” one that leads to the largest reduction in the total encoding length. We repeat this process with the remaining candidates until no addition leads to a better compression or all candidates are included in the \({MT}\), in which case the algorithm terminates.
The details are given in Algorithm 5. We first calculate the usage of 2-node motifs per \(G_j\) (lines 1–4) and set up the \({SMT}\) accordingly (line 5). For each candidate motif \(g\in \mathcal {C}\) (line 7) and each \(G_j\) (line 8), we identify the occurrences of g in \(G_j\)’s cover set, \(\mathcal {O}(g,\mathcal {CS}_j)\), which are exactly the simple occurrences selected by Algorithm 4 that are isomorphic to g (line 9). When we insert g into \({MT}\), the usage of some 2-node motifs, specifically those that correspond to the labeled edges of g, decreases by \(|\mathcal {O}(g,\mathcal {CS}_j)|\), which becomes the usage of g in \(G_j\)’s encoding (lines 10–12). Note that the usages of \((k\ge 3)\)-node motifs already in the \({MT}\) are not affected, since their occurrences in each \(\mathcal {S}_j\), which we use to cover \(G_j\), do not overlap; i.e., their uses in covering a graph are independent. As such, updating usages when we insert a new motif into \({MT}\) is quite efficient. Having inserted g and updated the usages, we remove 2-node motifs whose usage drops to zero from \({MT}\) (line 13) and compute the total encoding length with the resulting \({MT}\) (lines 14–15). The rest of the algorithm (lines 16–20) picks the “best” g to insert, i.e., the one that leads to the largest savings, if any, or otherwise quits.
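The best-first loop of Algorithm 5 can be sketched with a Krimp-style usage model (a simplified sketch: it scores only the data encoding with Shannon-optimal code lengths and ignores the cost of describing the motif table itself; all names are illustrative):

```python
import math

def total_bits(usage):
    """Shannon-optimal total data encoding length (bits) for given usages."""
    tot = sum(usage.values())
    return sum(-u * math.log2(u / tot) for u in usage.values() if u > 0)

def build_motif_table(smt_usage, candidates):
    """Best-first greedy motif-table construction (simplified sketch).

    smt_usage:  dict labeled-edge -> usage (the Standard Motif Table)
    candidates: dict motif -> (count, edges), where count is the motif's
                total non-overlapping occurrences and edges lists the
                labeled edges each occurrence replaces.
    """
    usage, table, remaining = dict(smt_usage), [], dict(candidates)
    while remaining:
        best, best_bits = None, total_bits(usage)
        for g, (count, edges) in remaining.items():
            trial = dict(usage)
            trial[g] = count
            for e in edges:              # each occurrence of g replaces
                trial[e] -= count        # one copy of each labeled edge
            trial = {m: u for m, u in trial.items() if u > 0}
            bits = total_bits(trial)
            if bits < best_bits:         # keep the largest saving
                best, best_bits, best_usage = g, bits, trial
        if best is None:
            break                        # no candidate improves compression
        table.append(best)
        usage = best_usage
        del remaining[best]
    return table, usage
```

A motif that replaces many frequent labeled-edge pairs concentrates the usage distribution and so shortens the total encoding, which is exactly when the greedy loop admits it.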
6 Experiments
Datasets. Our work is initiated by a collaboration with industry, and CODEtect is evaluated on large real-world datasets containing all transactions of 2016 (tens to hundreds of thousands of transaction graphs) from three different companies, anonymized as SH, HW, and KD (proprietary) and summarized in Table 2. These do not come with ground-truth anomalies. For quantitative evaluation, our expert collaborators inject two types of anomalies into each dataset based on domain knowledge (Section 6.1), and also qualitatively verify the detected anomalies from an accounting perspective (Section 6.2).
Since our transaction data are not shareable, to facilitate reproducibility of the results, we generate synthetic data that resemble the real-world datasets used. In particular, we add SH_Synthetic, which contains random graphs generated based on statistical characteristics of SH as follows: (1) The number of nodes in a graph is randomly drawn following its distribution in SH, depicted in Figure 3(a); (2) The number of single edges in a graph is drawn randomly following the distribution in SH illustrated in Figure 3(b) (if the graph has more edges than the maximum number, \(n\times (n-1)/2\), we restart the process); (3) For each node, we randomly assign its label following the label distribution in SH, as shown in Figure 3(c); (4) For each single edge, we then randomly generate its multiplicity based on the distribution of the corresponding node-label pair (an example is given in Figure 3(d)). In the end, we have the same number of graphs with characteristics similar to SH, and we share these data along with our code.
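Steps (1)–(4) can be sketched as follows (a hypothetical sketch, assuming the empirical SH distributions have been tabulated into the arguments shown; none of these names come from the CODEtect code):

```python
import random

def sample_synthetic_graph(n_dist, e_dist, label_dist, mult_dist):
    """Generate one synthetic graph following the four steps above.

    n_dist / e_dist: (values, weights) empirical distributions for the
        number of nodes and number of single edges;
    label_dist: dict node label -> probability;
    mult_dist: dict (src_label, dst_label) -> (values, weights)
        distribution of edge multiplicities.
    """
    # (1) #nodes and (2) #single-edges; restart if infeasible
    while True:
        n = random.choices(*n_dist)[0]
        m = random.choices(*e_dist)[0]
        if m <= n * (n - 1) // 2:        # the cap used in the text
            break
    # (3) a label per node, following the empirical label distribution
    labels = random.choices(list(label_dist),
                            weights=list(label_dist.values()), k=n)
    # place m distinct directed single edges uniformly at random
    pairs = [(u, v) for u in range(n) for v in range(n) if u != v]
    edges = random.sample(pairs, m)
    # (4) multiplicity per edge, conditioned on the endpoint label pair
    graph = {}
    for u, v in edges:
        vals, wts = mult_dist[(labels[u], labels[v])]
        graph[(u, v)] = random.choices(vals, weights=wts)[0]
    return labels, graph
```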
Fig. 3.
We also study the public Enron database [20], consisting of daily e-mail graphs of its 151 employees over the 3 years surrounding the financial scandal. Nodes depict employee e-mail addresses and edges indicate e-mail exchanges. Each node is labeled with the employee’s department (Energy Operations, Regulatory and Government Affairs, etc.) and edge multiplicity denotes the number of e-mails exchanged.
6.1 Anomaly Detection Performance
We show that CODEtect is substantially better at detecting graph anomalies than a list of baselines across various performance measures. The anomalies are injected by domain experts and mimic schemes related to money laundering, entry errors, or malfeasance in accounting, specifically:
–
Path injection (money-laundering-like): (i) Delete a random edge \((u,v) \in E_j\), and (ii) add a length-2 or length-3 path u–w(–z)–v where at least one edge of the path is rare (i.e., exists in only a few \(G_j\)’s). The scheme mimics money laundering, where money is transferred through multiple hops rather than directly from the source to the target.
–
Label injection (entry-error or malfeasance): (i) Pick a random node \(u \in V_j\), and (ii) Replace its label \(t(u)\) with a random label \(t \ne t(u)\). This scheme mimics either simply an entry error (wrong account), or malfeasance that aims to reflect larger balance on certain types of account (e.g., revenue) in order to deceive stakeholders.
For path injection, we choose 3% of graphs and inject anomalous paths that replace 10% of edges (or 1 edge if 10% of edges is less than 1). For label injection, we also choose 3% of graphs and label-perturb 10% of the nodes (or 1 node if 10% of nodes is less than 1). We also tested with different severity levels of injection, i.e., 30% and 50% of edges or nodes, and observed similar results to those with 10%. The goal is to detect those graphs with injected paths or labels.
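For illustration, the two injection schemes can be mimicked as below (a simplified sketch on a single graph; the real injections additionally force at least one rare labeled edge on the new path, which this sketch does not model):

```python
import random

def inject_path_anomaly(edges, nodes):
    """Path injection: remove a random directed edge (u, v) and replace
    it with a length-2 or length-3 path u -> w (-> z) -> v."""
    u, v = random.choice(sorted(edges))
    edges.discard((u, v))
    hops = random.sample([w for w in nodes if w not in (u, v)],
                         k=random.choice([1, 2]))   # 1 or 2 intermediate hops
    path = [u] + hops + [v]
    edges.update(zip(path, path[1:]))               # add the detour edges
    return edges

def inject_label_anomaly(labels, label_universe):
    """Label injection: replace a random node's label with a different one."""
    u = random.choice(sorted(labels))
    labels[u] = random.choice([t for t in label_universe if t != labels[u]])
    return labels
```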
Baselines: We compare CODEtect with:
–
SMT: A simplified version of CODEtect that uses the Standard Motif Table to encode the graphs.
–
GLocalKD [26]: A recent graph neural network (GNN)-based approach that leverages knowledge distillation during training and was shown to achieve good performance and robustness in semi-supervised and unsupervised settings. We used the default setting of 3 layers, 512 hidden dimensions, and 256 output dimensions, as recommended in the original article [26]. Independently, we also performed sensitivity analysis by varying the hyper-parameter settings and found that the default setting returned the top performance.
–
Gbad [12]: The closest existing approach for anomaly detection in node-labeled graph databases (See Section 2). Since it cannot handle multi-edges, we input the \(G_j\)’s as simple graphs setting all the edge multiplicities to 1.
–
Graph Embedding + iForest: We pair different graph representation learning approaches with state-of-the-art outlier detector iForest [25], as they cannot directly detect anomalies. We consider the following combinations:
–
Graph2Vec [31] (G2V)+iForest: G2V cannot handle edge multiplicities; thus, we set all to 1.
–
GF+iForest: Graph (numerical) features (GF) include number of edges of each label-pair and number of nodes of each label.
–
Entropy quantifies the skewness of the distribution of the (non-zero) number of edges over all possible label pairs and uses it as the anomaly score. A smaller entropy implies the existence of rare label-pairs and hence higher anomalousness.
–
Multiedges uses the sum of edge multiplicities as the anomaly score. We also tried other simple statistics, e.g., #nodes, #edges, and their sum/product, which do not perform well.
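These two scores can be sketched directly (a minimal sketch; the function names are ours, not from the baselines' implementations):

```python
import math
from collections import Counter

def entropy_score(edge_label_pairs):
    """Entropy baseline: entropy of the edge-count distribution over the
    label pairs that occur; lower entropy means rarer label pairs, hence
    higher anomalousness, so the anomaly score is the negated entropy."""
    counts = Counter(edge_label_pairs)   # one (src_label, dst_label) per edge
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return -h

def multiedges_score(multiplicities):
    """Multiedges baseline: total edge multiplicity as the anomaly score."""
    return sum(multiplicities)
```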
Performance measures: Based on the ranking of graphs by anomaly score, we measure Precision at top-k for \(k=\lbrace 10,100,1000\rbrace\), and also report the Area Under the ROC Curve (AUC) and Average Precision (AP), i.e., the area under the precision-recall curve. Since most of the methods, including CODEtect, are deterministic, we run them once and report all the measures. For Graph2Vec and Deep Graph Kernel, which involve some randomization, we run multiple times to check the consistency of performance. Additionally, for methods with hyper-parameters, we use the default settings from the corresponding publicly available source codes.
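Both ranking measures can be computed without any library as a sanity check (a plain sketch; scores and binary anomaly labels are parallel sequences):

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring graphs that are true anomalies."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    return sum(label for _, label in ranked[:k]) / k

def auc(scores, labels):
    """Area under ROC via the rank statistic: the probability that a
    random anomaly outscores a random normal graph (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```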
6.1.1 Detection of Path Anomalies.
We report detection results on SH and HW datasets in Table 3(a) and (b) (performance on KD is similar and omitted for brevity) and on SH_Synthetic in Table 3(c).
Table 3.
CODEtect consistently outperforms all baselines by a large margin across all performance measures in detecting path anomalies. More specifically, CODEtect provides a 16.9% improvement over the runner-up (underlined) on average across all measures on SH, and 10.2% on HW. Note that the runner-up is not the same baseline across different performance measures. The benefit of motif search is evident from the superior results over SMT. G2V+iForest produces decent performance w.r.t. most measures but still falls far behind CODEtect. Similar observations hold on SH_Synthetic.
6.1.2 Detection of Label Anomalies.
Table 4(a) and (b) report detection results on the two larger datasets, HW and KD (performance on SH is similar and omitted for brevity), and Table 4(c) provides results on SH_Synthetic. Note that Gbad and G2V+iForest failed to complete within five days on KD; thus, their results are absent in Table 4(b). KD is a relatively large-scale dataset, having both larger and more graphs than the other datasets.
Table 4.
In general, we observe a similar performance advantage of CODEtect over the baselines for label anomalies. The exceptions are Gbad and GLocalKD, which perform comparably and appear better suited to label anomalies, potentially because changing node labels disrupts structure more than the addition of a few short isolated paths. Gbad, however, does not scale to KD, and the runner-up on this dataset performs significantly worse. Similar observations are also seen on the SH_Synthetic dataset.
6.2 Case Studies
Case 1—Anomalous transaction records: The original accounting databases we are provided with by our industry partner do not contain any ground-truth labels. Nevertheless, they raise the question of whether CODEtect unearths any dubious journal entries that would raise an eyebrow from an economic bookkeeping perspective. In collaboration with their accounting experts, we analyze the top 20 cases as ranked by CODEtect. Due to space limits, we elaborate on one case study from each dataset/corporation as follows:
In SH, we detect a graph with a large encoding length yet relatively few (27) multi-edges, as shown in Figure 5, consisting of several small disconnected components. In accounting terms, the transaction is extremely complicated, likely the result of a (quite rare) “business restructuring” event. In this single journal entry, there exist many independent simple entries, involving only one or two operating-expense (OE) accounts, while other edges arise from compound entries (involving more than three accounts). This event involves reversals (back to prepaid expenses) as well as re-classification of previously booked expenses. The fact that all these bookings are recorded within a single entry leaves room for manipulation of economic performance and mis-reporting via re-classification, which deserves an audit for careful re-examination.
In Figure 4 (left), we show a motif with a sole usage of 1 in the dataset, which is used to cover an anomalous graph (right) in HW. The edge from NGL (non-operating gains & losses) to C (cash) depicts an unrealized foreign exchange gain and is quite unusual from an economic bookkeeping perspective. This is because, by definition, unrealized gains and losses do not involve cash. Therefore, proper booking of the creation or relinquishment of such gains or losses should not involve cash accounts. Another peculiarity is the three separate disconnected components, each of which represents a very distinct economic transaction: one on a bank charge related to a security deposit, one on health-care and travel-related foreign-currency business expenses (these two are short-term activities), and a third one on some on-going construction (long-term in nature). It is questionable why these diverse transactions are grouped into a single journal. Finally, the on-going construction portion involves reclassifying a long-term asset into a suspense account, which requires follow-up attention and final resolution.
Fig. 4.
Fig. 5.
Finally, the anomalous journal entry from KD involves the motif shown in Figure 6 (left), where the corresponding graph is the exact motif with multiplicity 1, shown on the right. This motif has a sole usage of 1 in the dataset and is odd from an accounting perspective. Economically, it represents giving up an existing machine, which is a long-term operating asset (LOA), in order to reduce a payable or an outstanding short-term operating liability (SOL) owed to a vendor. Typically, one would sell the machine and get cash to pay off the vendor, with some gains or losses. We also note that the \({MT}\) does not contain the 2-node motif LOA\(\rightarrow\)SOL. The fact that it only shows up once, within a single-usage motif, makes it suspicious.
Fig. 6.
Besides the quantitative evidence on detection performance, these case studies provide qualitative support for the effectiveness of CODEtect in identifying anomalies of interest in accounting domain terms, worthy of auditing and re-examination.
Case 2 - Enron scandal: We study the correlation between CODEtect’s anomaly scores of the daily e-mail exchange graphs and the major events in Enron’s timeline. Figure 7 shows that days with large anomaly scores mark drastic discontinuities in time, which coincide with important events related to the financial scandal.3 It is also noteworthy that the anomaly scores follow an increasing trend over days, capturing the escalation of events up to key personnel testifying in front of Congressional committees.
Fig. 7.
6.3 Scalability
To showcase the scalability of CODEtect with regard to running time and memory consumption, we randomly select subsets of graphs from the KD database with different sizes, i.e., \(\lbrace 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100\rbrace \,\times \, 10^3\). We re-sample each subset of graphs three times and report the averaged result for each setting. A summary of the results is presented in Figure 8. We observe that CODEtect scales linearly with the size of the input graphs, as measured in the number of multi-edges, with respect to both time and memory usage.
Fig. 8.
7 Conclusion
We introduced CODEtect, (to our knowledge) the first graph-level anomaly detection method for node-labeled, directed multi-graph databases, which appear in numerous real-world settings such as social networks and financial transactions, to name a few. The main idea is to identify key network motifs that encode the database concisely and to employ the compression length as the anomaly score. To this end, we presented (1) novel lossless encoding schemes and (2) efficient search algorithms. Experiments on transaction databases from three different corporations quantitatively showed that CODEtect significantly outperforms both prior and more recent GNN-based baselines across datasets and performance metrics. Case studies, including on the Enron database, presented qualitative evidence of CODEtect’s effectiveness in spotting instances that are worthy of auditing and re-examination.
Footnotes
1
To ensure unique decoding, we assume prefix code(word)s, in which no code is the prefix of another.
1
MDL-optimal cost of integer k is \(L_{\mathbb {N}}(k) = \log ^{\star }k + \log _2 c\); \(c \approx 2.865\); \(\log ^{\star }k = \log _2 k + \log _2(\log _2 k) + \ldots\) summing only the positive terms [40].
2
Given \(sg_{ij}\), the number of its degree-equivalent occurrences in \(G_j\) is the product of edge multiplicities \(m(u,v)\) for \((u,v)\in E_{ij}\).
Leman Akoglu, Mary McGlohon, and Christos Faloutsos. 2010. Oddball: Spotting anomalies in weighted graphs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 410–421.
Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph based anomaly detection and description: A survey. Data Mining and Knowledge Discovery 29, 3 (2015), 626–688.
Leman Akoglu, Hanghang Tong, Jilles Vreeken, and Christos Faloutsos. 2012. Fast and reliable anomaly detection in categorical data. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 415–424.
Mohammad Al Hasan and Vachik S. Dave. 2018. Triangle counting in large networks: A review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 2 (2018), e1226.
Sambaran Bandyopadhyay, Saley Vishal Vivek, and M. N. Murty. 2020. Outlier resistant unsupervised deep architectures for attributed network embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining. 25–33.
Peter Bloem and Steven de Rooij. 2020. Large-scale network motif analysis using compression. Data Mining and Knowledge Discovery 34, 5 (2020), 1421–1453.
Kaize Ding, Jundong Li, Rohit Bhanushali, and Huan Liu. 2019. Deep anomaly detection on attributed networks. In Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 594–602.
Ethan R. Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, and Alexandros G. Dimakis. 2015. Beyond triangles: A distributed framework for estimating 3-profiles of large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 229–238.
Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra. 2018. SpotLight: Detecting anomalies in streaming graphs. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1378–1386.
Yujie Fan, Shifu Hou, Yiming Zhang, Yanfang Ye, and Melih Abdulhayoglu. 2018. Gotcha-sly malware! scorpion a metagraph2vec based malware detection system. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 253–262.
Aditya Grover and Jure Leskovec. 2016. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
Magnús M. Halldórsson and Jaikumar Radhakrishnan. 1997. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. Algorithmica 18, 1 (1997), 145–163.
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1024–1034.
Bryan Hooi, Kijung Shin, Hyun Ah Song, Alex Beutel, Neil Shah, and Christos Faloutsos. 2017. Graph-based fraud detection in the face of camouflage. TKDD 11, 4 (2017), 1–26.
Danai Koutra, U. Kang, Jilles Vreeken, and Christos Faloutsos. 2014. VOG: Summarizing and understanding large graphs. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 91–99.
Lauri Kovanen, Márton Karsai, Kimmo Kaski, János Kertész, and Jari Saramäki. 2011. Temporal motifs in time-dependent networks. Journal of Statistical Mechanics: Theory and Experiment 2011, 11 (2011), P11005.
Rongrong Ma, Guansong Pang, Ling Chen, and Anton van den Hengel. 2022. Deep graph-level anomaly detection by glocal knowledge distillation. In Proceedings of the 15th ACM International Conference on Web Search and Data Mining. 704–714.
Xiaoxiao Ma, Jia Wu, Shan Xue, Jian Yang, Chuan Zhou, Quan Z. Sheng, Hui Xiong, and Leman Akoglu. 2021. A comprehensive survey on graph anomaly detection with deep learning. IEEE Transactions on Knowledge and Data Engineering.
Emaad A. Manzoor, Sadegh M. Milajerdi, and Leman Akoglu. 2016. Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1035–1044.
Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon. 2004. Superfamilies of evolved and designed networks. Science 303, 5663 (2004), 1538–1542.
Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: Simple building blocks of complex networks. Science 298, 5594 (2002), 824–827.
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 419–432.
Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
Caleb C. Noble and Diane J. Cook. 2003. Graph-based anomaly detection. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 631–636.
Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. 2017. Motifs in temporal networks. In Proceedings of the 10th ACM International Conference on Web Search and data Mining. ACM, 601–610.
Ramesh Paudel and William Eberle. 2020. SNAPSKETCH: Graph representation approach for intrusion detection in a streaming graph. In MLG 2020: 16th International Workshop on Mining and Learning with Graphs. ACM.
Bryan Perozzi and Leman Akoglu. 2016. Scalable anomaly ranking of attributed neighborhoods. In Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 207–215.
Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sánchez, and Emmanuel Müller. 2014. Focused clustering and outlier detection in large attributed graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1346–1355.
Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. 2018. Butterfly counting in bipartite networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Yike Guo and Faisal Farooq (Eds.). ACM, 2150–2159.
Comandur Seshadhri and Srikanta Tirthapura. 2019. Scalable subgraph counting: The methods behind the madness. In Companion Proceedings of the 2019 World Wide Web Conference (WWW ’19). Association for Computing Machinery, New York, NY, USA, 1317–1318.
Neil Shah, Danai Koutra, Tianmin Zou, Brian Gallagher, and Christos Faloutsos. 2015. TimeCrunch: Interpretable dynamic graph summarization. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1055–1064.
Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In Proceedings of the Artificial Intelligence and Statistics. 488–495.
Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos. 2016. Corescope: Graph mining using k-core analysis–patterns, anomalies and algorithms. In 2016 IEEE 16th International Conference on Data Mining. IEEE, 469–478.
Koen Smets and Jilles Vreeken. 2011. The odd one out: Identifying and characterising anomalies. In Proceedings of the 2011 SIAM International Conference on Data Mining. SIAM/Omnipress, 804–815.
Nikolaj Tatti and Jilles Vreeken. 2008. Finding good itemsets by packing data. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE Computer Society, 588–597.
Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. 2008. Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 567–580.
Johan Ugander, Lars Backstrom, and Jon M. Kleinberg. 2013. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd international conference on World Wide Web, Daniel Schwabe, Virgílio A. F. Almeida, Hartmut Glaser, Ricardo A. Baeza-Yates, and Sue B. Moon (Eds.). ACM, 1307–1318.
A. Vázquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z. N. Oltvai, and A.-L. Barabási. 2004. The topological relationship between the large-scale attributes and local interaction patterns of complex networks. PNAS 101, 52 (2004), 17940–17945.
Jilles Vreeken, Matthijs van Leeuwen, and Arno Siebes. 2011. Krimp: Mining itemsets that compress. Data Mining and Knowledge Discovery 23, 1 (2011), 169–214.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations.
Pinar Yanardag and S. V. N. Vishwanathan. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1365–1374.
Lingxiao Zhao and Leman Akoglu. 2021. On using classification datasets to evaluate graph outlier detection: Peculiar observations and new insights. Big Data (2021).
Tong Zhao, Chuchen Deng, Kaifeng Yu, Tianwen Jiang, Daheng Wang, and Meng Jiang. 2020. Error-bounded graph anomaly loss for GNNs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1873–1882.