Within a large database 𝒢 containing graphs with labeled nodes and directed multi-edges, how can we detect the anomalous graphs? Most existing work is designed for plain (unlabeled) and/or simple (unweighted) graphs. We introduce CODEtect, the first approach that addresses the anomaly detection task for graph databases of such complex nature. To this end, it identifies a small representative set 𝒮 of structural patterns (i.e., node-labeled network motifs) that losslessly compress database 𝒢 as concisely as possible. Graphs that do not compress well are flagged as anomalous. CODEtect exhibits two novel building blocks: (i) a motif-based lossless graph encoding scheme, and (ii) fast, memory-efficient search algorithms for 𝒮. We show the effectiveness of CODEtect on transaction graph databases from three different corporations and on statistically similar synthetic datasets, where existing baselines adjusted for the task fall behind significantly across different types of anomalies and performance metrics.
1 Introduction
Given hundreds of thousands of annual transaction records of a corporation, how can we identify the abnormal ones, which may indicate entry errors or employee misconduct? How can we spot anomalous daily e-mail/call interactions or software programs with bugs?
We introduce a novel anomaly detection technique called CODEtect for node-Labeled, Directed, Multi-graph (LDM) databases, which emerge in many applications. Our motivating domain is accounting, where each transaction record is represented by a graph in which the nodes are accounts and directed edges reflect transactions. Account types (revenue, expense, etc.) are depicted by (discrete) labels, and separate transactions between the same pair of accounts create edge multiplicities. The problem is then to identify the anomalous graphs within LDM graph databases. In these abstract terms, CODEtect applies more broadly to other domains exhibiting graph data of such complex nature, e.g., detecting anomalous employee e-mail graphs with job titles as labels, call graphs with geo-tags as labels, control-flow graphs with function calls as labels, and so on. What is more, CODEtect can handle simpler settings with any subset of the LDM properties.
Graph anomaly detection has been studied under various non-LDM settings. Most of these works focus on detecting anomalies within a single graph, which is either plain (i.e., unlabeled), attributed (nodes exhibiting an array of, often continuous, features), or dynamic (the graph changes over time) [1, 2, 14, 20, 28, 38, 39, 46] (see Table 1 for an overview). None of these applies to our setting, as we aim to detect graph-level anomalies within a graph database. There exists related work for node-labeled graph databases [34], which, however, does not handle multi-edges and, as we show in the experiments (Section 6), cannot tackle the problem well.
Table 1. Comparison with Popular Approaches to Graph Anomaly Detection, in Terms of Distinguishing Properties
Recently, general-purpose embedding/representation learning techniques have achieved state-of-the-art results in graph classification tasks [15, 16, 19, 31, 33, 37]. However, they do not tackle the anomaly detection problem per se: the embeddings need to be fed into an off-the-shelf vector outlier detector. Moreover, most embedding methods [15, 16, 19] produce node embeddings, and how to use those for graph-level anomalies is unclear. Trivially aggregating node representations to obtain an entire-graph representation, e.g., by mean or max pooling, provides suboptimal results [31]. Graph embedding techniques [31, 37] as well as graph kernels [45, 54] (paired with a state-of-the-art detector) yield poor performance, as we show through experiments (Section 6), possibly because embeddings capture general patterns and leave out rare structures, which are critical for anomaly detection.
Our main contributions are summarized in the following:
–
Problem Formulation: Motivated by applications in business accounting, we consider the anomaly detection problem in LDM databases and propose CODEtect, (to our knowledge) the first method to detect anomalous graphs of such complex nature (Section 2). CODEtect also applies more generally to simpler, non-LDM settings. The main idea is to identify a few representative network motifs that are used to encode the database in a lossless fashion as succinctly as possible. CODEtect then flags those graphs that do not compress well under this encoding as anomalous (Section 3).
–
New Encoding & Search Algorithms: The graph encoding problem is two-fold: how to encode and which motifs to encode with. To this end, we introduce (1) new lossless motif and graph encoding schemes (Section 4), and (2) efficient search algorithms for identifying key motifs with a goal to minimize the total encoding cost (Section 5).
–
Real-world Application: In collaboration with industry, we apply our proposed techniques to annual transaction records from three different corporations, from small- to large-scale. We show the superior performance of CODEtect over existing baselines in detecting injected anomalies that mimic certain known malicious schemes in accounting. Case studies on those as well as the public Enron e-mail database further show the effectiveness of CODEtect in spotting noteworthy instances (Section 6). To facilitate reproducibility, we also confirm our performance advantages on statistically similar datasets resembling our real-world databases.
2 Related Work
Graph Anomaly Detection: Graph anomaly detection has been studied under various settings for plain/attributed, static/dynamic, etc. graphs, including the most recent deep learning based approaches [1, 11, 14, 20, 38, 39, 46, 57] (see [2] and [27] for surveys). These works focus on detecting node/edge/subgraph anomalies within a single graph, none of which applies to our setting, as we aim to detect anomalous graphs (or graph-level anomalies) within a graph database.
On anomalous graph detection in graph databases, Gbad [12] has been applied to flag graphs as anomalous if they experience low compression via the substructures discovered over its iterations. Further, it has been used to identify graphs that contain substructures \(S^{\prime }\) with small differences (a few modifications, insertions, or deletions) from the best one S, which are attributed to malicious behavior [12]. Gbad also has very high time complexity due to the nested searches for substructures to compress the graphs over many iterations (it failed to complete on multiple cases in our experiments; see Section 6). Our work is along the same lines as these works in principle; however, our encoding scheme is lossless. Moreover, these works cannot handle graphs with weighted/multi-edges. There exist other graph anomaly detection approaches [14, 28]; however, none of them simultaneously handles node-labeled graphs with multi-edges. SnapSketch [36] was recently introduced as an unsupervised graph representation approach for intrusion detection in a graph stream and showed better detection than previous works [14, 28]; however, SnapSketch was originally designed for undirected graphs. Note also that these works [14, 28, 36] focus on graph streams, i.e., time-ordered graphs, and may not work well in our setting of unordered graph databases. We present a qualitative comparison of related work to CODEtect in Table 1.
Graph Embedding for Anomaly Detection: Recent graph embedding methods [11, 15, 16, 19, 31, 33, 37, 57] and graph kernels [45, 54] find a latent representation of a node, a subgraph, or the entire graph and have been shown to perform well on classification and link prediction tasks. However, graph embedding approaches like [15, 16, 19] learn node representations, which are difficult to use directly for detecting anomalous graphs. Peng et al. [37] propose a graph convolutional neural network via motif-based attention; however, this is a supervised method and, thus, not suitable for anomaly detection. Our experimental results show that other recent graph embedding [31] and graph kernel [54] methods that produce a direct graph representation, when combined with a state-of-the-art anomaly detector, have low performance and are far less accurate than CODEtect. Concurrent to our work, graph neural networks for anomalous graph detection are studied in [26, 56], which examine end-to-end graph anomaly detection. Furthermore, [6] investigates outlier-resistant architectures for graph embedding. A key challenge, in general, for deep learning-based models for unsupervised anomaly detection is their sensitivity to many hyper-parameter settings (including those for regularization, such as weight decay and drop-out rate; optimization, such as learning rate; and architecture, such as depth and width), which are not straightforward to set in the absence of any ground-truth labels. Distinctly, our work leverages the Minimum Description Length principle and does not exhibit any hyper-parameters.
Graph Motifs: Network motifs have proven useful in understanding the functional units and organization of complex systems [7, 30, 51]. Motifs have also been used as features for network classification [29], community detection [5, 55], and in graph kernels for graph comparison [45]. On the algorithmic side, several works have designed fast techniques for identifying significant motifs [8, 9, 21, 22], where a sub-graph is regarded as a motif only if its frequency is higher than expected under a network null model.
Prior works on network motifs mainly focus on 3- or 4-node motifs in undirected unlabeled/plain graphs [4, 13, 42, 50], either using subgraph frequencies in the analysis of complex networks or, most often, developing fast algorithms for counting (e.g., triangles) (see [43] for a recent survey). Others have also studied directed [7] and temporal motifs [24, 35]. Most relatedly, there is recent work on node-labeled subgraphs referred to as heterogeneous network motifs [41], where again the focus is on scalable counting. Our work differs in using heterogeneous motifs as building blocks of a graph encoding scheme, toward the goal of anomaly detection.
Data Compression via MDL-Encoding: The Minimum Description Length (MDL) principle by Rissanen [40] states that the best theory to describe data is the one that minimizes the sum of the size of the theory and the size of the description of the data using the theory. The use of MDL has a long history in itemset mining [48, 52] for transaction (tabular) data, also applied to anomaly detection [3, 47].
MDL has also been used for graph compression. Given a pre-specified list of well-defined structures (star, clique, etc.), it is employed to find a succinct description of a graph in those “vocabulary” terms [23]. This vocabulary is later extended for dynamic graphs [44]. A graph is also compressed hierarchically, by sequentially aggregating sets of nodes into super-nodes, where the best summary and associated corrections are found with the help of MDL [32].
There exists some work on attributed graph compression [49], but the goal is to find super-nodes that represent sets of nodes that are homogeneous in some (user-specified) attributes. Subdue [34] is one of the earliest works to employ MDL for substructure discovery in node-labeled graphs. The aim is to extract the “best” substructure S such that its encoding plus the encoding of the graph after replacing each instance of S with a (super-)node is as small as possible.
3 Preliminaries & The Problem
As input, a large set of J graphs \(\mathcal {G}= \lbrace G_1, \ldots , G_J\rbrace\) is given. Each graph \(G_j=(V_j, E_j, \tau)\) is a directed, node-labeled, and multi-graph, which may contain multiple edges that have the same end nodes. \(\tau : V_j \rightarrow \mathcal {T}\) is a function that assigns labels from an alphabet \(\mathcal {T}\) to nodes in each graph. The number of realizations of an edge \((u,v)\in E_j\) is called its multiplicity, denoted \(m(u,v)\). (See Figure 1(a), for example.)
Fig. 1.
Our motivating domain is business accounting, in which each \(G_j\) corresponds to a graph representation of what-is-called a “journal entry”: a detailed transaction record. Nodes capture the unique accounts associated with the record, directed edges the transactions between these accounts, and node labels the financial statement (FS) account types (e.g., assets, liabilities, revenue). Bookkeeping data are kept as a chronological listing (called General Ledger) of each separate business transaction, where multiple transactions involving same account-pairs generate multi-edges between two nodes.
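To make the data model concrete, the following minimal Python sketch stores one journal-entry graph with node labels \(\tau\) and edge multiplicities \(m(u,v)\). The account names, labels, and transactions are invented for illustration and do not come from any of the paper's datasets.

```python
from collections import Counter

# A minimal sketch of one LDM graph G_j = (V_j, E_j, tau): node labels drawn
# from an alphabet T, and directed multi-edges stored as multiplicities m(u, v).
class LDMGraph:
    """Node-labeled, directed multi-graph: labels tau(v) and multiplicities m(u, v)."""

    def __init__(self):
        self.tau = {}        # node -> label in alphabet T
        self.m = Counter()   # (u, v) -> edge multiplicity

    def add_node(self, v, label):
        self.tau[v] = label

    def add_edge(self, u, v):
        self.m[(u, v)] += 1  # repeated transactions raise multiplicity

# One "journal entry" graph: accounts as nodes, FS account types as labels.
g = LDMGraph()
g.add_node("cash", "assets")
g.add_node("sales", "revenue")
g.add_node("rent", "expense")
g.add_edge("sales", "cash")
g.add_edge("sales", "cash")          # second transaction between same pair
g.add_edge("cash", "rent")

print(g.m[("sales", "cash")])        # multiplicity m(u, v) = 2
```

A database \(\mathcal{G}\) is then simply a collection of such graphs, one per journal entry.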
Our high-level idea for finding anomalous graphs in database \(\mathcal {G}\) is to identify key characteristic patterns of the data that “explain” or compress the data well, and flag those graphs that do not exhibit such patterns as expected—simply put, graphs that do not compress well are anomalous. More specifically, graph patterns are substructures or subgraphs, called motifs, which occur frequently within the input graphs. “Explaining” the data means encoding each graph using the frequent motifs it contains. The more frequent motifs we use for encoding, the more we can compress the data, simply by encoding the existence of each such motif with a short code.
The goal is to find a (small) set of motifs that compresses the data the best. Building on the MDL principle [17], we aim at finding a model, namely, a motif table (denoted \({MT}\)) that contains a carefully selected subset of graph motifs, such that the total code length of (1) the model itself plus (2) the encoding of the data using the model is as small as possible. In other words, we are after a small model that compresses the data the most. The two-part objective of minimizing the total code length is given as follows:
\[ {MT}^{*} = \mathop{\arg \min }_{{MT}\,\in\, \mathcal {MT}} \; L({MT}) + L(\mathcal {G} \mid {MT}), \qquad (1) \]
where \(\mathcal {MT}\) denotes the set of all possible candidate motif tables. The first term can be seen as a model regularizer that penalizes using an unnecessarily large set of motifs to explain the data. The second term is the compression length of the data with the (selected) motifs and decomposes as \(L(\mathcal {G} | {MT}) = \sum _j L({G_j} | {MT})\), since individual journals are independent. The encoding length \(L({G_j} | {MT})\) is also the anomaly score for the jth graph: the larger, the more anomalous.
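Once a motif table is fixed, scoring reduces to ranking graphs by their encoding lengths. A minimal sketch with made-up per-graph lengths:

```python
# Sketch: per-graph encoding lengths L(G_j | MT) double as anomaly scores,
# so ranking graphs by descending length surfaces the most anomalous ones.
# The length values below are illustrative placeholders, not computed ones.
lengths = {"G1": 120.5, "G2": 98.0, "G3": 310.2, "G4": 101.7}
ranking = sorted(lengths, key=lengths.get, reverse=True)
print(ranking)   # most anomalous graph first
```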
As such, we have a combinatorial subset selection problem toward optimizing Equation (1). To this end, we address two subproblems outlined below.
4 Encoding Schemes
MDL-encoding of a dataset with a model can be thought of as involving a Sender and a Receiver communicating over a channel, where the Sender—who has sole knowledge of the data—generates a bitstring based on which the Receiver can reconstruct the original data on their end losslessly. To this end, the Sender first sends over the model, in our case the motif table \({MT}\), which can be thought of as a “code-book” that establishes a certain “language” between the Sender and the Receiver. The Sender then encodes the data instances using the code-words in the “code-book”; in our case, the graphs \(G_j\) using the motifs in the \({MT}\).
It is important to note that existing works [9, 10] presented different graph/substructure encoding schemes; however, each has its own limitations and does not work in our setting. The encoding in [10] is lossy and may not reflect accurate code lengths, while the work in [9] is restricted to simple graphs with non-overlapping nodes across motif occurrences. Our encoding algorithm is both lossless and applicable to multi-graphs with node-overlapping occurrences of motifs.
4.1 Encoding the Motif Table
The motif table \({MT}\) is simply a two-column translation table that has motifs in the first column, and a unique code-word corresponding to each motif in the second column, as illustrated in Figure 1(b). We use \(\mathcal {M}\) to refer to the set of motifs in \({MT}\). A motif, denoted by (lower-case) g, is a connected, directed, node-labeled, simple graph, with possible self-loops on the nodes. For \(g\in \mathcal {M}\), \(code_{{MT}}(g)\) (or c for short) denotes its code-word.1
To encode the motif table, the Sender encodes each individual motif \(g_i\) in \({MT}\) and also sends over the code-word \(c_i\) that corresponds to \(g_i\). Afterwards, for encoding a graph with the \({MT}\), every motif that finds an occurrence in the graph is simply communicated through its unique code-word only.
The specific materialization of the code-words (i.e., the bitstrings themselves) is not as important to us as their lengths, which affect the total graph encoding length. Each code length \(L(c_i)\) depends on the number of times that \(g_i\) is used in the encoding of the graphs in \(\mathcal {G}\), denoted \(usage_{\mathcal {G}}(g_i)\)—intuitively, the more frequently a motif is used, the shorter its code-word, so as to achieve compression (analogous to compressing text by assigning frequent words a short code-word). Formally, the optimal prefix code length for \(g_i\) can be calculated through the Shannon entropy [40]:
\[ L(c_i) = -\log _2 \left(\frac{usage_{\mathcal {G}}(g_i)}{\sum _{g\in \mathcal {M}} usage_{\mathcal {G}}(g)}\right). \]
We provide the details of how the motif usages are calculated in the next section, when we introduce graph encoding.
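The relation between usage and code length can be sketched in a few lines; the usage counts here are made up for illustration:

```python
import math

# Shannon-optimal code-word lengths from motif usages: a motif's code length
# is the negative log of its relative usage over the whole database.
usage = {"g1": 8, "g2": 4, "g3": 2, "g4": 2}   # usage_G(g_i), illustrative

total = sum(usage.values())
code_len = {g: -math.log2(u / total) for g, u in usage.items()}

print(code_len["g1"])   # -log2(8/16) = 1.0 bit (most used, shortest code)
print(code_len["g3"])   # -log2(2/16) = 3.0 bits
```

More frequently used motifs thus receive shorter code-words, which is exactly what drives the compression.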
Next, we present how a motif \(g_i\) is encoded. Let \(n_i\) denote the number of nodes it contains (e.g., \(g_1\) in Figure 1(b) contains three nodes). The encoding proceeds recursively in a Depth First Search (DFS)-like fashion, as given in Algorithm 1. As noted, the encoding lengths summed over the course of the algorithm provide \(L(g_i)\), which can be explicitly written as follows:
where we first encode the number of unique node labels, followed by the entries (motifs and codes) in the motif table.
4.2 Encoding a Graph Given the Motif Table
To encode a given graph \(G_j\) based on a motif table \({MT}\), we “cover” its edges by a set of motifs in the \({MT}\). To formally define coverage, we first introduce a few definitions.
Given a motif occurrence \(g_{ij}\), we say that \(g_{ij}\) covers the edge set \(E_{ij}\subseteq E_j\) of \(G_j\). The task of encoding a graph \(G_j\) is then to cover all of its edges \(E_j\) using the motifs in \({MT}\).
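A quick way to see the covering requirement: the edge sets of the chosen occurrences must together account for every edge of \(G_j\) without overlapping. A toy check in Python (the edge sets are invented, and edge multiplicities are ignored for brevity):

```python
# Sketch: verify that a set of motif-occurrence edge sets covers E_j exactly,
# i.e., every edge is covered and no two occurrences share an edge.
E_j = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
occ_edges = [{("a", "b"), ("b", "c"), ("c", "a")}, {("c", "d")}]

covered = set().union(*occ_edges)
disjoint = sum(len(s) for s in occ_edges) == len(covered)
print(covered == E_j and disjoint)   # True: a valid, non-overlapping cover
```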
5 Search Algorithm
Our aim is to compress as large a portion of the input graphs as possible using motifs. This goal can be restated as finding a large set of non-overlapping motif occurrences that cover these graphs. We set up this problem as an instance of the Maximum Independent Set (MIS) problem on what we call the occurrence graph \({G_\mathcal {O}}\). In \({G_\mathcal {O}}\), the nodes represent motif occurrences, and edges connect two occurrences that share a common edge. MIS ensures that occurrences in the solution are non-overlapping (thanks to independence, no two are incident to the same edge). Moreover, it helps us identify motifs that have large usages, i.e., numbers of non-overlapping occurrences (thanks to maximality), which are associated with shorter code lengths and hence better compression.
In this section, first we describe how we set up and solve the MIS problems, which provides us with a set of candidate motifs that can go into the \({MT}\) as well as their (non-overlapping) occurrences in input graphs. We then present a search procedure for selecting a subset of motifs among those candidates to minimize the total encoding length in Equation (1).
5.1 Step 1: Identifying Candidate Motifs & Their Occurrences
As a first attempt, we explicitly construct a \({G_\mathcal {O}}\) per input graph and solve MIS on it. Later, we present an efficient way for solving MIS without constructing any \({G_\mathcal {O}}\)’s, which cuts down memory requirements drastically.
5.1.1 First Attempt: Constructing G𝒪’s Explicitly.
Occurrence Graph Construction. For each \(G_j\in \mathcal {G},\) we construct a \({G_\mathcal {O}}=({V_\mathcal {O}},{E_\mathcal {O}})\) as follows. For \(k=3,\ldots ,10\), we enlist all connected induced k-node subgraphs of \(G_j\), each of which corresponds to an occurrence of some k-node motif in \(G_j\). All those define the node set \({V_\mathcal {O}}\) of \({G_\mathcal {O}}\). If any two enlisted occurrences share at least one common edge in \(G_j\), we connect their corresponding nodes in \({G_\mathcal {O}}\) with an edge.
Notice that we do not explicitly enumerate all possible k-node labeled motifs and then identify their occurrences in \(G_j\), if any, which would be expensive due to subgraph enumeration (especially with many node labels) and numerous graph isomorphism tests. The above procedure yields occurrences of all existing motifs in \(G_j\) implicitly.
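The enumeration step can be sketched as follows for a single k, on a tiny invented edge set; a real implementation would iterate over k = 3, …, 10 and account for edge multiplicities:

```python
from itertools import combinations

# Sketch: enlist all connected induced k-node subgraphs of a directed graph;
# each one is an occurrence of some (implicit) k-node motif.
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
nodes = {u for e in edges for u in e}

def connected(subset):
    """Weak-connectivity check of the induced subgraph on `subset`."""
    sub = [(u, v) for (u, v) in edges if u in subset and v in subset]
    seen, frontier = {subset[0]}, [subset[0]]
    while frontier:
        x = frontier.pop()
        for u, v in sub:
            for y in ((v,) if u == x else (u,) if v == x else ()):
                if y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen == set(subset)

k = 3
occurrences = [s for s in combinations(sorted(nodes), k) if connected(s)]
print(occurrences)   # connected induced 3-node subgraphs
```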
Greedy \({MIS}\) Solution. For the \({MIS}\) problem, we employ a greedy algorithm (Algorithm 3) that sequentially selects the node with the minimum degree to include in the solution set \(\mathcal {O}\), then removes it along with its neighbors from the graph, until \({G_\mathcal {O}}\) is empty. Let \(deg_{{G_\mathcal {O}}}(v)\) denote the degree of node v in \({G_\mathcal {O}}\) (the initial one, or the one obtained after removing nodes over the course of the algorithm), and \(\mathcal {N}(v) = \lbrace u \in {V_\mathcal {O}}| (u,v) \in {E_\mathcal {O}}\rbrace\) be the set of v’s neighbors in \({G_\mathcal {O}}\).
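A compact version of this greedy heuristic, on a small invented occurrence graph given as an adjacency map:

```python
# Sketch of greedy minimum-degree MIS: repeatedly take a minimum-degree node
# into the solution and delete it together with its neighbors, until empty.
adj = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}

def greedy_mis(adj):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # mutable local copy
    solution = set()
    while adj:
        v = min(adj, key=lambda x: len(adj[x]))       # minimum-degree node
        solution.add(v)
        for u in adj[v] | {v}:                        # delete v and N(v)
            for w in adj.pop(u):
                adj.get(w, set()).discard(u)
    return solution

print(greedy_mis(adj))
```

Maintaining node degrees in a priority heap (with lazy updates) gives the \(O((|V_\mathcal{O}| + |E_\mathcal{O}|)\log |V_\mathcal{O}|)\) variant discussed in the complexity analysis.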
Approximation ratio: The greedy algorithm for MIS provides a \(\Delta\)-approximation [18] where \(\Delta = \max _{v \in {V_\mathcal {O}}} deg_{{G_\mathcal {O}}}(v)\). In our case, \(\Delta\) could be fairly large, e.g., in 1,000s, as many occurrences overlap due to edge multiplicities. Here, we strengthen this approximation ratio to \(\min \lbrace \Delta , \Gamma \rbrace\) as follows, where \(\Gamma\) is around 10 in our data.
Proof. Let \(MIS(G_{\mathcal {O}})\) denote the number of nodes in the optimal solution set for MIS on \(G_{\mathcal {O}}\). We prove that
Let \(G_{\mathcal {O}}(v)\) be the vertex-induced subgraph on \(G_{\mathcal {O}}\) by the vertex set \(\mathcal {N}_v = \lbrace v\rbrace \cup \mathcal {N}(v)\) when node v is selected by the greedy algorithm to include in \(\mathcal {O}\). \(G_{\mathcal {O}}(v)\) has \(\lbrace (u,w) \in E_{\mathcal {O}} | u,w \in \mathcal {N}_v\rbrace\) as the set of edges. In our greedy algorithm, when a node v is selected to the solution set \(\mathcal {O}\), v and all of its current neighbors are removed from \(G_{\mathcal {O}}\). Hence,
\[\begin{eqnarray*} \forall u, v \in \mathcal {O}, \mathcal {N}_u \cap \mathcal {N}_v = \emptyset . \nonumber \nonumber \end{eqnarray*}\]
In other words, all the subgraphs \(G_{\mathcal {O}}(v), v \in \mathcal {O}\) are independent (they do not share any nodes or edges). Moreover, since the greedy algorithm runs until \(V_{\mathcal {O}} = \emptyset\), i.e., until all the nodes are removed from \(G_{\mathcal {O}}\), we also have
This upper-bound on \(MIS(G_{\mathcal {O}})\) can easily be seen from the fact that the optimal solution set in \(G_{\mathcal {O}}\) can be decomposed into subsets, where each subset is an independent set for a subgraph \(G_{\mathcal {O}}(v)\); hence, it is only a feasible solution of the \(MIS\) problem on that subgraph, whose own optimal solution can only be larger.
Next, we will find an upper-bound for \(MIS(G_{\mathcal {O}}(v))\) and then plug it back in Equation (11) to derive the overall bound. There are two different upper-bounds for \(MIS(G_{\mathcal {O}}(v))\) as follows:
Upper-bound 1: Assume that \(E_{\mathcal {O}} \ne \emptyset\); otherwise, the greedy algorithm finds the optimal solution, which is \(V_{\mathcal {O}}\), and the theorem automatically holds. We establish the first bound:
Note that \(\Delta\) is the maximum degree in the initial graph, which is always larger or equal to the maximum degree at any point after removing nodes and edges.
Upper-bound 2: Let us consider the case that the optimal solution for MIS on \(G_{\mathcal {O}}(v)\) is a subset of \(\mathcal {N}(v)\). Note that each node \(v \in V_{\mathcal {O}}\) is associated with an occurrence, which contains a set of edges in \(G_j\), denoted by \(E_j(v) = \lbrace e_{v1}, e_{v2}, \dots \rbrace\), and u and v are neighbors if \(E_j(u) \cap E_j(v) \ne \emptyset\). Moreover, any pair of nodes \(u, w\) in the optimal independent set solution of \(G_{\mathcal {O}}(v)\) satisfies \(E_j(u) \cap E_j(w) = \emptyset\). Thus, the largest possible independent set is \(\lbrace u_1, \dots , u_{|E_j(v)|}\rbrace\) such that \(E_j(u_l) \cap E_j(v) = \lbrace e_{vl}\rbrace\) (\(u_l\) must be a minimal neighbor, sharing a single edge with v’s occurrence) and \(E_j(u_l) \cap E_j(u_k) = \emptyset , \forall l \ne k\) (every node is independent of the others). Therefore, we derive the second upper-bound:
Complexity analysis: Algorithm 3 requires finding the node with minimum degree (line 3) and removing nodes and incident edges (line 5); hence, a naïve implementation has \(O(|{V_\mathcal {O}}|^2 + |{E_\mathcal {O}}|)\) time complexity: \(O(|{V_\mathcal {O}}|^2)\) for searching for the node with minimum degree at most \(|{V_\mathcal {O}}|\) times (line 3) and \(O(|{E_\mathcal {O}}|)\) for updating node degrees (line 5). If we use a priority heap to maintain node degrees for a quick minimum search, the time complexity becomes \(O((|{V_\mathcal {O}}| + |{E_\mathcal {O}}|) \log |{V_\mathcal {O}}|)\): \(|{V_\mathcal {O}}| \log |{V_\mathcal {O}}|\) for constructing the heap initially and \(|{E_\mathcal {O}}| \log |{V_\mathcal {O}}|\) for updating the heap every time a node and its neighbors are removed (this includes deleting the degrees of the removed nodes and updating those of all of \(\mathcal {N}(v)\)’s neighbors).
Algorithm 3 requires \(O(|{V_\mathcal {O}}| + |{E_\mathcal {O}}|)\) space to store the occurrence graph.
5.1.2 Memory-Efficient Solution: MIS w/out Explicit G𝒪.
Notice that the size of each \({G_\mathcal {O}}\) can be very large due to the combinatorial number of induced k-node subgraphs, which demands huge memory and time. Here, we present an efficient version of the greedy algorithm that drastically cuts down the input size to MIS. Our new algorithm leverages a property of degree-equivalent occurrences (Property 1), which we build up to with the following definition.
Let us introduce a new notion called a simple occurrence. A simple occurrence \(sg_{ij} = (V_{ij}, E_{ij})\) is a simple subgraph of \(G_j\), without edge multiplicities, that is isomorphic to a motif. Let \(\lbrace sg_{1j}, sg_{2j}, \dots , sg_{tj}\rbrace\) be the set of all simple occurrences in \(G_j\). Note that two simple occurrences may correspond to the same motif.
Recall that the greedy algorithm only requires the node degrees in \({G_\mathcal {O}}\). Since all the nodes corresponding to occurrences that “spring out” of a simple occurrence have the same degree (Property 1), we simply use the simple occurrence as a “compound node” in place of all those degree-equivalent occurrences.2 As such, the nodes in \({G_\mathcal {O}}\) now correspond to simple occurrences only. The degree of each node (say, \(sg_{ij}\)) is calculated as follows:
where \(m(u,v)\) is the multiplicity of edge \((u,v)\) in \(G_j\). The first line of Equation (15) depicts the “internal degree” among the degree-equivalent occurrences that originate from \(sg_{ij}\). The rest captures the “external degree” to other occurrences that have an overlapping edge.
Memory-Efficient Greedy \({MIS}\) Solution. The detailed steps of our memory-efficient greedy \({MIS}\) algorithm are given in Algorithm 4. We first calculate the degrees of all simple occurrences (line 2) and then sequentially select the one with the minimum degree, denoted \(sg_{i^*j}\), and include it in the solution list \(\mathcal {S}\) (lines 5, 6). To account for this selection, we decrease the multiplicities of all its edges \((u,v)\in E_{i^{*}j}\) by 1 (line 7) and recalculate the degrees of the simple occurrences that overlap with \(sg_{i^{*}j}\) (lines 8–12). If one of those simple occurrences with an edge set intersecting that of \(sg_{i^{*}j}\) contains at least one edge with multiplicity equal to 0 (due to decreasing edge multiplicities in line 7), a special value of \(deg_{\max }+1\) (line 3) is assigned as its degree. This signifies that this compound node contains no more occurrences and is not to be considered in subsequent iterations (line 4).
Notice that we need not even construct an occurrence graph in Algorithm 4, which directly operates on the set of simple occurrences in \(G_j\), computing and updating degrees based on Equation (15). Note that the same simple occurrence could be picked more than once by the algorithm. The number of times a simple occurrence appears in \(\mathcal {S}\) is exactly the number of non-overlapping occurrences that spring out of it and get selected by Algorithm 3 on \({G_\mathcal {O}}\). As such \(\mathcal {O}\) and \(\mathcal {S}\) have the same cardinality and each motif has the same number of occurrences and simple occurrences in \(\mathcal {O}\) and \(\mathcal {S}\), respectively. As we need the number of times each motif is used in the cover set of a graph (i.e., its usage), both solutions are equivalent.
Complexity analysis: Calculating \(deg_{{G_\mathcal {O}}}(sg_{ij})\) of a simple occurrence takes \(O(t \cdot \Gamma)\), as Equation (15) requires all t intersecting simple occurrences \(\lbrace sg_{lj} \;\vert \; E_{lj} \cap E_{ij} \ne \emptyset \rbrace\), where each intersection can be computed in \(O(\Gamma)\), \(\Gamma\) being the maximum number of edges in a motif. Thus, Algorithm 4 requires \(O(t^2 \cdot \Gamma)\) for line 2 to calculate all degrees. Within the while loop (lines 4–12), the total number of degree recalculations (line 10) is bounded by \(O(t \cdot \Gamma \cdot \max _{(u,v) \in E_j} m(u,v))\), since each simple occurrence gets recalculated at most \(\Gamma \cdot \max _{(u,v) \in E_j} m(u,v)\) times. Finding the intersecting simple occurrences (line 8) takes \(O(t \cdot \Gamma)\). Overall, the time complexity of Algorithm 4 is \(O(t^2 \cdot \Gamma ^2 \cdot \max _{(u,v) \in E_j} m(u,v))\).
The space complexity of Algorithm 4 is \(O(t \cdot \Gamma + |E_j|)\), for t simple occurrences with at most \(\Gamma\) edges each, plus input edge multiplicities. Note that this is significantly smaller than the space complexity of Algorithm 3, i.e., \(O(|{V_\mathcal {O}}| + |{E_\mathcal {O}}|)\), since \(t \ll |{V_\mathcal {O}}|\) and \(|E_j| \ll |{E_\mathcal {O}}|\). Our empirical measurements on our SH dataset (see Table 2 in Section 6 for details) in Figure 2 show that \(|{V_\mathcal {O}}|\) is up to 9 orders of magnitude larger than t.
Fig. 2.
Table 2. Summary Statistics of Graph Datasets Used in Experiments

Name          #Graphs   #Node labels   #Nodes    #Multi-edges
SH             38,122        11        [2, 25]   [1, 782]
SH_Synthetic   38,122        11        [2, 25]   [1, 782]
HW             90,274        11        [2, 25]   [1, 897]
KD            152,105        10        [2, 91]   [1, 1,774]
Enron           1,139        16        [2, 87]   [1, 1,356]
5.1.3 Weighted Maximum Independent Set (WMIS).
A motivation leading to our \({MIS}\) formulation is to cover as much of each input graph as possible using motifs. The amount of coverage by an occurrence of a motif can be translated into the number of edges it contains. This suggests a weighted version of the maximum independent set problem, denoted \({WMIS}\), where we set the weight of an occurrence, i.e., node in \({G_\mathcal {O}}\), to be the number of edges in the motif it corresponds to. Hence, the goal is to maximize the total weight of the non-overlapping occurrences in the solution set.
Greedy \({WMIS}\) Solution. For the weighted version of \({MIS}\), we also have a greedy algorithm with the same approximation ratio as the unweighted one. The only difference from Algorithm 3 is the selection of node \(v_{\min }\) to remove (line 3). Let \(w_v\) denote the weight of node v in \({G_\mathcal {O}}\). Then,
Intuitively, Equation (16) prefers selecting large-weight nodes (to maximize total weight), but those that do not have too many large-weight neighbors in the \({G_\mathcal {O}}\), which cannot be selected.
Memory-efficient Greedy \({WMIS}\) Solution. Similar to the unweighted case, we can derive a memory-efficient greedy algorithm for \({WMIS}\), since Property 1 holds for both the degree \(deg_{{G_\mathcal {O}}}(v)\) and the weight \(w_v\) (the two core components of the node selection rule in Equation (16)): all occurrences that spring out of the same simple occurrence have the same number of edges.
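A sketch of the weighted variant is below. Note that the selection rule used here (minimize total neighbor weight relative to the node's own weight) is an assumed stand-in consistent with the stated intuition, not necessarily the exact form of Equation (16); the weights are invented.

```python
# Sketch of greedy WMIS. ASSUMPTION: the selection key below (neighbor weight
# sum divided by own weight) is an illustrative rule matching the intuition
# "large-weight nodes without too many large-weight neighbors"; the paper's
# exact selection rule is Equation (16).
adj = {1: {2}, 2: {1, 3}, 3: {2}}
w = {1: 3.0, 2: 4.0, 3: 3.0}

def greedy_wmis(adj, w):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # mutable local copy
    solution = set()
    while adj:
        v = min(adj, key=lambda x: sum(w[u] for u in adj[x]) / w[x])
        solution.add(v)
        for u in adj[v] | {v}:                        # delete v and N(v)
            for z in adj.pop(u):
                adj.get(z, set()).discard(u)
    return solution

print(greedy_wmis(adj, w))   # {1, 3}: total weight 6.0 beats picking node 2
```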
5.2 Step 2: Building the Motif Table
The \((W)MIS\) solutions, \(\mathcal {S}_j\)’s, provide us with non-overlapping occurrences of k-node motifs (\(k\ge 3\)) in each \(G_j\). The next task is to identify the subset of those motifs to include in our motif table \({MT}\) so as to minimize the total encoding length in Equation (1). We first define the set \(\mathcal {C}\) of candidate motifs:
We start with encoding the graphs in \(\mathcal {G}\) using the simplest code table that contains only the 2-node motifs. This code table, with optimal code lengths for database \(\mathcal {G}\), is called the Standard Code Table, denoted by \({SMT}\). It provides the optimal encoding of \(\mathcal {G}\) when nothing more is known than the frequencies of labeled edges (equal to usages of the corresponding 2-node motifs), which are assumed to be fully independent. As such, \({SMT}\) does not yield a good compression of the data but provides a practical bound.
To find a better code table, we use the best-first greedy strategy: Starting with \({MT}:={SMT}\), we try adding each of the candidate motifs in \(\mathcal {C}\) one at a time. Then, we pick the “best” one that leads to the largest reduction in the total encoding length. We repeat this process with the remaining candidates until no addition leads to a better compression or all candidates are included in the \({MT}\), in which case the algorithm terminates.
The details are given in Algorithm 5. We first calculate the usage of 2-node motifs per \(G_j\) (lines 1–4) and set up the \({SMT}\) accordingly (line 5). For each candidate motif \(g\in \mathcal {C}\) (line 7) and each \(G_j\) (line 8), we identify the occurrences of g in \(G_j\)’s cover set, \(\mathcal {O}(g,\mathcal {CS}_j)\), which are exactly the simple occurrences selected by Algorithm 4 that are isomorphic to g (line 9). When we insert g into \({MT}\), the usage of some 2-node motifs, specifically those that correspond to the labeled edges of g, decreases by \(|\mathcal {O}(g,\mathcal {CS}_j)|\), which becomes the usage of g in \(G_j\)’s encoding (lines 10–12). Note that the usages of \((k\ge 3)\)-node motifs already in the \({MT}\) are not affected, since their occurrences in each \(\mathcal {S}_j\), which we use to cover \(G_j\), do not overlap; i.e., their uses in covering a graph are independent. As such, updating usages when we insert a new motif into \({MT}\) is quite efficient. Having inserted g and updated the usages, we remove 2-node motifs whose usage drops to zero from \({MT}\) (line 13) and compute the total encoding length with the resulting \({MT}\) (lines 14–15). The rest of the algorithm (lines 16–20) picks the “best” g to insert, i.e., the one that leads to the largest savings, if any, or otherwise quits.
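The best-first loop of Algorithm 5 can be sketched with a Krimp-style usage model (a simplified sketch: it scores only the data encoding with Shannon-optimal code lengths and ignores the cost of describing the motif table itself; all names are illustrative):

```python
import math

def total_bits(usage):
    """Shannon-optimal total data encoding length (bits) for given usages."""
    tot = sum(usage.values())
    return sum(-u * math.log2(u / tot) for u in usage.values() if u > 0)

def build_motif_table(smt_usage, candidates):
    """Best-first greedy motif-table construction (simplified sketch).

    smt_usage:  dict labeled-edge -> usage (the Standard Motif Table)
    candidates: dict motif -> (count, edges), where count is the motif's
                total non-overlapping occurrences and edges lists the
                labeled edges each occurrence replaces.
    """
    usage, table, remaining = dict(smt_usage), [], dict(candidates)
    while remaining:
        best, best_bits = None, total_bits(usage)
        for g, (count, edges) in remaining.items():
            trial = dict(usage)
            trial[g] = count
            for e in edges:              # each occurrence of g replaces
                trial[e] -= count        # one copy of each labeled edge
            trial = {m: u for m, u in trial.items() if u > 0}
            bits = total_bits(trial)
            if bits < best_bits:         # keep the largest saving
                best, best_bits, best_usage = g, bits, trial
        if best is None:
            break                        # no candidate improves compression
        table.append(best)
        usage = best_usage
        del remaining[best]
    return table, usage
```

A motif that replaces many frequent labeled-edge pairs concentrates the usage distribution and so shortens the total encoding, which is exactly when the greedy loop admits it.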
6 Experiments
Datasets. Our work is initiated by a collaboration with industry, and CODEtect is evaluated on large real-world datasets containing all transactions of 2016 (tens to hundreds of thousands of transaction graphs) from three different companies, anonymized as SH, HW, and KD (proprietary) and summarized in Table 2. These do not come with ground-truth anomalies. For quantitative evaluation, our expert collaborators inject two types of anomalies into each dataset based on domain knowledge (Section 6.1), and also qualitatively verify the detected anomalies from an accounting perspective (Section 6.2).
Since our transaction data are not shareable, to facilitate reproducibility of the results, we generate synthetic data that resemble the real-world datasets used. In particular, we add SH_Synthetic, which contains random graphs generated based on statistical characteristics of SH as follows: (1) The number of nodes in a graph is randomly drawn following its distribution in SH, depicted in Figure 3(a); (2) The number of single edges in a graph is drawn randomly following the distribution in SH illustrated in Figure 3(b) (if the graph has more edges than the maximum number, \(n\times (n-1)/2\), we restart the process); (3) For each node, we randomly assign its label following the label distribution in SH, as shown in Figure 3(c); (4) For each single edge, we then randomly generate its multiplicity based on the distribution of the corresponding node-label pair (an example is given in Figure 3(d)). In the end, we have the same number of graphs with characteristics similar to SH, and we share these data along with our code.
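Steps (1)–(4) can be sketched as follows (a hypothetical sketch, assuming the empirical SH distributions have been tabulated into the arguments shown; none of these names come from the CODEtect code):

```python
import random

def sample_synthetic_graph(n_dist, e_dist, label_dist, mult_dist):
    """Generate one synthetic graph following the four steps above.

    n_dist / e_dist: (values, weights) empirical distributions for the
        number of nodes and number of single edges;
    label_dist: dict node label -> probability;
    mult_dist: dict (src_label, dst_label) -> (values, weights)
        distribution of edge multiplicities.
    """
    # (1) #nodes and (2) #single-edges; restart if infeasible
    while True:
        n = random.choices(*n_dist)[0]
        m = random.choices(*e_dist)[0]
        if m <= n * (n - 1) // 2:        # the cap used in the text
            break
    # (3) a label per node, following the empirical label distribution
    labels = random.choices(list(label_dist),
                            weights=list(label_dist.values()), k=n)
    # place m distinct directed single edges uniformly at random
    pairs = [(u, v) for u in range(n) for v in range(n) if u != v]
    edges = random.sample(pairs, m)
    # (4) multiplicity per edge, conditioned on the endpoint label pair
    graph = {}
    for u, v in edges:
        vals, wts = mult_dist[(labels[u], labels[v])]
        graph[(u, v)] = random.choices(vals, weights=wts)[0]
    return labels, graph
```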
Fig. 3.
We also study the public Enron database [20], consisting of daily e-mail graphs of its 151 employees over the 3 years surrounding the financial scandal. Nodes depict employee e-mail addresses and edges indicate e-mail exchanges. Each node is labeled with the employee’s department (Energy Operations, Regulatory and Government Affairs, etc.) and edge multiplicity denotes the number of e-mails exchanged.
6.1 Anomaly Detection Performance
We show that CODEtect is substantially better at detecting graph anomalies than a list of baselines across various performance measures. The anomalies are injected by domain experts and mimic schemes related to money laundering, entry errors, or malfeasance in accounting, specifically:
–
Path injection (money-laundering-like): (i) Delete a random edge \((u,v) \in E_j\), and (ii) add a length-2 or length-3 path u–w(–z)–v where at least one edge of the path is rare (i.e., exists in only a few \(G_j\)’s). The scheme mimics money laundering, where money is transferred through multiple hops rather than directly from the source to the target.
–
Label injection (entry-error or malfeasance): (i) Pick a random node \(u \in V_j\), and (ii) Replace its label \(t(u)\) with a random label \(t \ne t(u)\). This scheme mimics either simply an entry error (wrong account), or malfeasance that aims to reflect larger balance on certain types of account (e.g., revenue) in order to deceive stakeholders.
For path injection, we choose 3% of graphs and inject anomalous paths that replace 10% of edges (or 1 edge if 10% of edges is less than 1). For label injection, we also choose 3% of graphs and label-perturb 10% of the nodes (or 1 node if 10% of nodes is less than 1). We also tested with different severity levels of injection, i.e., 30% and 50% of edges or nodes, and observed similar results to those with 10%. The goal is to detect those graphs with injected paths or labels.
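For illustration, the two injection schemes can be mimicked as below (a simplified sketch on a single graph; the real injections additionally force at least one rare labeled edge on the new path, which this sketch does not model):

```python
import random

def inject_path_anomaly(edges, nodes):
    """Path injection: remove a random directed edge (u, v) and replace
    it with a length-2 or length-3 path u -> w (-> z) -> v."""
    u, v = random.choice(sorted(edges))
    edges.discard((u, v))
    hops = random.sample([w for w in nodes if w not in (u, v)],
                         k=random.choice([1, 2]))   # 1 or 2 intermediate hops
    path = [u] + hops + [v]
    edges.update(zip(path, path[1:]))               # add the detour edges
    return edges

def inject_label_anomaly(labels, label_universe):
    """Label injection: replace a random node's label with a different one."""
    u = random.choice(sorted(labels))
    labels[u] = random.choice([t for t in label_universe if t != labels[u]])
    return labels
```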
Baselines: We compare CODEtect with:
–
SMT: A simplified version of CODEtect that uses the Standard Motif Table to encode the graphs.
–
GLocalKD [26]: A recent graph neural network (GNN)-based approach that leverages knowledge distillation during training and was shown to achieve good performance and robustness in semi-supervised and unsupervised settings. We used the default setting of 3 layers, 512 hidden dimensions, and 256 output dimensions, as recommended in the original article [26]. Independently, we also performed sensitivity analysis by varying the hyper-parameter settings and found that the default setting returned the top performance.
–
Gbad [12]: The closest existing approach for anomaly detection in node-labeled graph databases (See Section 2). Since it cannot handle multi-edges, we input the \(G_j\)’s as simple graphs setting all the edge multiplicities to 1.
–
Graph Embedding + iForest: We pair different graph representation learning approaches with state-of-the-art outlier detector iForest [25], as they cannot directly detect anomalies. We consider the following combinations:
–
Graph2Vec [31] (G2V)+iForest: G2V cannot handle edge multiplicities; thus, we set all to 1.
–
GF+iForest: Graph (numerical) features (GF) include number of edges of each label-pair and number of nodes of each label.
–
Entropy quantifies the skewness of the distribution of the (non-zero) number of edges over all possible label pairs and uses it as the anomaly score. A smaller entropy implies the existence of rare label-pairs and hence higher anomalousness.
–
Multiedges uses the sum of edge multiplicities as the anomaly score. We also tried other simple statistics, e.g., #nodes, #edges, and their sum/product, which do not perform well.
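These two scores can be sketched directly (a minimal sketch; the function names are ours, not from the baselines' implementations):

```python
import math
from collections import Counter

def entropy_score(edge_label_pairs):
    """Entropy baseline: entropy of the edge-count distribution over the
    label pairs that occur; lower entropy means rarer label pairs, hence
    higher anomalousness, so the anomaly score is the negated entropy."""
    counts = Counter(edge_label_pairs)   # one (src_label, dst_label) per edge
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return -h

def multiedges_score(multiplicities):
    """Multiedges baseline: total edge multiplicity as the anomaly score."""
    return sum(multiplicities)
```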
Performance measures: Based on the ranking of graphs by anomaly score, we measure Precision at top-k for \(k=\lbrace 10,100,1000\rbrace\), and also report the Area Under the ROC Curve (AUC) and Average Precision (AP), i.e., the area under the precision-recall curve. Since most of the methods, including CODEtect, are deterministic, we run them once and report all the measures. For Graph2Vec and Deep Graph Kernel, which involve some randomization, we run multiple times to check the consistency of performance. Additionally, for methods with hyper-parameters, we use the default settings from the corresponding publicly available source codes.
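Both ranking measures can be computed without any library as a sanity check (a plain sketch; scores and binary anomaly labels are parallel sequences):

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring graphs that are true anomalies."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    return sum(label for _, label in ranked[:k]) / k

def auc(scores, labels):
    """Area under ROC via the rank statistic: the probability that a
    random anomaly outscores a random normal graph (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```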
6.1.1 Detection of Path Anomalies.
We report detection results on SH and HW datasets in Table 3(a) and (b) (performance on KD is similar and omitted for brevity) and on SH_Synthetic in Table 3(c).
Table 3.
CODEtect consistently outperforms all baselines by a large margin across all performance measures in detecting path anomalies. More specifically, CODEtect provides a 16.9% improvement over the runner-up (underlined) on average across all measures on SH, and 10.2% on HW. Note that the runner-up is not the same baseline across different performance measures. The benefit of motif search is evident from the superior results over SMT. G2V+iForest produces decent performance w.r.t. most measures but still falls far behind CODEtect. Similar observations hold on SH_Synthetic.
6.1.2 Detection of Label Anomalies.
Table 4(a) and (b) report detection results on the two larger datasets, HW and KD (performance on SH is similar and omitted for brevity), and Table 4(c) provides results on SH_Synthetic. Note that Gbad and G2V+iForest failed to complete within five days on KD; thus, their results are absent in Table 4(b). KD is a relatively large-scale dataset, having both larger and more graphs than the other datasets.
Table 4.
In general, we observe a similar performance advantage of CODEtect over the baselines for label anomalies. The exceptions are Gbad and GLocalKD, which perform comparably and appear better suited to label anomalies, potentially because changing node labels disrupts structure more than the addition of a few short isolated paths. Gbad, however, does not scale to KD, and the runner-up on this dataset performs significantly worse. Similar observations are also seen on the SH_Synthetic dataset.
6.2 Case Studies
Case 1—Anomalous transaction records: The original accounting databases we are provided with by our industry partner do not contain any ground-truth labels. Nevertheless, they raise the question of whether CODEtect unearths any dubious journal entries that would raise an eyebrow from an economic bookkeeping perspective. In collaboration with their accounting experts, we analyze the top 20 cases as ranked by CODEtect. Due to space limits, we elaborate on one case study from each dataset/corporation as follows:
In SH, we detect a graph with a large encoding length yet relatively few (27) multi-edges, as shown in Figure 5, consisting of several small disconnected components. In accounting terms, the transaction is extremely complicated, likely the result of a (quite rare) “business restructuring” event. In this single journal entry, there exist many independent simple entries, involving only one or two operating-expense (OE) accounts, while other edges arise from compound entries (involving more than three accounts). This event involves reversals (back to prepaid expenses) as well as re-classification of previously booked expenses. The fact that all these bookings are recorded within a single entry leaves room for manipulation of economic performance and mis-reporting via re-classification, which deserves an audit for careful re-examination.
In Figure 4 (left), we show a motif with a sole usage of 1 in the dataset, which is used to cover an anomalous graph (right) in HW. The edge from NGL (non-operating gains & losses) to C (cash) depicts an unrealized foreign exchange gain and is quite unusual from an economic bookkeeping perspective. This is because, by definition, unrealized gains and losses do not involve cash. Therefore, proper booking of the creation or relinquishment of such gains or losses should not involve cash accounts. Another peculiarity is the three separate disconnected components, each of which represents a very distinct economic transaction: one on a bank charge related to a security deposit, one on health-care and travel-related foreign-currency business expenses (these two are short-term activities), and a third one on some on-going construction (long-term in nature). It is questionable why these diverse transactions are grouped into a single journal. Finally, the on-going construction portion involves reclassifying a long-term asset into a suspense account, which requires follow-up attention and final resolution.
Fig. 4.
Fig. 5.
Finally, the anomalous journal entry from KD involves the motif shown in Figure 6 (left), where the corresponding graph is the exact motif with multiplicity 1, shown on the right. This motif has a sole usage of 1 in the dataset and is odd from an accounting perspective. Economically, it represents giving up an existing machine, which is a long-term operating asset (LOA), in order to reduce a payable or an outstanding short-term operating liability (SOL) owed to a vendor. Typically, one would sell the machine and get cash to pay off the vendor, with some gains or losses. We also note that the \({MT}\) does not contain the 2-node motif LOA\(\rightarrow\)SOL. The fact that it only shows up once, within a single-usage motif, makes it suspicious.
Fig. 6.
Besides the quantitative evidence on detection performance, these case studies provide qualitative support for the effectiveness of CODEtect in identifying anomalies of interest in accounting domain terms, worthy of auditing and re-examination.
Case 2 - Enron scandal: We study the correlation between CODEtect’s anomaly scores of the daily e-mail exchange graphs and the major events in Enron’s timeline. Figure 7 shows that days with large anomaly scores mark drastic discontinuities in time, which coincide with important events related to the financial scandal.3 It is also noteworthy that the anomaly scores follow an increasing trend over days, capturing the escalation of events up to key personnel testifying in front of Congressional committees.
Fig. 7.
6.3 Scalability
To showcase the scalability of CODEtect with regard to running time and memory consumption, we randomly select subsets of graphs from the KD database with different sizes, i.e., \(\lbrace 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100\rbrace \,\times \, 10^3\). We re-sample each subset of graphs three times and report the averaged result for each setting. A summary of the results is presented in Figure 8. We observe that CODEtect scales linearly with the size of the input graphs, as measured in the number of multi-edges, with respect to both time and memory usage.
Fig. 8.
7 Conclusion
We introduced CODEtect, (to our knowledge) the first graph-level anomaly detection method for node-labeled, directed multi-graph databases, which appear in numerous real-world settings such as social networks and financial transactions, to name a few. The main idea is to identify key network motifs that encode the database concisely and to employ the compression length as the anomaly score. To this end, we presented (1) novel lossless encoding schemes and (2) efficient search algorithms. Experiments on transaction databases from three different corporations quantitatively showed that CODEtect significantly outperforms both prior and more recent GNN-based baselines across datasets and performance metrics. Case studies, including on the Enron database, presented qualitative evidence of CODEtect’s effectiveness in spotting instances that are worthy of auditing and re-examination.
Footnotes
1
To ensure unique decoding, we assume prefix code(word)s, in which no code is the prefix of another.
1
MDL-optimal cost of integer k is \(L_{\mathbb {N}}(k) = \log ^{\star }k + \log _2 c\); \(c \approx 2.865\); \(\log ^{\star }k = \log _2 k + \log _2(\log _2 k) + \ldots\) summing only the positive terms [40].
2
Given \(sg_{ij}\), the number of its degree-equivalent occurrences in \(G_j\) is the product of edge multiplicities \(m(u,v)\) for \((u,v)\in E_{ij}\).
Leman Akoglu, Mary McGlohon, and Christos Faloutsos. 2010. Oddball: Spotting anomalies in weighted graphs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 410–421.
Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph based anomaly detection and description: A survey. Data Mining and Knowledge Discovery 29, 3 (2015), 626–688.
Leman Akoglu, Hanghang Tong, Jilles Vreeken, and Christos Faloutsos. 2012. Fast and reliable anomaly detection in categorical data. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 415–424.
Mohammad Al Hasan and Vachik S. Dave. 2018. Triangle counting in large networks: A review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 2 (2018), e1226.
Sambaran Bandyopadhyay, Saley Vishal Vivek, and M. N. Murty. 2020. Outlier resistant unsupervised deep architectures for attributed network embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining. 25–33.
Peter Bloem and Steven de Rooij. 2020. Large-scale network motif analysis using compression. Data Mining and Knowledge Discovery 34, 5 (2020), 1421–1453.
Kaize Ding, Jundong Li, Rohit Bhanushali, and Huan Liu. 2019. Deep anomaly detection on attributed networks. In Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 594–602.
Ethan R. Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, and Alexandros G. Dimakis. 2015. Beyond triangles: A distributed framework for estimating 3-profiles of large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams (Eds.). ACM, 229–238.
Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra. 2018. SpotLight: Detecting anomalies in streaming graphs. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1378–1386.
Yujie Fan, Shifu Hou, Yiming Zhang, Yanfang Ye, and Melih Abdulhayoglu. 2018. Gotcha-sly malware! scorpion a metagraph2vec based malware detection system. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 253–262.
Aditya Grover and Jure Leskovec. 2016. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
Magnús M. Halldórsson and Jaikumar Radhakrishnan. 1997. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. Algorithmica 18, 1 (1997), 145–163.
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1024–1034.
Bryan Hooi, Kijung Shin, Hyun Ah Song, Alex Beutel, Neil Shah, and Christos Faloutsos. 2017. Graph-based fraud detection in the face of camouflage. TKDD 11, 4 (2017), 1–26.
Danai Koutra, U. Kang, Jilles Vreeken, and Christos Faloutsos. 2014. VOG: Summarizing and understanding large graphs. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 91–99.
Lauri Kovanen, Márton Karsai, Kimmo Kaski, János Kertész, and Jari Saramäki. 2011. Temporal motifs in time-dependent networks. Journal of Statistical Mechanics: Theory and Experiment 2011, 11 (2011), P11005.
Rongrong Ma, Guansong Pang, Ling Chen, and Anton van den Hengel. 2022. Deep graph-level anomaly detection by glocal knowledge distillation. In Proceedings of the 15th ACM International Conference on Web Search and Data Mining. 704–714.
Xiaoxiao Ma, Jia Wu, Shan Xue, Jian Yang, Chuan Zhou, Quan Z. Sheng, Hui Xiong, and Leman Akoglu. 2021. A comprehensive survey on graph anomaly detection with deep learning. IEEE Transactions on Knowledge and Data Engineering.
Emaad A. Manzoor, Sadegh M. Milajerdi, and Leman Akoglu. 2016. Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1035–1044.
Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon. 2004. Superfamilies of evolved and designed networks. Science 303, 5663 (2004), 1538–1542.
Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: Simple building blocks of complex networks. Science 298, 5594 (2002), 824–827.
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 419–432.
Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
Caleb C. Noble and Diane J. Cook. 2003. Graph-based anomaly detection. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 631–636.
Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. 2017. Motifs in temporal networks. In Proceedings of the 10th ACM International Conference on Web Search and data Mining. ACM, 601–610.
Ramesh Paudel and William Eberle. 2020. SNAPSKETCH: Graph representation approach for intrusion detection in a streaming graph. In MLG 2020: 16th International Workshop on Mining and Learning with Graphs. ACM.
Bryan Perozzi and Leman Akoglu. 2016. Scalable anomaly ranking of attributed neighborhoods. In Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 207–215.
Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sánchez, and Emmanuel Müller. 2014. Focused clustering and outlier detection in large attributed graphs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1346–1355.
Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. 2018. Butterfly counting in bipartite networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Yike Guo and Faisal Farooq (Eds.). ACM, 2150–2159.
Comandur Seshadhri and Srikanta Tirthapura. 2019. Scalable subgraph counting: The methods behind the madness. In Companion Proceedings of the 2019 World Wide Web Conference (WWW ’19). Association for Computing Machinery, New York, NY, USA, 1317–1318.
Neil Shah, Danai Koutra, Tianmin Zou, Brian Gallagher, and Christos Faloutsos. 2015. TimeCrunch: Interpretable dynamic graph summarization. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1055–1064.
Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In Proceedings of the Artificial Intelligence and Statistics. 488–495.
Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos. 2016. Corescope: Graph mining using k-core analysis–patterns, anomalies and algorithms. In 2016 IEEE 16th International Conference on Data Mining. IEEE, 469–478.
Koen Smets and Jilles Vreeken. 2011. The odd one out: Identifying and characterising anomalies. In Proceedings of the 2011 SIAM International Conference on Data Mining. SIAM/Omnipress, 804–815.
Nikolaj Tatti and Jilles Vreeken. 2008. Finding good itemsets by packing data. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE Computer Society, 588–597.
Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. 2008. Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 567–580.
Johan Ugander, Lars Backstrom, and Jon M. Kleinberg. 2013. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd international conference on World Wide Web, Daniel Schwabe, Virgílio A. F. Almeida, Hartmut Glaser, Ricardo A. Baeza-Yates, and Sue B. Moon (Eds.). ACM, 1307–1318.
A. Vázquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z. N. Oltvai, and A.-L. Barabási. 2004. The topological relationship between the large-scale attributes and local interaction patterns of complex networks. PNAS 101, 52 (2004), 17940–17945.
Jilles Vreeken, Matthijs van Leeuwen, and Arno Siebes. 2011. Krimp: Mining itemsets that compress. Data Mining and Knowledge Discovery 23, 1 (2011), 169–214.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations.
Pinar Yanardag and S. V. N. Vishwanathan. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1365–1374.
Lingxiao Zhao and Leman Akoglu. 2021. On using classification datasets to evaluate graph outlier detection: Peculiar observations and new insights. Big Data (2021).
Tong Zhao, Chuchen Deng, Kaifeng Yu, Tianwen Jiang, Daheng Wang, and Meng Jiang. 2020. Error-bounded graph anomaly loss for GNNs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1873–1882.