Exact string matching in labeled graphs is the problem of searching for paths of a graph G=(V, E) such that the concatenation of their node labels is equal to a given pattern string P[1..m]. This basic problem lies at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks.
We prove a conditional lower bound stating that, for any constant ε > 0, an O(|E|^{1-ε} m)-time, or an O(|E| m^{1-ε})-time algorithm for exact string matching in graphs, with node labels and pattern drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. This holds even if restricted to undirected graphs with maximum node degree 2, that is, to zig-zag matching in bidirectional strings, or to deterministic directed acyclic graphs whose nodes have maximum sum of indegree and outdegree 3. These restricted cases make the lower bound stricter than what can be directly derived from related bounds on regular expression matching (Backurs and Indyk, FOCS’16). In fact, our bounds are tight in the sense that lowering the degree or the alphabet size yields linear time solvable problems.
An interesting corollary is that exact and approximate matching are equally hard (i.e., quadratic time) in graphs under SETH. In comparison, the same problems restricted to strings have linear time and quadratic time solutions, respectively (approximate pattern matching also has a matching SETH lower bound (Backurs and Indyk, STOC’15)).
1 Introduction
String matching is the classical problem of finding the occurrences of a pattern as a substring of a text [36]. As most of today’s data is linked, it is natural to investigate string matching not only in text strings but also in labeled graphs. Indeed, large-scale labeled graphs are becoming ubiquitous in several areas, such as graph databases, graph mining, and computational biology. Applications require sophisticated operations on these graphs, and they often rely on primitives that locate paths whose nodes have labels matching a pattern given at query time. The most basic pattern to search in a graph is a string, and in this article we will prove that performing string matching in graphs is computationally challenging, even on very restricted graph classes.
In graph databases, query languages provide the user with the ability to select paths based on the labels of their nodes or edges. In this way, graph databases explicitly lay out the dependencies between the nodes of data, whereas these dependencies are implicit in classical relational databases [7]. Although a standard query language has not been yet universally adopted (as it occurred for SQL in relational databases), popular query languages such as Cypher [26], Gremlin [46], and SPARQL [43] offer the possibility of specifying paths by matching the labels of their nodes.
In graph mining and machine learning for network analysis, heterogeneous networks specify the type of each node [48]. A basic task related to graph kernels [33] and node similarity [16] is to find paths whose label matches a specific pattern. For example, in the DBLP network [53], the nodes for authors can be marked with the letter ‘A,’ the nodes for papers can be marked with the letter ‘P,’ and edges connect authors to their papers. In this setting, coauthors can be identified by the pattern ‘APA’ if the two ‘A’ letters match two different nodes.
In genome research, the very first step of many standard analysis pipelines of high-throughput sequencing data has been to align sequenced fragments of DNA (called reads) on a reference genome of a species. Further analysis reveals a set of positions where the sequenced individual differs from the reference genome. After years of such studies, there is now a growing dataset of frequently observed differences between individuals and the reference. A natural representation of this gained knowledge is a variation graph in which the reference sequence is represented as a backbone path and variations are encoded as alternative paths [47]. Aligning reads (i.e., string matching) on this labeled graph gives the basis for the new paradigm called computational pan-genomics [15]. There are already practical tools that use such ideas (e.g., [28]).
The string matching problem that we consider in this article is defined as follows. Given an alphabet \(\Sigma\) of symbols, consider a labeled graph \(G=(V,E,L)\), where \((V,E)\) represents a directed or undirected graph and \(L: V \rightarrow \Sigma\) is a function that defines which symbol from \(\Sigma\) is assigned to each node as label. A node labeled with \(\sigma \in \Sigma\) is called a \(\sigma\)-node, and an edge whose endpoints are labeled \(\sigma _1\) and \(\sigma _2\), respectively, is called a \(\sigma _1\sigma _2\)-edge. If G is a directed graph, we say that G is deterministic if, for any two out-neighbors of the same node, their labels are different. In the following, we introduce the acronym 3-DDAG to indicate a deterministic Directed Acyclic Graph (DAG) such that the sum of the indegree and outdegree of each node is at most 3.
Given a pattern string \(P[1..m]\) over \(\Sigma\), we say that P has a match in G if there is a path \(u_1, \ldots , u_m\) in G such that \(P = L(u_1) \cdots L(u_m)\) (we also say that P occurs in G, and that \(u_1, \ldots , u_m\) is an occurrence of P).
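For concreteness, this definition can be checked mechanically. The following is a minimal sketch in Python (the adjacency-set representation and all names are ours, not the article's); note that an occurrence is a walk, so nodes may repeat.

```python
# Hypothetical sketch: checking that a node sequence is an occurrence of P
# in a labeled graph G = (V, E, L). Representation is our own: the graph is
# an adjacency-set dict `adj`, the labeling a dict `L`.

def is_occurrence(path, P, adj, L):
    """Return True iff `path` = u_1..u_m is a walk in the graph and the
    concatenation of its node labels equals the pattern P."""
    if len(path) != len(P):
        return False
    # labels must spell out P
    if any(L[u] != c for u, c in zip(path, P)):
        return False
    # consecutive nodes must be connected by an edge
    return all(v in adj[u] for u, v in zip(path, path[1:]))

# Tiny example: a directed triangle labeled a -> b -> a
adj = {0: {1}, 1: {2}, 2: {0}}
L = {0: 'a', 1: 'b', 2: 'a'}
assert is_occurrence([0, 1, 2], "aba", adj, L)
assert is_occurrence([0, 1, 2, 0], "abaa", adj, L)   # walks may revisit nodes
assert not is_occurrence([0, 1, 2], "abb", adj, L)
```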
1.1 Our Results
We give conditional bounds for the String Matching in Labeled Graphs (SMLG) problem using the Orthogonal Vectors (OV) hypothesis [52]. The latter states that for any constant \(\epsilon \gt 0\), no algorithm can solve in \(O(n^{2-\epsilon }\text{poly}(d))\) time the OV problem: given two sets \(X, Y \subseteq \lbrace 0,1 \rbrace ^d\) such that \(|X| = |Y| = n\) and \(d = \omega (\log n)\), decide whether there exist \(x \in X\) and \(y \in Y\) such that x and y are orthogonal, namely, \(x \cdot y = 0\). We observe that it is common practice to use the Strong Exponential Time Hypothesis (SETH) [34], but since SETH implies the OV hypothesis [52], it suffices to use the OV hypothesis in the bounds, as they hold also for SETH.
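For illustration, the naive \(O(n^2 d)\)-time check is sketched below (our own code, not part of the article); the OV hypothesis asserts precisely that this quadratic dependence on n cannot be improved to \(O(n^{2-\epsilon }\text{poly}(d))\).

```python
# Brute-force Orthogonal Vectors check: try every pair (x, y) and test
# whether all coordinate products vanish, i.e., x . y = 0.

def has_orthogonal_pair(X, Y):
    """Return True iff some x in X and y in Y satisfy x . y = 0."""
    return any(
        all(xh * yh == 0 for xh, yh in zip(x, y))
        for x in X for y in Y
    )

X = [(1, 0, 1), (0, 1, 1)]
Y = [(1, 1, 0), (0, 1, 0)]
assert has_orthogonal_pair(X, Y)            # (1,0,1) . (0,1,0) = 0
assert not has_orthogonal_pair([(1, 1, 1)], [(1, 0, 0)])
```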
First, we consider the SMLG problem on directed graphs. The most restricted such graphs are 3-DDAGs, for which we prove in Section 3 that subquadratic time for exact string matching cannot be achieved unless the OV hypothesis is false.
Next, we consider the SMLG problem on undirected graphs and introduce the zig-zag pattern matching problem in strings, which models searching a string P along a path of an undirected graph. An exact occurrence of P in a text string is found by scanning the text forward for increasing positions in P; however, a zig-zag occurrence of P in a text can be found by partially scanning forward and backward adjacent text positions, as many times as needed (e.g., for an edge \(\lbrace u,v\rbrace\) with \(L(u) = \mathtt {a}\) and \(L(v)=\mathtt {b}\), all patterns of the form \(\mathtt {a}, \mathtt {a} \mathtt {b}, \mathtt {a} \mathtt {b} \mathtt {a}, \mathtt {a} \mathtt {b} \mathtt {a} \mathtt {b}, \ldots\) occur starting from u). We prove in Section 4 the following result.
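To make the zig-zag notion concrete, here is a simple dynamic-programming sketch over a bidirectional text string (our own illustration, not from the article). It runs in \(O(|T|\,|P|)\) time, in line with the quadratic lower bound for this case.

```python
def zigzag_match(T, P):
    """Return True iff P has a zig-zag occurrence in the undirected path
    whose nodes are labeled by the characters of T: a sequence of text
    positions p_1..p_m with |p_{k+1} - p_k| = 1 and T[p_k] = P[k]."""
    n, m = len(T), len(P)
    # cur = set of text positions where a match of the current prefix can end
    cur = {i for i in range(n) if T[i] == P[0]}
    for k in range(1, m):
        cur = {j for i in cur for j in (i - 1, i + 1)
               if 0 <= j < n and T[j] == P[k]}
        if not cur:
            return False
    return True

assert zigzag_match("ab", "ababab")   # bounce between the two nodes
assert zigzag_match("abc", "aba")     # scan forward, then back
assert not zigzag_match("abc", "ac")  # positions must be adjacent
```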
In this way, our results cover arbitrary graphs. Interpreting the graphs from Theorem 1.2 as directed, we observe that they have nodes with both indegree 2 and outdegree 2. Looking at Theorem 1.1, we observe that it involves directed graphs with both nodes of indegree at most 1 and outdegree 2, and nodes of outdegree at most 1 and indegree 2. Thus, the only uncovered cases are directed graphs in which every node has indegree at most 1, and directed graphs in which every node has outdegree at most 1. For such graphs, observe that their edges can be decomposed into forests of directed trees (arborescences), whose roots may be connected in a directed cycle (at most one cycle per forest). We show in Section 5.1 that the Knuth-Morris-Pratt (KMP) algorithm [36] can be easily extended to solve exact string matching for these special directed graphs in linear time, thus completing the picture of the complexity of the SMLG problem.
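The following sketch (our own, not the article's Section 5.1 algorithm) illustrates the basic idea for a single arborescence: run the KMP automaton along a depth-first traversal, carrying the automaton state down each branch. This is well defined because every node has a unique parent, so the state at a node depends only on its root-to-node path.

```python
# KMP along the downward paths of an arborescence of labeled nodes.

def failure(P):
    """Standard KMP failure function of P."""
    f = [0] * len(P)
    k = 0
    for i in range(1, len(P)):
        while k and P[i] != P[k]:
            k = f[k - 1]
        if P[i] == P[k]:
            k += 1
        f[i] = k
    return f

def tree_match(root, children, label, P):
    """Return True iff P occurs along a downward path of the tree."""
    f = failure(P)
    stack = [(root, 0)]        # (node, automaton state before reading node)
    while stack:
        u, k = stack.pop()
        while k and label[u] != P[k]:
            k = f[k - 1]
        if label[u] == P[k]:
            k += 1
            if k == len(P):
                return True
        stack.extend((v, k) for v in children.get(u, ()))
    return False

# root 0 ('a') with two branches: 0 -> 1 ('b') -> 2 ('a'), and 0 -> 3 ('c')
children = {0: [1, 3], 1: [2]}
label = {0: 'a', 1: 'b', 2: 'a', 3: 'c'}
assert tree_match(0, children, label, "aba")
assert not tree_match(0, children, label, "cb")
```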
1.2 History and Implications
The idea of extending the problem of string matching to graphs, as given in SMLG, is not new. If the nodes \(u_1, \ldots , u_m\) are required to be distinct (i.e., to form a simple path), the problem is NP-hard, as it solves the well-known Hamiltonian Path problem; this requirement is therefore dropped. The SMLG problem was studied more than 25 years ago as a search problem for hypertext by Manber and Wu [38]. The history of key contributions is given in Table 1, where a common feature of the reported bounds is the appearance of the quadratic term \(m \, |E|\) (except for some special cases). Specifically, Amir et al. [5, 6] gave the first quadratic time solution for exact string matching in \(O(N + m \cdot |E|)\) time, where \(N = \sum _{u \in V} |L(u)|\).
\(V,\) set of nodes; \(E,\) set of edges; \(occ,\) number of matches for the pattern in the graph; \(m,\) pattern length; \(N,\) total length of text in all nodes; (1), errors only in the pattern; (2), errors in the graph; (3), matches span only one edge. The two rows highlighted in bold report the best known bounds for exact and approximate string matching, respectively.
In the approximate matching case, allowing errors in the graph makes the problem NP-hard [6], so onward we consider errors only in the pattern. In that case, the quadratic cost of approximate matching in graphs is asymptotically optimal under SETH since (i) it solves approximate string matching as a special case, since a graph consisting of just one directed path of \(|E|+1\) nodes and \(|E|\) edges is a text string of length \(n=|E|+1\), and (ii) it has been recently proved that the edit distance of two strings of length n cannot be computed in \(O(n^{2-\epsilon })\) time, for any constant \(\epsilon \gt 0\), unless SETH is false [10]. This conditional lower bound explains why the \(O(m|E|)\) barrier has been difficult to cross in the approximate case. Rautiainen and Marschall [45] and Jain et al. [35] recently gave the best bound for errors in the pattern only, \(O(N + m \cdot |E|)\) time, the same as for exact string matching. The two best results for exact and approximate pattern matching, both taking quadratic time in the worst case, are highlighted in Table 1.
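The quadratic term \(m\,|E|\) arises naturally from the folklore dynamic program behind these bounds. The following sketch (our own formulation, exact matching only, not the article's algorithm) keeps, for each pattern prefix, the set of nodes at which a match of that prefix can end, relaxing every edge once per pattern position.

```python
# O(m |E|)-style dynamic program for exact string matching in a
# node-labeled directed graph, in the spirit of Amir et al.

def smlg(P, nodes, edges, label):
    """Return True iff P occurs along some walk of the directed graph."""
    # cur = nodes at which a match of the current prefix of P can end
    cur = {u for u in nodes if label[u] == P[0]}
    for c in P[1:]:
        cur = {v for (u, v) in edges if u in cur and label[v] == c}
        if not cur:
            return False
    return True

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]
label = {0: 'a', 1: 'b', 2: 'a'}
assert smlg("abaab", nodes, edges, label)   # walks may revisit nodes
assert not smlg("bb", nodes, edges, label)
```

Each of the \(m-1\) rounds scans all edges once, giving the \(O(m \cdot |E|)\) bound (plus the cost of reading the labels).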
In this scenario and the application domains mentioned at the beginning, our results have a number of implications:
•
Although we can explain the complexity of approximate string matching in graphs, not much is known about the complexity of exact string matching in graphs. The classical exact string matching can be solved in linear time [36], so one could expect the corresponding problem on graphs to be easier than approximate string matching. A lower bound (namely, NP-hardness, as mentioned earlier) exists only when the pattern is restricted to match only simple paths in the graph. Extensions of this type of matching for special graph classes were studied in the work of Limasset et al. [37]. Here we study the general case, where paths can pass through nodes multiple times. Somewhat surprisingly, Theorems 1.1 and 1.2 imply that exact and approximate pattern matching are equally hard in graphs, even if they are 3-DDAGs.
•
Our results imply that the algorithm for directed graphs by Amir et al. [5, 6] is essentially the best we can hope for in terms of asymptotic bounds, unless the OV hypothesis is false. This also applies to the case of undirected graphs, by the simple transformation in which each edge \(\lbrace u,v\rbrace\) is replaced by the pair of arcs \((u,v)\) and \((v,u)\). Note that we also need Theorem 1.2 to explicitly state that this is the best possible also for undirected graphs of maximum degree 2. To complete the picture, we show how to get linear time for the aforementioned special cases of directed graphs in which every node has indegree at most 1, or every node has outdegree at most 1.
•
Our results also explain why it has been difficult to find indexing schemes for fast exact string matching in graphs, with other than best-case or average-case guarantees [27, 49], except for limited search scenarios [50]. They complement recent findings about Wheeler graphs [3, 27, 31]. Wheeler graphs are a class of graphs admitting an index structure that supports linear time exact pattern matching. Gibney and Thankachan [31] show that it is NP-complete to recognize whether a (non-deterministic) DAG is a Wheeler graph. Alanko et al. [3] give a linear time algorithm for recognizing whether a deterministic automaton is a Wheeler graph. They also give an example where the minimum size Wheeler graph is exponentially smaller than an equivalent deterministic automaton. Theorem 1.1 shows that converting an arbitrary deterministic DAG into an equivalent (not necessarily minimum size) Wheeler graph requires at least quadratic time unless the OV hypothesis is false; moreover, a later refinement of this result [24] shows that exponential time for the conversion is needed under the OV hypothesis. In particular, the 3-DDAG obtained in the reduction from OV in the proof of Theorem 1.1 is not a Wheeler graph.
•
We describe a simple transformation in Section 5.2 that lets us view our 3-DDAG and the pattern P as two Deterministic Finite Automata (DFAs), so that our SMLG problem reduces to the intersection emptiness problem for the string sets recognized by these two DFAs. This highlights a connection between the two problems, and immediately provides a quadratic conditional lower bound using OV for the latter problem. However, this might not be the best that can be obtained for the intersection emptiness problem, as ongoing work attempts to prove a quadratic lower bound, under SETH [51], already when the two DFAs are trees. Nevertheless, our algorithm from Section 5.1 shows that intersection emptiness between a tree and a chain of nodes is solvable in linear time.
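For intuition, intersection emptiness of two DFAs can be decided by breadth-first search on the product automaton in time proportional to the product of their sizes. A sketch follows (our own code, with partial transition maps; none of the names come from the article).

```python
from collections import deque

def intersection_nonempty(delta1, start1, acc1, delta2, start2, acc2, alphabet):
    """DFAs given as partial transition dicts (state, char) -> state.
    Return True iff the two recognized languages share a string."""
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in acc1 and q in acc2:
            return True          # a common accepted string exists
        for c in alphabet:
            nxt = (delta1.get((p, c)), delta2.get((q, c)))
            if None not in nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# A1 accepts strings over {a,b} ending in 'a'; A2 accepts exactly "ba"
d1 = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 0}
d2 = {(0, 'b'): 1, (1, 'a'): 2}
assert intersection_nonempty(d1, 0, {1}, d2, 0, {2}, "ab")   # "ba" ends in 'a'
# A3 accepts exactly "bb", which does not end in 'a'
d3 = {(0, 'b'): 1, (1, 'b'): 2}
assert not intersection_nonempty(d1, 0, {1}, d3, 0, {2}, "ab")
```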
Our reductions share some similarities with those for string problems [1, 8, 9, 10, 11, 13, 41]. The closest connection is with a conditional hardness of several forms of regular expression matching [9]. We describe these similarities in Section 2, highlighting the main limitations of this reduction scheme. (For the interested reader, we went through the details of such a reduction in an early version of this work [21].) Later we explain why our strategy is crucial in achieving stronger results, such as covering the case of deterministic directed graphs with bounded degree. This strategy yields a graph of small degree and enables local merging of non-deterministic subgraphs into deterministic counterparts. This locality feature of our reduction is crucial, since converting a Non-Deterministic Finite Automaton (NFA) into a DFA can take exponential time [44]. Finally, although this reduction works also for undirected graphs of small degree, it does not cover undirected graphs of degree 2. For this case (zig-zag matching in a bidirectional string), we need a more intricate reduction, as the underlying graph has less structure.
2 Overview of the Reduction and Connections with Regular Expressions
As mentioned in Section 1, our lower bounds have deep connections with previous results on regular expression matching. We use these connections to conceptually introduce some internal components of our reductions before proceeding to their formal definitions. Additionally, this allows us to point out why a simple modification of an earlier reduction is not sufficient for our purposes.
Backurs and Indyk [9] analyzed which types of regular expressions are hard to match, and one of their lower bounds can be adapted to address the SMLG problem in the case of a non-deterministic DAG. The type of regular expressions in question is \(\mid \cdot \mid\), that is, an “or” of concatenations of “or”s. An example of such a regular expression is \([(a|b)(b|c)]|[(a|c)b]\). Given a regular expression p of this type and a text t, determining whether or not a substring of t can be generated by p requires quadratic time, unless there exists a subquadratic time algorithm for OV. The reduction adopted to prove this result consists in defining the text \(t = t_1\texttt {2}t_2\texttt {2}\ldots \texttt {2}t_n\) as the concatenation of all the binary vectors \(t_1,\dots ,t_n\) of X, placing the separator character \(\texttt {2}\) between them. Regular expression \(p = G_W^{(1)} \mid G_W^{(2)} \mid \cdots \mid G_W^{(n)}\) is an “or” of n gadgets, one for each vector in the set Y. Moreover, gadget \(G_W^{(j)}\) is designed in such a way that it accepts substring \(t_i\) if and only if the i-th vector of X and the j-th vector of Y are orthogonal. Hence, it is fairly straightforward to prove that a substring of t is accepted by p if and only if there exists a pair of orthogonal vectors in X and Y, respectively.
The idea behind this reduction can be modified for the SMLG problem as follows. In the SMLG problem, we need to construct a pattern P and a graph G such that P has a match in G if and only if there is a vector in X orthogonal to a vector in Y. Consider the NFA that accepts the same language as the regular expression p defined earlier, and call \(\mathtt {b}\) and \(\mathtt {e}\) its start and accepting states, respectively.
We can enrich this automaton with a universal gadget \(G_{U}^{(j)}\), which accepts any binary vector of length d. We place \(n-1\) universal gadgets on each side of the \(G_W^{(j)}\)’s, to allow P to shift, as shown in Figure 1. Pattern P is again defined as the concatenation of the vectors in X with separator characters. (Again, see the next section for the formal definition.) Because we placed only \(n-1\) universal gadgets on each side, pattern P matches in G if and only if a subpattern of P matches in one of the \(G_W^{(j)}\) gadgets, which can happen if and only if there exists a pair of orthogonal vectors.
Fig. 1.
Observe that this reduction builds a non-deterministic graph because of the out-neighbors of node \(\mathtt {b}\). This non-deterministic feature appears to be inherent to this type of construction. Our contribution is a heavy restructuring of this reduction, whose two main ideas can be intuitively summarized as follows. First, instead of placing the \(G_W^{(j)}\) gadgets on a “column,” we place them on a “row.” We then place the left universal gadgets on a “row” on top of this one, the right universal gadgets on a “row” below this one, and force the pattern to have a match starting in the top row and ending in the bottom row. See Section 3.3 and Figure 4 presented later in the article. This allows us to restrain the non-deterministic parts of the graph to nodes having only two out-neighbors with the same label. Second, we then show how to remove this non-determinism by locally merging the parts of the graph labeled with the same letter while still maintaining the properties of the graph. See Section 3.4 and Figure 6 presented later in the article.
3 Deterministic DAGs
In this section, we reduce the OV problem to the SMLG problem for the restricted case of 3-DDAGs. In this scenario, 3-DDAGs are the most restricted case, since under any further restriction the SMLG problem can be solved in linear time (see Section 5.1).
Given an OV instance with sets \(X = \lbrace x_1, \ldots , x_n\rbrace\) and \(Y = \lbrace y_1, \ldots , y_{n}\rbrace\) of d-dimensional binary vectors, we show how to build a pattern P and a 3-DDAG G such that P has a match in G if and only if there exists a vector in X orthogonal to one in Y. We first describe how to build P and how to obtain a directed graph whose nodes are labeled with a constant-sized alphabet. Then we discuss how to turn such a graph into the 3-DDAG G.
3.1 Pattern
Pattern P is over the alphabet \(\Sigma = \lbrace \mathtt {b},\mathtt {e},\mathtt {0},\mathtt {1} \rbrace\), has length \(|P| = O(nd)\), and can be built in \(O(nd)\) time from the first set of vectors \(X = \lbrace x_1, \ldots , x_n\rbrace\). Namely, we define
\[P = \mathtt {b}\mathtt {b} \, P_{x_1} \, \mathtt {e}\mathtt {b} \, P_{x_2} \, \mathtt {e}\mathtt {b} \cdots \mathtt {e}\mathtt {b} \, P_{x_n} \, \mathtt {e}\mathtt {e},\]
where \(P_{x_i}\) is a string of length d that is associated with \(x_i \in X\), for \(1 \le i \le n\). The h-th symbol of \(P_{x_i}\) is either \(\mathtt {0}\) or \(\mathtt {1}\), for each \(h \in \lbrace 1,\dots ,d\rbrace\), such that \(P_{x_i}[h] = \mathtt {1}\) if and only if \(x_i[h] = 1\). We thus view the vectors in X as subpatterns \(P_{x_i}\), which are concatenated by placing the separator characters \(\mathtt {e} \mathtt {b}\) between them. Note that P starts with \(\mathtt {b} \mathtt {b}\) and ends with \(\mathtt {e} \mathtt {e}\): these strings are found nowhere else in P, thus marking its beginning and its end.
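As a sanity check, the construction of P can be sketched in a few lines (the string representation is ours): subpatterns joined by the separator \(\mathtt {e}\mathtt {b}\), framed by \(\mathtt {b}\mathtt {b}\) and \(\mathtt {e}\mathtt {e}\).

```python
# Build the pattern P = bb P_{x_1} eb P_{x_2} eb ... eb P_{x_n} ee,
# where P_{x_i}[h] = '1' iff x_i[h] = 1.

def build_pattern(X):
    subs = [''.join('1' if bit else '0' for bit in x) for x in X]
    return 'bb' + 'eb'.join(subs) + 'ee'

X = [(1, 0), (0, 1)]
assert build_pattern(X) == 'bb10eb01ee'   # |P| = O(nd)
```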
3.2 Graph Gadgets
The gadget implementing the main logic of the reduction is a directed graph \(G_W = (V_W,E_W,L_W)\), illustrated in Fig. 2. Starting from the second set of vectors Y, set \(V_W\) can be seen as n disjoint groups of nodes \(V_W^{(1)}, V_W^{(2)}, \ldots , V_W^{(n)}\) (plus some extra nodes), where the nodes in \(V_W^{(j)}\) are uniquely associated with vector \(y_j \in Y\), for \(1 \le j \le n\). The corresponding induced subgraph \(G_W^{(j)} = (V_W^{(j)}, E_W^{(j)})\) will contain an occurrence of a subpattern \(P_{x_i}\) if and only if \(x_i \cdot y_j = 0\). We give more details in the following.
Fig. 2.
The nodes in \(V_W^{(j)}\) are defined as follows. For \(1 \le h \le d\), we consider entry \(y_j[h]\) of vector \(y_j \in Y\). If \(y_j[h] = 1\), we place just a \(\mathtt {0}\)-node \(w^0_{j h}\) to indicate that we only accept \(P_{x_i}[h] = \mathtt {0}\) for this h coordinate. Instead, if \(y_j[h] = 0\), we place both a \(\mathtt {0}\)-node \(w^0_{j h}\) and a \(\mathtt {1}\)-node \(w^1_{j h}\) to indicate that the value of \(P_{x_i}[h]\) does not matter. The nodes in \(V_W^{(j)}\) are preceded by a special begin \(\mathtt {b}\)-node \(b_W^{(j)}\) and succeeded by a special end \(\mathtt {e}\)-node \(e_W^{(j)}\). The overall nodes are thus \(V_W = \bigcup _{1 \le j \le n} (V_W^{(j)} \cup \lbrace b_W^{(j)},e_W^{(j)}\rbrace)\), and it holds that \(|V_W| = O(nd)\).
As for the edges in \(E_W^{(j)}\), they properly connect the nodes inside each group \(V_W^{(j)}\). Specifically, node \(b_W^{(j)}\) is connected to \(w^0_{j 1}\) and, if it exists, to \(w^1_{j 1}\). Additionally, we place edges connecting both nodes \(w^0_{j d}\) and \(w^1_{j d}\) (if it exists) to node \(e_W^{(j)}\). Moreover, there is an edge for every pair of nodes that are consecutive in terms of the h coordinate, for \(1 \le h \lt d\) (e.g., \(w^1_{j h}\) is connected to \(w^0_{j \, h+1}\)). The overall edges are thus \(E_W = \bigcup _{1 \le j \le n} E_W^{(j)}\), where \(|E_W| = O(nd)\).
In this way, we define the directed graph \(G_W = (V_W,E_W,L_W)\), which can be built in \(O(nd)\) time from set Y and consists of n connected components \(G_W^{(j)}\), one for each vector \(y_j \in Y\).
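The construction can be sketched as follows (our own data structures, not from the article): for each \(y_j\) we chain columns \(h = 1..d\) between a \(\mathtt {b}\)-node and an \(\mathtt {e}\)-node, where column h always carries a \(\mathtt {0}\)-node and carries a \(\mathtt {1}\)-node only when \(y_j[h] = 0\). Hence, \(P_{x_i}\) can traverse \(G_W^{(j)}\) exactly when no coordinate has \(x_i[h] = y_j[h] = 1\), that is, when \(x_i \cdot y_j = 0\).

```python
# Build gadget G_W from the set Y. Nodes are tuples: ('b', j), ('e', j),
# and (bit, j, h) for column nodes; labels and an edge set are returned.

def build_gw(Y):
    label, edges = {}, set()
    for j, y in enumerate(Y):
        b, e = ('b', j), ('e', j)
        label[b], label[e] = 'b', 'e'
        prev = [b]
        for h, bit in enumerate(y):
            # a 0-node always; a 1-node only when y[h] = 0
            col = [('0', j, h)] + ([('1', j, h)] if bit == 0 else [])
            for v in col:
                label[v] = v[0]
                edges.update((u, v) for u in prev)
            prev = col
        edges.update((u, e) for u in prev)
    return label, edges

# y = (0, 1): column 1 has both a 0- and a 1-node, column 2 only a 0-node,
# so exactly the subpatterns '00' and '10' can cross from b to e
label, edges = build_gw([(0, 1)])
assert (('1', 0, 0), ('0', 0, 1)) in edges
assert ('1', 0, 1) not in label       # no 1-node where y[h] = 1
```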
We observe that pattern occurrences in \(G_W\) have some useful combinatorial properties. The following lemma is an immediate observation, which follows from the fact that each \(G_W^{(j)}\) is acyclic and not connected to any other \(G_W^{(j^{\prime })}\).
The following lemma instead relates the occurrence of a subpattern to the OV problem.
In the following, we will also use gadget \(G_U = (V_U,E_U,L_U)\), the degenerate case of \(G_W\) with \(2n-2\) (instead of just n) connected components \(G_U^{(j)}\) where, for all \(1 \le j \le 2n-2\) and \(1 \le h \le d\), we place both a \(\mathtt {0}\)-node and a \(\mathtt {1}\)-node: we call these two nodes \(u^0_{j h}\) and \(u^1_{j h}\), respectively, to distinguish them from those in \(G_W\). Moreover, every \(\mathtt {e}\)-node of this gadget is connected with the next \(\mathtt {b}\)-node, in terms of the j coordinate (Fig. 3). As can be seen, any subpattern \(P_{x_i}\) occurs in \(G_U\), so it can be used as a “jolly” (wildcard) gadget.
Fig. 3.
3.3 Non-Deterministic Graph
A possible approach is based on suitably combining one instance of gadget \(G_W\) and two instances of gadgets \(G_U\), named \(G_{U1}\) and \(G_{U2}\). The idea is that when \(x_i \cdot y_j = 0\), we want P to occur in G, so that the following three conditions hold:
•
Instance \(G_{U1}\): \(P_{x_1}\) occurs in \(G_{U1}^{(n-1+j-(i-1))}\), ..., \(P_{x_{i-1}}\) occurs in \(G_{U1}^{(n-1+j-1)}\).
•
Instance \(G_W\): \(P_{x_i}\) occurs in \(G_W^{(j)}\).
•
Instance \(G_{U2}\): \(P_{x_{i+1}}\) occurs in \(G_{U2}^{(j)}\), ..., \(P_{x_{n}}\) occurs in \(G_{U2}^{(j+n-i-1)}\).
However, when \(x_i \cdot y_j \ne 0\), we do not want \(P_{x_i}\) to occur in \(G_W^{(j)}\). We can suitably link the instances \(G_W\), \(G_{U1}\), and \(G_{U2}\) so that we get the preceding conditions. We connect the \(\mathtt {e}\)-nodes in \(G_{U1}\) to \(\mathtt {b}\)-nodes in \(G_W\) and connect the \(\mathtt {e}\)-nodes in \(G_W\) to \(\mathtt {b}\)-nodes in \(G_{U2}\). Additionally, we place extra starting \(\mathtt {b}\)-nodes and extra ending \(\mathtt {e}\)-nodes, to properly match the \(\mathtt {b} \mathtt {b}\) prefix and the \(\mathtt {e} \mathtt {e}\) suffix of P, respectively. More precisely, for every \(\mathtt {b}\)-node in \(G_{U1}\) and \(G_W,\) we add a new \(\mathtt {b}\)-node as an in-neighbor of it, and for every \(\mathtt {e}\)-node in \(G_{W}\) and \(G_{U2}\), we add a new \(\mathtt {e}\)-node as an out-neighbor of it. Such construction is depicted in Figure 4.
Fig. 4.
However, even if \(G_W\), \(G_{U1}\), and \(G_{U2}\) are deterministic, their resulting composition is not, because of the out-neighbors of the \(\mathtt {e}\)-nodes. In the following, we show how to obtain a deterministic graph by suitably merging \(G_W\) with portions of \(G_U\).
3.4 Deterministic Graph
To obtain a deterministic DAG, we need to suitably combine one instance of gadget \(G_W\) with the two instances \(G_{U1}\) and \(G_{U2}\) (recall that both \(G_{U1}\) and \(G_{U2}\) have instances of gadget \(G_U^{(j)}\), for all \(1 \le j \le 2n-2\)). Although \(G_{U2}\) will be used as is, \(G_{U1}\) needs to be partially merged with \(G_{W}\) to obtain determinism. We start building our final graph G from \(G_W\) by adding parts of \(G_{U1}\) when needed, obtaining a deterministic graph called \(G_{U1W}\), as shown in Figure 5. Consider subgraph \(G_W^{(j)}\) and assume that the first position in which the \(\mathtt {1}\)-node is missing is h. We place a partial version of subgraph \({G}_{U1}^{(j^{\prime })}, j^{\prime }:=n-1+j\), by adding to the graph the nodes and edges of \({G}_{U1}^{(j^{\prime })}\) that are located between position \(h+1\) and node \({e}_{U1}^{(j^{\prime })}\) (included). If \(h=d\), we place only node \(e_{U1}^{(j^{\prime })}\). We also place \(\mathtt {1}\)-node \(u^1_{jh}\) and connect the \(\mathtt {0}\)-node and the \(\mathtt {1}\)-node (if any) of \(G_W^{(j)}\) in position \(h-1\) to it (if \(h \gt 1\)), or we connect \({b}_{W}^{(j)}\) to it (if \(h=1\)). Moreover, we connect node \(u^1_{jh}\) to the first \(\mathtt {0}\)- and \(\mathtt {1}\)-node of partial \(G_{U1}^{(j^{\prime })}\). If \(h=d\), we connect \(u^1_{j h}\) to \(e_{U1}^{(j^{\prime })}\). Then we scan \(G_W^{(j)}\) from left to right looking for those positions \(h^{\prime }\), \(h \le h^{\prime } \lt d\), such that there is no \(\mathtt {1}\)-node in position \(h^{\prime }+1\). We connect the \(\mathtt {0}\)-node and the \(\mathtt {1}\)-node (if any) of \(G_W^{(j)}\) in position \(h^{\prime }\) to the \(\mathtt {1}\)-node of \(G_{U1}^{(j^{\prime })}\) in position \(h^{\prime }+1\). Finally, we place edge \(({e}_{U1}^{(j^{\prime })}, {b}_{W}^{(j+1)})\).
To complete the merging task, we apply the preceding modification to all \({G}_W^{(j)}\), for \(1 \le j \le n-1\), and thus obtain gadget \(G_{U1W}\).
Fig. 5.
At this point, we place gadget \(G_{U2}\) and connect \(G_{U1W}\) to it by placing edges \((e_W^{(j)}, b_{U2}^{(j)})\), for all \(1 \le j \le n\). Additionally, for every \(\mathtt {b}\)-node of \(G_{U1W}\), we place an additional \(\mathtt {b}\)-node as in-neighbor. We do the same for every \(\mathtt {e}\)-node of \(G_{U2}\), placing an \(\mathtt {e}\)-node as out-neighbor. Adding subgraphs \(G_{U1}^{(1)}, \ldots , G_{U1}^{(n-1)}\) with one additional \(\mathtt {b}\)-node as in-neighbor of their \(\mathtt {b}\)-nodes, and connecting the \(\mathtt {e}\)-node of \(G_{U1}^{(n-1)}\) to the \(\mathtt {b}\)-node of \(G_{W}^{(1)}\), completes the transformation into the desired deterministic DAG, which we call G. Figure 6 gives an overall picture of G.
Fig. 6.
It is easy to verify that every \(\mathtt {b}\)- and \(\mathtt {e}\)-node in G has at most two out-neighbors and, when there are two, they have different labels. This shows that graph G is deterministic.
The deterministic DAG G has a crucial property which, combined with Lemma 3.1 and Lemma 3.2, is essential to ensure the correctness of our reduction.
We conclude this section by proving the following weaker version of Theorem 1.1. In the next two sections, we show how to obtain the full proof of Theorem 1.1, by transforming G to have maximum sum of indegree and outdegree 3, and how to reduce the alphabet to binary.
3.5 Reduced Degree
In this section, we show how to transform the deterministic graph G from the previous section to be a 3-DDAG.
Observe that every node in G can have at most two in-neighbors and two out-neighbors. An emblematic case is that of four nodes, say v, w, \(v^{\prime }\), and \(w^{\prime }\), with edges \((v,w)\), \((v,w^{\prime })\), \((v^{\prime },w)\), and \((v^{\prime },w^{\prime })\). To reduce the outdegree of v and \(v^{\prime }\) and the indegree of w and \(w^{\prime }\) to 1, the idea is to add two dummy nodes \(\bar{v}\) and \(\bar{w}\) connected by an edge \((\bar{v},\bar{w})\), and then replace the four preceding edges with \((v,\bar{v})\), \((v^{\prime },\bar{v})\), \((\bar{w},w)\), and \((\bar{w},w^{\prime })\). The dummy nodes can be labeled, for example, with \(\mathtt {0}\), and then one can make a corresponding modification in the pattern. One needs to apply such transformations between any two consecutive columns of G.
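This basic step can be sketched as follows (our own representation, not from the article): the four edges of the complete bipartite subgraph between \(\lbrace v, v^{\prime }\rbrace\) and \(\lbrace w, w^{\prime }\rbrace\) are rerouted through the two dummy nodes.

```python
# Replace a K_{2,2} of edges between {v, v'} and {w, w'} with two dummy
# nodes vbar, wbar, so that v, v' keep outdegree 1 and w, w' indegree 1.

def reduce_k22(edges, v, vp, w, wp, vbar, wbar):
    edges = set(edges) - {(v, w), (v, wp), (vp, w), (vp, wp)}
    edges |= {(v, vbar), (vp, vbar), (vbar, wbar), (wbar, w), (wbar, wp)}
    return edges

E = {('v', 'w'), ('v', "w'"), ("v'", 'w'), ("v'", "w'")}
E2 = reduce_k22(E, 'v', "v'", 'w', "w'", 'vb', 'wb')
outdeg = {u: sum(1 for (a, _) in E2 if a == u) for u in ('v', "v'", 'vb', 'wb')}
assert outdeg == {'v': 1, "v'": 1, 'vb': 1, 'wb': 2}
```

As the text notes, the pattern must be padded with the dummy labels accordingly, so that occurrences are preserved.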
To be more precise, we need to consider four node configurations. The first three, shown in Figure 7, are slightly simpler than the fourth one, in Figure 8. The final result is achieved by applying these adjustments among (sequences of) consecutive columns of G, observing that these four cases cover all possible configurations in the graph.
Fig. 7.
Fig. 8.
Since we always insert a pair of new \(\mathtt {0}\)-nodes, or a \(\mathtt {b}\)- and an \(\mathtt {e}\)-node, between prescribed columns of G, we can analogously modify the pattern to match the new structure of G.
The encoding that we present next to obtain a binary alphabet can be safely applied after reducing the degree of the nodes of G with this technique.
3.6 Binary Alphabet
The size of the alphabet used until this point is 4. One can reduce the alphabet size to binary using an encoding \(\alpha\) that maps each character of \(\Sigma\) to a binary string of length 2 or 4, applied to both the pattern and the graph. Given any string \(x = x[1..m]\), we define its binary encoding \(\alpha (x) := \alpha (x[1]) \cdots \alpha (x[m])\). In the graph, we replace each \(\sigma\)-node with a directed path of as many nodes as characters in \(\alpha (\sigma)\).
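The graph side of this step can be sketched as follows. The codeword table below is a hypothetical placeholder (the article's actual codewords are fixed in its displayed encoding); the point illustrated is only the mechanics: each \(\sigma\)-node becomes a chain of \(|\alpha (\sigma)|\) single-character nodes, and each original edge links the end of one chain to the start of the next.

```python
# Apply a character-level binary encoding alpha to a node-labeled graph.
# NOTE: these codewords are HYPOTHETICAL placeholders, chosen only to have
# lengths 2 and 4 as in the article; they are not the article's encoding.
alpha = {'b': '01', 'e': '10', '0': '0001', '1': '0111'}

def encode_graph(label, edges):
    new_label, new_edges = {}, set()
    for u, c in label.items():
        # replace node u by a chain of len(alpha[c]) nodes (u, 0), (u, 1), ...
        chain = [(u, t) for t in range(len(alpha[c]))]
        for node, bit in zip(chain, alpha[c]):
            new_label[node] = bit
        new_edges.update(zip(chain, chain[1:]))
    # an original edge (u, v) now links the end of u's chain to v's start
    new_edges.update(((u, len(alpha[label[u]]) - 1), (v, 0)) for u, v in edges)
    return new_label, new_edges

label, edges = {0: 'b', 1: '0'}, {(0, 1)}
nl, ne = encode_graph(label, edges)
assert len(nl) == 2 + 4                  # chains of length 2 and 4
assert ((0, 1), (1, 0)) in ne            # end of b-chain to start of 0-chain
```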
To make this encoding work, we need to additionally make the pattern start with characters \(\mathtt {e} \mathtt {b} \mathtt {b}\) (instead of just \(\mathtt {b} \mathtt {b}\)) and end with characters \(\mathtt {e} \mathtt {e} \mathtt {b}\) (instead of just \(\mathtt {e} \mathtt {e}\)) to exploit the properties of sequence \(\mathtt {e} \mathtt {b}\). Moreover, this entails that also in the graph we have to place and connect a new \(\mathtt {e}\)-node to each \(\mathtt {b}\)-node used to mark the beginning of a viable match, and in the same manner, we need to add a new \(\mathtt {b}\)-node after every \(\mathtt {e}\)-node used to mark the end of a match.
We can now assume that the graph and the pattern have been modified as described in the previous section, so that the graph has maximum sum of indegree and outdegree 3. The goal is to show that there is a bijection between matches before and after applying such encoding and degree-reduction adjustments.
At this point, we apply the \(\alpha\) encoding, and nodes with labels of length 2 and 4 will be replaced by chains of nodes labeled by single characters each. Note that in graph \(G,\) the two out-neighbors of a node can only be labeled \(\mathtt {0}\) and \(\mathtt {1}\), or \(\mathtt {b}\) and \(\mathtt {e}\); hence this encoding keeps the graph deterministic. We now prove some key properties of the chosen encoding.
Observe that even if we modified the graph to reduce the degree, it still holds that the subgraphs of G where matches of some subpattern can be present are separated by an \(\mathtt {e} \mathtt {b}\)-edge (recall Figure 7(b) and (c)). Thus, the following synchronizing property is useful.
An immediate consequence of Lemma 3.5 is that the encoding preserves the occurrences. Let \(G^{(ex)}\) be the deterministic DAG reduced to have the maximum sum of indegree and outdegree 3, extended with the extra \(\mathtt {b}\)- and \(\mathtt {e}\)-nodes, and let \(P^{(ex)}\) be the pattern corresponding to this reduced graph, extended with the \(\mathtt {b}\) and \(\mathtt {e}\) characters. Let \(\alpha (G^{(ex)})\) denote the graph obtained from \(G^{(ex)}\) by relabeling its nodes with the binary encoding \(\alpha\) of their labels and replacing the nodes whose labels now have length 2 or 4 with directed paths of 2 or 4 nodes, respectively, whose nodes are labeled with single characters.
4 Undirected Graphs: Zig-zag Matching
In this section, we prove Theorem 1.2. To this end, we need to modify the previous reduction, defining a new alphabet, pattern, and graph. The main ideas will be the same, but since the graph will now be a single undirected path, some key changes will be needed. In Section 4.1, we introduce a reduction in which the alphabet has cardinality 6, and in Section 4.2, we show how to reduce the alphabet to binary.
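For context, zig-zag matching itself admits a simple quadratic algorithm: a BFS over pairs (node, pattern position) decides in \(O(|E| \cdot m)\) time whether the pattern matches some walk, which is the complexity the reduction below shows to be essentially optimal. A minimal sketch, with a hypothetical adjacency-list input format:

```python
from collections import deque

def zigzag_match(labels, adj, p):
    """Decide whether pattern p matches some walk of an undirected
    node-labeled graph (nodes and edges may be revisited, i.e., zig-zag
    matching). labels: dict node -> char; adj: dict node -> neighbor list.
    Runs in O(|E| * m) time by BFS over (node, pattern index) states."""
    if not p:
        return True
    m = len(p)
    start = [(v, 0) for v in labels if labels[v] == p[0]]
    seen = set(start)
    q = deque(start)
    while q:
        v, i = q.popleft()
        if i == m - 1:                       # whole pattern spelled
            return True
        for w in adj[v]:                     # extend the walk by one node
            if labels[w] == p[i + 1] and (w, i + 1) not in seen:
                seen.add((w, i + 1))
                q.append((w, i + 1))
    return False
```

On the path a–b–a, for instance, the pattern abab matches by bouncing back and forth, while aa does not.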
4.1 Non-Binary Alphabet
The original alphabet \(\Sigma = \lbrace \mathtt {b},\mathtt {e},\mathtt {0},\mathtt {1} \rbrace\) is replaced with \(\Sigma ^{\prime } = \lbrace \mathtt {b},\mathtt {e},\mathtt {A},\mathtt {B},\mathtt {s},\mathtt {t} \rbrace\). Characters \(\mathtt {0}\) and \(\mathtt {1}\) are encoded as \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) and \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\).
When this encoding is applied, character \(\mathtt {s}\) is used as a separator marking the beginning and the end of the old characters; for example, for \(d = 2\) and \(x_i = (1\;0)\), the encoded subpattern is \(\mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s}\).
A new pattern \(P^{\prime }\) is built by applying this encoding to each of the subpatterns \(P_{x_i}\), thus obtaining new subpatterns \(P^{\prime }_{x_i}\). We then concatenate all the subpatterns \(P^{\prime }_{x_i}\), placing the new character \(\mathtt {t}\) (instead of \(\mathtt {e} \mathtt {b}\)) to separate them. Finally, we place characters \(\mathtt {b} \mathtt {t}\) at the beginning of the new pattern and \(\mathtt {t} \mathtt {e}\) at the end. We have the following example.
Note that for each subpattern we are introducing a constant number of new characters; hence the size of the entire pattern \(P^{\prime }\) remains \(O(nd)\).
An analogous encoding will be applied to the graph. The strategy is to encode \(G_W\) in an undirected path by concatenating subpaths representing each \(G_W^{(j)}\), one after another.
The positions h in which both a \(\mathtt {0}\)- and a \(\mathtt {1}\)-node are present in \(G_W^{(j)}\) are replaced by a path that can be matched both by \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) and \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\). Positions h with only a \(\mathtt {0}\)-node and no \(\mathtt {1}\)-node are encoded instead with a path that can be matched only by \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) (Figure 9). We use \(\mathtt {s}\)-nodes to separate these paths. We denote by \(LG_W^{(j)}\) (Linear \(G_W^{(j)}\)) this linearized version of \(G_W^{(j)}\). Moreover, given subgraph \(G_W^{(j)}\), two new \(\mathtt {t}\)-nodes will mark the beginning and the ending of its encoding. Figure 10 illustrates this transformation for \(G_W^{(j)}\).
Fig. 9.
Fig. 10.
In a similar manner, \(G_U\) is also encoded as a path. We do not need to encode all its \(2n-2\) subgraphs: since the matching path can go through nodes more than once, we only need to encode one of these subgraphs, in the same manner as done for \(G_W^{(j)}\). Let \(LG_U\) be the linearized version of only one of the “jolly” gadgets that were composing the original \(G_U\).
Then, for each \(1 \le j \le n\), we build structure \(LG^{(j)}\) by placing \(\mathtt {t}\)-nodes, \(LG_U\) instances, \(LG_W^{(j)}\), a \(\mathtt {b}\)-node on the left, and an \(\mathtt {e}\)-node on the right, as in Figure 11. In this structure, the \(\mathtt {b}\)-node and the \(\mathtt {e}\)-node delimit the beginning and the end of a viable match for a pattern. The \(\mathtt {t}\)-nodes separate the \(LG_U\) structures from \(LG_W^{(j)}\) and, in general, mark the beginning and the end of a match for a subpattern \(P^{\prime }_{x_i}\). The idea behind \(LG^{(j)}\) is that a match of P can traverse \(LG_U\) from the beginning to the end, backward and forward as many times as needed, before starting a match of some subpattern \(P_{x_i}^{\prime }\) inside \(LG_W^{(j)}\). Notice also that this allows only subpatterns in even positions i to match inside \(LG_W^{(j)}\). We will address this minor issue at the end; see the paragraph following the proof of Lemma 4.3.
Fig. 11.
To construct the final graph \(LG,\) we concatenate all \(LG^{(1)}\), \(LG^{(2)}\), ..., \(LG^{(n)}\) into a single undirected path. Figure 12 gives a picture of the end result.
Fig. 12.
No issues arise regarding the size of the graph, since we are replacing every \(\mathtt {0}\)-node, or every pair of a \(\mathtt {0}\)-node and a \(\mathtt {1}\)-node, with a constant number of new nodes. By construction, the two gadgets \(LG_U\) and \(LG_W^{(j)}\) both have size \(O(d)\), since for each one of the d entries of a vector we place one of the two possible encodings. In \(LG,\) there are n instances of \(LG_W^{(j)}\), each one surrounded by two \(LG_U\) instances. Hence, the total size of the graph remains \(O(nd)\).
To prove the correctness of the reduction, we will show some properties on LG by introducing the following lemmas. We use \(t_lLG_W^{(j)}t_r\) to refer to \(LG_W^{(j)}\) extended with the \(\mathtt {t}\)-nodes on its left and on its right. When referring to the k-th \(\mathtt {s}\)-character in \(P^{\prime }_{x_i}\), we mean the k-th \(\mathtt {s}\)-character found scanning \(P^{\prime }_{x_i}\) from left to right; in the same manner, we refer to the k-th \(\mathtt {s}\)-node in \(LG_W^{(j)}\).
The main difference with the original proof resides in assuming that a match for \(P^{\prime }_{x_i}\) starts at \(t_l\) and ends at \(t_r\). This feature is crucial for the correctness of the reduction and can be safely exploited since, as shown in the following, the \(\mathtt {b}\)- and \(\mathtt {e}\)-nodes guarantee that in case of a match for \(P^{\prime }\) we will cross the \(LG_W^{(j)}\) gadget from left to right at least once.
Since Lemma 4.3 gives us a property that holds only if a subpattern is in an even position, we need to tweak pattern \(P^{\prime }\) to make the reduction work. Indeed, we define two patterns. The first pattern \(P^{\prime (1)}\) is \(P^{\prime }\) itself; the second pattern \(P^{\prime (2)}\) is obtained by swapping each subpattern \(P^{\prime }_{x_i}\) in an odd position with the next subpattern \(P^{\prime }_{x_{i+1}}\) in an even position, for every \(i = 1, 3, \ldots\) . For example, if n is even, we will have the following.
While \(P^{\prime (1)}\) checks the even positions of \(P^{\prime }\), \(P^{\prime (2)}\) checks the odd ones. If n is even, then neither \(P^{\prime (1)}\) nor \(P^{\prime (2)}\) would be able to have a match in LG, since after matching an even number of subpatterns it is not possible to match any \(\mathtt {e}\)-node. In that case, we can simply add a dummy subpattern \(\bar{P} = \mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} \ldots \mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s}\) (with d repetitions of \(\mathtt {A} \mathtt {B} \mathtt {A}\)) at the end of P, as if it were its last subpattern, so that the number of subpatterns becomes odd. Indeed, observe that \(\bar{P}\) corresponds to vector \(\bar{x} = (1 1 \ldots 1)\), which has null product only with vector \(\bar{y} = (0 0 \ldots 0)\). Hence, if \(\bar{y} \not\in Y,\) then \(\bar{P}\) does not have a match in any \(LG^{(j)}\), whereas if \(\bar{y} \in Y,\) every subpattern \(P^{\prime }_{x_i}\) has a match in the \(LG^{(j)}\) built on top of \(\bar{y}\). This means that \(\bar{P}\) does not disrupt our reduction.4
Now we are ready to present the end result.
Theorem 1.2 follows directly from the correctness of these constructions, except for the alphabet size reduction to binary, which we cover in the next section.
4.2 Binary Alphabet
In this section, we explain how to reduce the alphabet from the reduction in Section 4.1 to be binary. For this purpose, we apply the following encoding \(\alpha\) to the characters:
Denote by \(\alpha (P^{\prime })\) and \(\alpha (LG)\) the encoded pattern and graph, respectively. Note that when applying the encoding to LG, we replace each \(\sigma\)-node with a sequence of nodes labeled with the characters of the encoding of \(\sigma\). Thus, we maintain the property that the label of each node is a single character. To prove correctness, it suffices to prove the following two lemmas.
5 Additional Results
5.1 A Linear Time Algorithm for Almost Trees
Directed pseudo forests are directed graphs whose nodes have outdegree at most 1, and their transposes are graphs whose nodes have indegree at most 1. Both of these types of graphs are structures lying between our conditional hardness results and the linear time solvable string matching case. Such structures are forests of directed trees whose roots may be connected in a directed cycle (at most one cycle per forest).
Exact string matching in a tree whose edges are directed from root to leaves (graphs whose nodes have indegree at most 1) can be solved in linear time. One such algorithm [2] works on constant alphabet, but there is a folklore alphabet-independent solution through a simple variation of the KMP algorithm [36]: recall that after linear time preprocessing of the pattern \(P[1..m]\), KMP scans through the text string T, updating index i in the pattern in amortized constant time to find the longest prefix \(P[1..i]\) that matches suffix \(T[j-i+1..j]\) of the current position j in the text. One can simulate this algorithm on a tree by just storing the current value of index i at each node before branching.
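A minimal sketch of this folklore variation, assuming the tree is given as a children map (names hypothetical): the KMP state is simply carried along the DFS, so each branch resumes the scan from the state stored at its parent.

```python
def failure(p):
    """Standard KMP failure function: f[i] is the length of the longest
    proper border of p[:i+1]."""
    f, k = [0] * len(p), 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f

def tree_matches(labels, children, root, p):
    """Count occurrences of p along root-to-leaf paths of a tree whose arcs
    are directed from the root toward the leaves."""
    f, m, hits = failure(p), len(p), 0
    stack = [(root, 0)]              # (node, number of matched pattern chars)
    while stack:
        v, k = stack.pop()
        c = labels[v]
        while k and c != p[k]:       # KMP transition on character c
            k = f[k - 1]
        if c == p[k]:
            k += 1
        if k == m:                   # an occurrence ends at node v
            hits += 1
            k = f[k - 1]
        for w in children.get(v, []):
            stack.append((w, k))     # each child resumes from state k
    return hits
```

On the tree with root a and children b and a, the pattern ab occurs once (along the branch to b) and the pattern a occurs twice.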
One can reduce our special case to the tree case as follows. Cut the cycle at any edge \((v,w)\) to form a tree rooted at w. Read the cycle from v backward (possibly many times) to form a string \(S[1..m]\), where m is the pattern length. Create a path matching the reverse of \(S[1..m]\) and connect this path to the root w forming a new tree. Pattern matching on this tree takes linear time [2].
To see that the reduction works correctly, consider root r of some tree hanging from the cycle. Let \(S^r\) be the infinite string formed by reading the cycle starting at r backward. For searching a pattern of length m spanning r, it is sufficient to add a path spelling the reverse of \(S^r[1..m]\) on top of r and use the linear time solution for trees [2]. Furthermore, observe that the infinite strings \(S^r\) for all roots r along the cycle overlap, so it is sufficient to linearize the cycle until each root is preceded by a length m part of the reverse of its infinite string \(S^r\). To also cover matches inside the cycle, one can similarly consider any node on a cycle as a root. The reduction covers these cases.
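The unrolling step can be sketched with a small helper (hypothetical, not part of the original construction): given the cycle labels listed in arc order and a root position r on the cycle, it returns the m labels that a path prepended above r must spell, namely the reverse of \(S^r[1..m]\) read forward.

```python
def linearize_cycle(cycle_labels, r, m):
    """cycle_labels: node labels along the cycle in the direction of the
    arcs; r: index of the root on the cycle; m: pattern length.
    Returns the labels of the path to prepend above r, i.e., the last m
    labels preceding r along the cycle (wrapping around as needed)."""
    k = len(cycle_labels)
    return [cycle_labels[(r - m + i) % k] for i in range(m)]
```

For a cycle labeled a, b, c with root at position 0, the two predecessors along the arcs are b then c, so a prepended path spelling b, c followed by the root covers all length-2 prefixes entering the root from the cycle.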
Finally, the symmetric case of a cycle containing roots of upward directed trees (graphs whose nodes have outdegree at most 1) can be reduced to the case above by reversing all edges and reversing the pattern.
5.2 Language Intersection of Two DFAs
We can show a connection between SMLG and the emptiness intersection problem by turning a deterministic DAG and a pattern into two DFAs. We do so by modifying the graph of our reduction so that we also obtain a reduction from OV to the emptiness intersection.
Let G be the 3-DDAG obtained in the reduction of Section 3. We can obtain a DFA \(D_1\) from G as follows. First, the nodes in G become the states of \(D_1\), and each arc \((u,v)\) in G gives a transition from state u to state v in \(D_1\) with symbol \(L(v)\). Also let S be the states in \(D_1\) that correspond to \(\mathtt {b}\)-nodes in G with zero indegree. We add \(O(|S|)\) states to \(D_1\) forming a tree whose root becomes the initial state of \(D_1,\) and the leaves of this tree have a transition to the states in S with symbol \(\mathtt {b}\). Each of these new states has a transition labeled L to its left child and a transition labeled R to its right child.
The other DFA \(D_2\) is obtained from P as follows. We employ the same tree with \(|S|\) leaves as earlier, except that the transitions from these leaves with \(\mathtt {b}\) go to the same state: from this state, we have a simple chain of states that spells P. We can observe that P occurs in G if and only if the languages of \(D_1\) and \(D_2\) have a nonempty intersection, as this amounts to finding an occurrence of P starting from one of the \(\mathtt {b}\)-nodes in G corresponding to a state in S.
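Emptiness of the intersection itself can be decided by exploring only the reachable states of the product automaton. A minimal sketch, assuming each DFA is given as a per-state transition map (representation hypothetical):

```python
from collections import deque

def intersection_nonempty(d1, d2):
    """Decide whether L(D1) and L(D2) share a string, by BFS over the
    reachable part of the product automaton. Each DFA is a triple
    (delta, start, accepting), with delta: state -> {symbol: state}."""
    (t1, s1, a1), (t2, s2, a2) = d1, d2
    seen = {(s1, s2)}
    q = deque(seen)
    while q:
        u, v = q.popleft()
        if u in a1 and v in a2:          # both DFAs accept the spelled string
            return True
        for c, u2 in t1.get(u, {}).items():
            v2 = t2.get(v, {}).get(c)    # advance both DFAs on symbol c
            if v2 is not None and (u2, v2) not in seen:
                seen.add((u2, v2))
                q.append((u2, v2))
    return False
```

For example, the intersection of the language a* with the single string aa is nonempty, while its intersection with the single string ab is empty.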
6 Discussion
The lower bounds that we presented for directed deterministic graphs are tight with regard to the structure of the graph, in the sense that lowering the degree or the alphabet size makes the problem solvable in subquadratic time. Lowering the degree from 3 makes the problem fall into the almost-tree category that we dealt with in Section 5.1. Lowering the alphabet size to unary means that the graph can only consist of a set of paths or cycles. If there is a cycle in the graph, the pattern always matches; otherwise, one can easily check in linear time whether there is a long enough path for the pattern to match. Similar trivial or esoteric cases occur when considering the same restrictions for directed non-deterministic, undirected deterministic, and undirected non-deterministic graphs.
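The unary-alphabet check can be sketched as follows, assuming the deterministic graph is given as a successor map (each node has at most one out-neighbor, so the graph is a union of paths and cycles):

```python
def unary_matches(nodes, nxt, m):
    """Unary alphabet: the pattern (of length m) matches iff the graph has a
    cycle, or some path contains at least m nodes. nxt maps a node to its
    unique successor (absent entries mean no successor). Linear time via
    memoized walk lengths."""
    best = {}                           # node -> longest walk starting there
    for s in nodes:
        walk, on_walk, v = [], set(), s
        while v is not None and v not in best:
            if v in on_walk:            # the walk re-enters itself: a cycle,
                return True             # so arbitrarily long strings match
            on_walk.add(v)
            walk.append(v)
            v = nxt.get(v)
        length = best.get(v, 0) if v is not None else 0
        for u in reversed(walk):        # finalize lengths along the walk
            length += 1
            best[u] = length
    return max(best.values(), default=0) >= m
```

A three-node path thus matches patterns up to length 3, while any graph containing a cycle matches every pattern length.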
Our reductions create sparse graphs \(G=(V,E)\) with \(|E|=O(|V|)\), and hence the results also cover the difficulty of finding \(O(|V|^{1-\epsilon } \, m)\) or \(O(|V| \, m^{1-\epsilon })\) time algorithms for SMLG. This difficulty carries over to non-deterministic subdense graphs with \(|E|=O(|V|^{2-\epsilon })\) and alphabet size at least 3: given a sparse graph \(G^{\prime }=(V,E^{\prime })\) and a pattern P of length m over a binary alphabet, convert \(G^{\prime }\) into a subdense graph \(G=(V,E)\) by adding \(|E|\) spurious arcs labeled with a third symbol. In other words, unless the OV hypothesis fails, there is no \(O(|E|+|V|^{1-\epsilon } \, m +|E|^{\frac{1}{2}}\, m)\) time algorithm for SMLG on subdense graphs G for \(m=O(|V|)\). However, for dense graphs with \(|E|=\omega (|V|^{2-\epsilon })\), there is room to improve the bounds.
Open Problem 1.
Is there an \(O(|E|+|V|\, m+ |E|^{\frac{1}{2}}\, m)\) time algorithm for SMLG on dense graphs?
Other natural directions to continue the study include the tradeoff between indexing and query time on string matching for graphs, as well as a closer examination of other possible string-alike graph classes than those already covered here.
For the former, a slight modification of the proof of Theorem 1.1 results in conditional hardness of finding \(O(|E|^\alpha m^\beta)\) time algorithms for SMLG for any \(\alpha ,\beta \gt 0\) with \(\alpha +\beta \lt 2\). This observation can then be exploited in a self-reduction [24], showing that one cannot achieve subquadratic search times using polynomial time for indexing (under the OV hypothesis).
For the latter, one possible direction is to consider degenerate generalized strings [4]: a sequence \(S=S_1, S_2, \ldots , S_n\) is a degenerate generalized string if each set \(S_i\) consists of strings of a fixed length \(n_i\). When interpreted as an automaton, the language of S is the Cartesian product of its sets. It was recently shown that language intersection emptiness on two degenerate generalized strings can be decided in linear time in the total size of the sets [4]. However, if the requirement of equal-length strings is relaxed, the complexity of string matching on such elastic degenerate strings has been shown to have a tight connection with fast matrix multiplication [12]. Naturally, our reductions do not cover graphs representing degenerate generalized strings. They also do not cover the elastic case, but rather another relaxation of degenerate generalized strings: consider that the Cartesian product taking all combinations of consecutive sets is replaced by an arbitrary selection of subsets of combinations of consecutive sets. A characteristic feature of graphs resulting from this relaxation is that all paths from one node to another have the same length. This is also a feature of our reduction graphs. Hence, other features need to be identified to close this gap between linear time solvability and conditional quadratic time hardness; interestingly, conditional hardness of indexing elastic degenerate strings has been established without a direct link to the complexity of the online version [29].
After our last submission of this work for review, many new research directions have emerged around the topic. Some of these are already covered in a survey [42]. In the following, we briefly discuss some recent directions.
The conditional lower bounds have been strengthened to address how many logarithmic factors can be shaved off from the quadratic complexity [30]. The conclusion is that if the quadratic time complexity is divided by an \(O(\log ^c m)\) or \(O(\log ^c |E|)\) factor, then the exponent c is bounded by a constant. New graph properties have been identified that make graphs amenable to indexing: graphs that can be partially sorted [18], graphs parameterized by the maximum width of their co-lexicographic relation [17], and graphs induced from a suitable segmentation of multiple sequence alignments [25] admit efficient indexing schemes. The latter work adapts a reduction technique from this work to show that an arbitrary segmentation of a multiple sequence alignment does not break the conditional lower bound, and that a stronger property is needed. Further complexity results have also been derived for online exact and approximate matching on different graph classes [14, 20, 32]. Finally, SMLG has also been studied under the quantum computing model [19], achieving a subquadratic solution for non-sparse graphs.
Acknowledgments
We would like to acknowledge the contribution of Alessio Conte, Luca Versari, and Bastien Cazaux for useful and inspirational conversations. Also, we would like to thank an anonymous reviewer of a previous version of this article for pointing out the open problem on dense graphs.
Even if the work of Backurs and Indyk [9] represents the closest connection with our results, a folklore proof by Russell Impagliazzo about the hardness of the NFA acceptance problem was also known. We would like to thank Karl Bringmann for bringing this proof to our attention.5
Footnotes
1
Note that we can also define the node labels as nonempty strings, but it suffices to use single symbols to show that string matching in graphs is challenging.
2
Note that \(\mathtt {1}\) is a symbol of \(\Sigma\), whereas 1 is the truth value in \(x_i\).
3
An \(\mathtt {e}\)-node can have two \(\mathtt {b}\)-nodes as out-neighbors when linking \(G_{U1}\) to \(G_W\) (see [23]).
4
An alternative strategy is to use only one pattern \(P^{\prime \prime }\) instead of two, obtained by alternating the dummy subpattern \(\bar{P}\) with the actual subpatterns \(P^{\prime }_{x_i}\), starting and ending with \(\bar{P}\).
The “dummy” subpatterns \(\bar{P}\) encode a \(\mathtt {1}\) in every position and guarantee that we always have an odd number of subpatterns in \(P^{\prime \prime }\). Moreover, every actual subpattern \(P^{\prime }_{x_i}\) has a chance to be matched in \(LG_W^{(j)}\), for some j, since every such subpattern occurs in an even position.
Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. 2015. Tight hardness results for LCS and other sequence similarity measures. In Proceedings of the IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS’15). IEEE, Los Alamitos, CA, 59–78.
Tatsuya Akutsu. 1993. A linear time pattern matching algorithm between a string and a tree. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 684. Springer, 1–10.
Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. 2020. Regular languages meet prefix sorting. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA’20). 911–930.
Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. 2018. Degenerate string comparison and applications. In Proceedings of the 18th International Workshop on Algorithms in Bioinformatics (WABI’18). Leibniz International Proceedings in Informatics, Vol. 113. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, Article 21, 14 pages.
Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. 1997. Pattern matching in hypertext. In Algorithms and Data Structures. Lecture Notes in Computer Science, Vol. 1272. Springer, 160–173.
Arturs Backurs and Piotr Indyk. 2015. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC’15). ACM, New York, NY, 51–58.
Arturs Backurs and Piotr Indyk. 2016. Which regular expression patterns are hard to match? In Proceedings of the IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS’16). IEEE, Los Alamitos, CA, 457–466.
Arturs Backurs and Piotr Indyk. 2018. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput. 47, 3 (2018), 1087–1097.
Arturs Backurs and Christos Tzamos. 2017. Improving Viterbi is hard: Better runtimes imply faster clique algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML’17). Proceedings of Machine Learning Research, Vol. 70, 311–321. http://proceedings.mlr.press/v70/backurs17a.html.
Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. 2019. Even faster elastic-degenerate string matching via fast matrix multiplication. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP’19). Leibniz International Proceedings in Informatics, Vol. 132. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 21, 15 pages.
Karl Bringmann and Marvin Künnemann. 2015. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proceedings of the IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS’15). IEEE, Los Alamitos, CA, 79–97.
Alessio Conte, Gaspare Ferraro, Roberto Grossi, Andrea Marino, Kunihiko Sadakane, and Takeaki Uno. 2018. Node similarity with q-grams for real-world labeled networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’18). ACM, New York, NY, 1282–1291.
Nicola Cotumaccio. 2022. Graphs can be succinctly indexed for pattern matching in \(O(|E|^{2} + |V|^{5/2})\) time. In Proceedings of the Data Compression Conference (DCC’22). IEEE, Los Alamitos, CA, 272–281.
Nicola Cotumaccio and Nicola Prezza. 2021. On indexing and compressing finite automata. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA’21). 2585–2599.
Parisa Darbari, Daniel Gibney, and Sharma V. Thankachan. 2022. Quantum time complexity and algorithms for pattern matching on labeled graphs. In String Processing and Information Retrieval. Lecture Notes in Computer Science, Vol. 13617. Springer, 303–314.
Riccardo Dondi, Giancarlo Mauri, and Italo Zoppis. 2022. On the complexity of approximately matching a string to a directed graph. Information and Computation 288 (2022), 104748.
Massimo Equi, Roberto Grossi, and Veli Mäkinen. 2019. On the complexity of exact pattern matching in graphs: Binary strings and bounded degree. arXiv e-prints, arXiv:1901.05264 [cs.CC] (2019).
Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. 2019. On the complexity of string matching for graphs. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP’19). Leibniz International Proceedings in Informatics, Vol. 132. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 55, 15 pages.
Massimo Equi, Roberto Grossi, Alexandru I. Tomescu, and Veli Mäkinen. 2019. On the complexity of exact pattern matching in graphs: Determinism and zig-zag matching. arXiv e-prints, arXiv:1902.03560 [cs.CC] (2019).
Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. 2021. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In SOFSEM 2021: Theory and Practice of Computer Science. Lecture Notes in Computer Science, Vol. 12607. Springer, 608–622.
Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. 2022. Algorithms and complexity on indexing founder graphs. Algorithmica. Published online, July 28, 2022.
Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD’18). 1433–1445.
Travis Gagie, Giovanni Manzini, and Jouni Sirén. 2017. Wheeler graphs: A framework for BWT-based data structures. Theor. Comput. Sci. 698 (2017), 67–78.
Erik Garrison, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M. Eizenga, Eric T. Dawson, William Jones, et al. 2018. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36 (Aug. 2018), 875.
Daniel Gibney. 2020. An efficient elastic-degenerate text index? Not likely. In String Processing and Information Retrieval. Lecture Notes in Computer Science, Vol. 12303. Springer, 76–88.
Daniel Gibney, Gary Hoppenworth, and Sharma V. Thankachan. 2021. Simple reductions from formula-SAT to pattern matching on labeled graphs and subtree isomorphism. In Proceedings of the 4th Symposium on Simplicity in Algorithms (SOSA’21). 232–242.
Daniel Gibney and Sharma V. Thankachan. 2019. On the hardness and inapproximability of recognizing wheeler graphs. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA’19). Leibniz International Proceedings in Informatics, Vol. 144. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 51, 16 pages.
Daniel Gibney, Sharma V. Thankachan, and Srinivas Aluru. 2022. On the hardness of sequence alignment on de Bruijn graphs. J. Comput. Biol. 29, 12 (2022), 1377–1396.
Shohei Hido and Hisashi Kashima. 2009. A linear-time graph kernel. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM’09). IEEE, Los Alamitos, CA, 179–188.
Chirag Jain, Haowen Zhang, Yu Gao, and Srinivas Aluru. 2019. On the complexity of sequence to graph alignment. In Research in Computational Molecular Biology, Lenore J. Cowen (Ed.). Springer International Publishing, Cham, Switzerland, 85–100.
Udi Manber and Sun Wu. 1992. Approximate string matching with arbitrary costs for text and hypertext. In Advances in Structural and Syntactic Pattern Recognition. World Scientific, 22–33.
Kunsoo Park and Dong Kyue Kim. 1995. String matching in hypertext. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 937. Springer, 318–329.
Eric Prud’hommeaux and Andy Seaborne. 2008. SPARQL Query Language for RDF. World Wide Web Consortium Recommendation REC-rdf-sparql-query-20080115. W3C.
Marko A. Rodriguez. 2015. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages. 1–10.
Korbinian Schneeberger, Jörg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and Detlef Weigel. 2009. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10 (2009), R98.
Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and Philip S. Yu. 2017. A survey of heterogeneous information network analysis. IEEE Trans. Knowl. Data Eng. 29, 1 (2017), 17–37.
Jouni Sirén, Niko Välimäki, and Veli Mäkinen. 2014. Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. 11, 2 (March 2014), 375–388.