In this article, we present a derivative-based, functional recognizer and parser generator for visibly pushdown grammars. The generated parser accepts ambiguous grammars and produces, in linear time, a parse forest containing all valid parse trees for an input string. Each parse tree in the forest can then be extracted, also in linear time. Besides the parser generator, to allow more flexible forms of visibly pushdown grammars, we also present a translator that converts a tagged CFG to a visibly pushdown grammar in a sound way; the parse trees of the tagged CFG are then produced by running the semantic actions embedded in the parse trees of the translated visibly pushdown grammar. The performance of the parser is compared with popular parsing tools, including ANTLR, GNU Bison, and other popular hand-crafted parsers. The correctness and the time complexity of the core parsing algorithm are formally verified in the proof assistant Coq.
1 Introduction
Parsing is a fundamental component in computer systems. Modern parsers used in high-performance settings such as web browsers and network routers need to be efficient, as their performance is critical to the performance of the whole system. Furthermore, high-assurance parsers are becoming increasingly important for security in settings such as web applications, where parsers directly process potentially adversarial inputs from the network. In these settings, formally verified parsers are highly desirable.
Most parsing libraries are based on Context-Free Grammars (CFGs) or their variants. Although very flexible, CFGs have limitations in terms of efficiency. Not all CFGs can be converted to deterministic pushdown automata (PDA); the inherent nondeterminism in some CFGs causes the worst-case running time of general CFG-based parsing algorithms to be \(O(n^3)\).
To achieve efficient parsing, many parsing frameworks place restrictions on what CFGs can be accepted, at the expense of placing the burden on users to refactor their grammars to satisfy those restrictions. See the related-work section for a discussion of common kinds of restrictions, which lead to parsing frameworks such as LL(k), LR(k) [11], and so on.
This article explores an alternative angle of building parsers based on Visibly Pushdown Grammars (VPGs) [2]. In VPGs, users explicitly partition all terminals into three kinds: plain, call, and return symbols. This partitioning makes the conversion of a VPG to a deterministic PDA always possible, which provides the foundation for efficient algorithms. Compared to requiring users to refactor their grammars to resolve sophisticated conflicts during parser generation, such as shift-reduce conflicts in LR(k), asking users to specify which terminals are call and return symbols is less of a burden.
VPGs have been used in program analysis, XML processing, and other applications [1], but their potential in parsing has not been fully exploited. In this article, we show that VPGs bring many benefits in parsing. First, we show an efficient, linear-time parsing algorithm for VPGs. Second, our algorithm is amenable to formal verification. Overall, this article makes the following contributions:
•
We present a derivative-based algorithm for VPG recognition and parsing. The generated parser accepts ambiguous grammars and produces a parse forest for the input string, where each parse tree in the forest can be extracted in linear time.
•
We mechanize the correctness proofs of the parsing algorithm in Coq. We have also formalized and proved the linear-time performance guarantee of the parsing algorithm in Coq.
•
We present a surface grammar called tagged CFGs to allow a more convenient use of our parsing framework. Users can use their familiar CFGs for developing grammars and provide additional tagging information on terminals. A sound translator then converts a tagged CFG to a VPG. We note that the validation performed by this translator is conservative and places some restrictions on acceptable tagged CFGs; in particular, left recursion is not allowed, and users are required to perform standard refactoring to remove it.
•
In the performance evaluation of VPG-based parsers on popular formats including JSON, XML, and HTML, we discover that their performance is comparable with that of popular parser generators such as ANTLR and of hand-crafted parsers. In addition, we discover that VPG-based parsers often require a special lexer that groups multiple tokens into a single call/return symbol.
The remainder of this article is organized as follows: We first introduce VPGs in Section 2 and discuss related work in Section 3. In Section 4, we provide an overview of our parsing library and explore various usage scenarios. Section 5 presents a derivative-based VPG recognizer, which sheds light on the parsing algorithm discussed in Section 6. The translator and tagged CFGs are discussed in Section 7. We then evaluate the VPG parser in Section 8.
A conference version of this article was published in the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) [22]. We describe the differences between the conference version and this article as follows: First, we have redesigned the VPG parsing algorithm, including changing how the parser PDA is generated and replacing the pruner in the conference version with a set of parse-tree extraction functions. The new algorithm enables a simpler correctness proof and, more importantly, enables us to verify the performance guarantee: The parser is verified to have time complexity of \(O(kn)\), where k is the number of parse trees and n is the length of the input string. Second, we provide a new method that optimizes the parse-tree extraction functions to a more efficient extraction PDA. Third, we rerun all experiments to evaluate our new parsing algorithm. Finally, we have revised our translation algorithm from tagged CFGs to VPGs and added the proofs of its soundness and termination.
2 Background: VPG
As a class of grammars, VPGs [2] have been used in program analysis, XML processing, and other fields. Compared with CFGs, VPGs enjoy many good properties. Since the languages of VPGs are a subset of deterministic context-free languages, it is always possible to build a deterministic PDA from a VPG. The terminals in a VPG are partitioned into three kinds, and the stack action associated with an input symbol is fully determined by the kind of the symbol: A push action is always performed for a call symbol, a pop action is always performed for a return symbol, and no stack action is performed for a plain symbol. Furthermore, the languages of VPGs enjoy appealing closure properties; e.g., the set of visibly pushdown languages is closed under intersection, concatenation, and complement [2]. As will be shown in this article, VPGs also enable the building of linear-time parsers, and VPG parsers are amenable to formal verification. The expressive power of VPGs lies between that of regular grammars and that of CFGs and is sufficient for describing the syntax of many practical languages, such as JSON, XML, and HTML, with appropriately defined call/return symbols.
We next give a formal account of VPGs. A grammar G is represented as a tuple \((V, \Sigma , P, L_0)\), where V is the set of nonterminals, \(\Sigma\) is the set of terminals, P is the set of production rules, and \(L_0\in V\) is the start symbol. The alphabet \(\Sigma\) is partitioned into three sets: \(\Sigma _{\text{plain}}\), \(\Sigma _{\text{call}}\), and \(\Sigma _{\text{ret}}\), which contain plain, call, and return symbols, respectively. Notation-wise, a terminal in \(\Sigma _{\text{call}}\) is tagged with \(‹\) on the left, and a terminal in \(\Sigma _{\text{ret}}\) is tagged with \(›\) on the right. For example, \({‹ {a}}\) is a call symbol in \(\Sigma _{\text{call}}\), and \({b› }\) is a return symbol in \(\Sigma _{\text{ret}}\).
We first formally define well-matched VPGs. Intuitively, a well-matched VPG generates only well-matched strings, in which a call symbol is always matched with a return symbol in a derived string. Formally, every rule of a well-matched VPG is in one of the forms \(L\rightarrow \epsilon\), \(L\rightarrow c L_1\), or \(L\rightarrow {‹ {a}}L_1 {b› }L_2\).
Note that in \(L\rightarrow c L_1\) terminal c must be a plain symbol, and in \(L\rightarrow {‹ {a}}L_1 {b› }L_2\) a call symbol must be matched with a return symbol; these requirements ensure that any derived string must be well-matched.
The following is an example of a well-matched VPG, which is refactored from a grammar for XML:
In this example, nonterminals start with a lowercase character, such as “element,” and terminals start with an uppercase character, such as “OpenTag.” The special nonterminal “Empty” has a single rule that produces the empty string. The grammar shows a typical usage of VPGs to model a hierarchically nested matching structure of XML texts: “OpenTag” is matched with “CloseTag,” and “content” nested in between can be “element” itself (not shown in the above snippet) and forms an inner hierarchy.
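For concreteness, a well-matched VPG can be encoded directly as sets of rule tuples. The following minimal sketch is our own illustration (the paper's artifact is written in Coq) and uses a small example grammar of our own rather than the XML grammar above; the later sketches in this article reuse this encoding.

```python
# A small well-matched VPG G = (V, Sigma, P, L0), encoded as Python data.
# Call symbols are written "<a" and return symbols "b>" for readability.
V = {"L", "A"}
sigma_plain, sigma_call, sigma_ret = {"c"}, {"<a"}, {"b>"}

# P, split by rule form:
nullable    = {"L", "A"}                      # nonterminals with L -> epsilon
plain_rules = {("L", "c", "L")}               # L -> c L
match_rules = {("L", "<a", "A", "b>", "L")}   # L -> <a A b> L
L0 = "L"
```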
In the main text of this article, we consider only well-matched VPGs and use the term VPGs for well-matched VPGs. We also call rules of the form \(L\rightarrow {‹ {a}}L_1 {b› } L_2\) matching rules. Our Coq library actually supports a broader class of VPGs, which we call general VPGs; algorithms for general VPGs are given in the appendices. Intuitively, general VPGs can specify substrings of well-matched strings; e.g., \(\langle /p \rangle \langle p \rangle \langle /p \rangle \langle p \rangle\) is a substring of a well-matched string. This ability enables users to parse streaming data such as network traffic. As an extension of well-matched VPGs, general VPGs allow pending calls and returns, which means that a call/return symbol may not have its corresponding return/call symbol in the input string. To accommodate pending symbols, general VPGs additionally allow rules of the forms \(L\rightarrow {‹ {a}}L^{\prime }\) and \(L\rightarrow {b› }L^{\prime }\), which we call pending rules. Call/return symbols in pending rules are called pending call/return symbols. Further, the set of nonterminals V is partitioned into \(V^0\) and \(V^1\): nonterminals in \(V^0\) only generate well-matched strings, while nonterminals in \(V^1\) can generate strings with pending symbols.
The above definition imposes constraints on how \(V^0\) and \(V^1\) nonterminals can be used in a rule. For example, in \(L\rightarrow {‹ {a}}L_1 {b› }L_2\), nonterminal \(L_1\) must be a well-matched nonterminal; so P cannot include rules such as \(L_1 \rightarrow {‹ {a}}L_3\), since \(L_1\) is supposed to generate only well-matched strings.
The above constraints guarantee that we can identify paired call and return symbols simply by their locations. For example, consider the following string from a general VPG:
Even without knowing the rules, we can determine the pairing as follows: We begin with an empty stack T and read symbols in order, ignoring the plain symbols. When we read a call symbol, we push it to T; when we read a return symbol, we match it with the top of T and pop the top; if T is empty, then the return symbol is pending. After we read all symbols, the call symbols remaining in T are also pending. For the above string, the stack T has the following transitions:
Therefore, the first call symbol \({‹ {a}}\) is pending, and the second call symbol \({‹ {a}}\) is paired with the return symbol \({b› }\). Note that, in general VPGs, paired \({‹ {a}}\) and \({b› }\) are not necessarily generated from the same matching rule; they could also be derived from pending rules.
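The pairing procedure just described is straightforward to implement. The sketch below is our own illustration of it; the function name and token encoding are hypothetical.

```python
def pair_symbols(kinds):
    """Pair call/return symbols by their positions, as described above.

    `kinds` lists each symbol's kind, one of "call", "ret", or "plain".
    Returns (pairs, pending_calls, pending_rets); `pairs` maps the position
    of each matched call symbol to the position of its return symbol.
    """
    T = []                                 # the stack T: unmatched call positions
    pairs, pending_rets = {}, []
    for i, kind in enumerate(kinds):
        if kind == "call":
            T.append(i)
        elif kind == "ret":
            if T:
                pairs[T.pop()] = i
            else:
                pending_rets.append(i)     # a pending return symbol
        # plain symbols are ignored
    return pairs, T, pending_rets          # calls left in T are pending

# e.g., pair_symbols(["call", "plain", "call", "plain", "ret"]) leaves the
# first call symbol pending and pairs the second one with the return symbol.
```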
The notion of a derivation in VPGs is the same as the one in CFGs. We write \(w \rightarrow w^{\prime }\) to mean a single derivation step according to a grammar, where w and \(w^{\prime }\) are strings of terminals or nonterminals. We write \({L} \rightarrow ^* {w}\) to mean that w can be derived from L via a sequence of derivation steps.
3 Related Work
Most parser libraries rely on the formalism of Context-Free Grammars (CFGs) and user-defined semantic actions for generating parse trees. Many CFG-based parsing algorithms have been proposed in the past, including LL(k), LR(k) [11], Earley [14], and CYK [7, 24, 44], among many others. LL(k) and LR(k) algorithms are commonly used, but their input grammars must be unambiguous. Users often have to change/refactor their grammars to avoid conflicts in LL(k) and LR(k) parsing tables, a nontrivial task. Earley, CYK, and GLR parsing can handle any CFG, but their worst-case running time is \(O(n^3)\). In contrast, our VPG parsing accepts ambiguous grammars and runs in linear time.
Our VPG parsing algorithm relies on derivatives. One major benefit of working with derivatives is amenability to formal verification: proofs about derivatives involve algebraic transformations on symbolic expressions, which are easier to develop in proof assistants than reasoning about graphs (as required when formalizing LL and LR algorithms). Brzozowski [6] first presented the concept of derivatives and used it to build a recognizer for regular expressions. The idea was revived by Owens et al. [34] and generalized to generate parsers for CFGs [30], with an exponential worst-case time complexity. More recently, a symbolic approach [21] for parsing CFGs with derivatives was presented with cubic time complexity. Derivative-based CFG parsing is further improved with generalized zippers [9]. Danielsson [8] described a monadic parser combinator library, which used derivatives to run parsers; the library had a machine-checked proof in Agda [33], a dependently typed functional programming language. Finally, Edelmann et al. [15] presented a formally verified, derivative-based, linear-time parsing algorithm for LL(1) context-free expressions.
Owl is an open-source project [20] that provides a parser generator for VPGs. It has the same goal as our work, but differs in the following critical aspects: (1) Owl supports only well-matched VPGs, while our parsing library supports general VPGs; (2) Owl adopts a different algorithm and is implemented in an imperative way, while our parsing library is derivative-based and functional; (3) Owl does not provide formal assurance, while our parsing library is formally verified in Coq; (4) Owl rejects ambiguous VPGs, while our parsing library accepts ambiguous grammars and generates parse forests; and (5) Owl does not support semantic actions embedded in grammars, while our parsing library accepts semantic actions.
Due to parsers’ importance to security, many efforts have been made to build secure and correct parsers. One obvious approach is testing, through fuzz testing or differential testing (e.g., Petsios et al. [37]). However, testing generally cannot show the absence of bugs. Formal verification has also been applied to the building of high-assurance parsers. Jourdan et al. [23] applied the methodology of translation validation and implemented a verified parser validator for LR(1) grammars. RockSalt [31] included a verified parser for a regular-expression-based DSL. Barthwal and Norrish [4] presented a verified SLR parser. Lasser et al. [26] and Edelmann et al. [15] presented verified LL(1) parsers, but we are not aware of fully verified LL(k) parsers. Koprowski and Binsztok [25] implemented a formally verified parser generator for Parsing Expression Grammars (PEGs). Ramananandro et al. [38] presented a verified parser generator for tag-length-value binary message format descriptions. There are also verified parsers for general CFGs [5, 17, 39]. These parsers are limited in their performance on real-world grammars, with a worst-case complexity of \(O(n^3)\); the reason is that they have to handle highly ambiguous CFGs, whose nondeterminism hinders the parsers’ performance. Recently, Lasser et al. [27] implemented a verified ALL(*) parser, which is the algorithm behind ANTLR4; however, it did not provide a linear-time worst-case guarantee. We formalize our derivative-based VPG parsing algorithm and its correctness and linear-time performance proofs in Coq.
4 VPG Frontend Overview
The focus of this article is on VPG-based parsing. However, since users are already familiar with CFGs, our system takes as input a CFG-based surface grammar, which is called a tagged CFG. In this section, we will give a brief overview of tagged CFGs and some application scenarios; we also discuss tagged CFGs’ expressiveness and limitations. In Section 7, we will discuss tagged CFGs in detail and how to convert a tagged CFG to a VPG if the tagged CFG passes a validation process. Once a VPG is obtained, we can use the algorithm described in Section 6 to produce a parser.
Tagged CFG overview. Roughly speaking, a tagged CFG is a CFG enhanced with tagging information about what terminals are call and return symbols; in addition, we allow regular operators for the convenience of specification. As examples, the two subfigures in Figure 1 present tagged CFGs for JSON and XML documents, respectively; more examples of tagged CFGs can be found in Appendix F.
Fig. 1. Tagged CFGs for JSON and XML documents.
By placing a left bracket < before a terminal, the user can mark it as a call symbol; similarly, a right bracket > after a terminal marks it as a return symbol. We mentioned in Section 2 that the locations of call and return symbols in a VPG determine how they are matched. Similarly for a tagged CFG, matched call and return symbols should appear in the same rule.
Expressiveness and limitations. In practice, we can make tagged CFGs more expressive by utilizing a lexer with an extended notion of call and return symbols. Take XML as an example. If individual characters were considered as terminals, then the XML grammar could not be expressed as a tagged CFG; this is because an XML tag is complex and includes components such as the tag name and the tag attributes. However, if a whole XML tag is treated as a token in an extended lexer, then the XML grammar can be expressed as a tagged CFG and is convertible to a VPG, as our evaluation shows. For instance, the call symbol for XML is the open tag, which comprises the tag name and also possible attributes.
Since the syntax of such tags can be specified by regular expressions, we can treat a whole tag as a single token and extend the lexer to store semantic values such as the tag name p and the attribute id="a"; this can be performed efficiently [13].
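As an illustration of such an extended lexer, the following sketch (ours, not the paper's implementation) uses regular expressions to recognize a whole XML open tag as a single call token while storing its semantic values; the patterns are deliberately simplified.

```python
import re

# Hypothetical token patterns: a whole XML open tag such as <p id="a"> is one
# call symbol, </p> is one return symbol, and text is a plain symbol.
OPEN_TAG  = re.compile(r'<([A-Za-z][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*>')
CLOSE_TAG = re.compile(r'</([A-Za-z][\w-]*)\s*>')
TEXT      = re.compile(r'[^<]+')

def lex(s):
    """Yield (kind, terminal, semantic_value) tokens for simplified XML."""
    pos = 0
    while pos < len(s):
        if m := CLOSE_TAG.match(s, pos):
            yield ("ret", "CloseTag", m.group(1))                   # tag name
        elif m := OPEN_TAG.match(s, pos):
            yield ("call", "OpenTag", (m.group(1), m.group(2).strip()))
        elif m := TEXT.match(s, pos):
            yield ("plain", "Text", m.group(0))
        else:
            raise ValueError(f"lex error at position {pos}")
        pos = m.end()

# lex('<p id="a">hi</p>') yields one call, one plain, and one return token,
# all mapping to the single terminals OpenTag/Text/CloseTag of the grammar.
```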
As will be discussed in Section 7, the primary limitation of a tagged CFG is that it must pass validation so we can convert it to a VPG. Roughly, this means that each dependency loop in the grammar must be enclosed in at least one pair of matched call and return symbols. A dependency loop exists when a nonterminal L can be used to generate itself via a set of rules; this notion is discussed formally in Definition 7.3. For example, the following grammar includes a dependency loop, as the right-hand side of the first rule uses L itself:
\begin{equation*} L\rightarrow a L L b \mid c. \end{equation*}
The previous grammar would not pass the validation; however, the user can use the following tags to make it a valid tagged CFG, by ensuring that the dependency stays within \({‹ {a}}\) and \({b› }\):
\begin{equation*} L\rightarrow {‹ {a}} L L {b› } \mid c. \end{equation*}
In this case, we describe the dependency loop as well-founded. For the JSON grammar in Figure 1, the call-return pairs (’[’, ’]’) and (’{’, ’}’) ensure that all dependency loops are well-founded.
Because of the restriction on dependency loops, left recursion is not supported by tagged CFGs. However, in practice, we can easily remove direct left recursion by performing grammar refactoring [36].
5 VPG-based Recognition
Before presenting a parsing algorithm for VPGs, we first present a derivative-based algorithm for converting a well-matched VPG into a deterministic PDA for recognition. The resulting PDA recognizes the same set of strings as the input VPG.
To compute a recognizer, we utilize the notion of derivatives [6, 30, 34]. The derivative of a language \(\mathcal {L}\) with respect to an input symbol c is the residual set of strings of those in \(\mathcal {L}\) that start with c:
\begin{equation*} \delta _c(\mathcal {L}) = \lbrace w \mid cw \in \mathcal {L}\rbrace . \end{equation*}
Before giving formal derivative definitions, we first review derivatives for regular grammars [6, 34] in Example 5.1. We then extend them to VPGs in Example 5.2, where we also discuss informally the intuition of the states, the stack, and the transition function of the PDA that is converted from an input VPG. Throughout this section and Section 6, we use the VPG G shown in Figure 2(c), where the start nonterminal is L, and the right-regular grammar \(G^{\prime }\), which is obtained from the VPG G by deleting the matching rules.
Fig. 2. The running examples: the right-regular grammar \(G^{\prime }\) and its recognizer DFA (b); the VPG G (c) and its recognizer PDA (e).
Definition 5.2 (Derivative Functions).
(1)
\(\delta _c(S) = (S^{\prime },\lambda T.T)\), where
\begin{equation*} S^{\prime } = \lbrace (L_1,L_3) \mid (L_1,L_2)\in S \wedge (L_2\rightarrow c\ L_3)\in P\rbrace . \end{equation*}
For a plain symbol \(c\in \Sigma _{\text{plain}}\), it checks each pair \((L_1,L_2)\) in the current state S, and if there is a rule \(L_2\rightarrow c\ L_3\), then the pair \((L_1,L_3)\) becomes part of the new state. In addition, the stack is left unchanged.
(2)
\(\delta _{{‹ {a}}}(S) = (S^{\prime },\lambda T.[S,{‹ {a}}]\cdot T)\), where
\begin{equation*} S^{\prime } = \lbrace (L_3,L_3) \mid (L_1,L_2)\in S \wedge (L_2\rightarrow {‹ {a}}L_3 {b› }L_4)\in P\rbrace . \end{equation*}
For a call symbol \({‹ {a}}\in \Sigma _{\text{call}}\), it checks each pair \((L_1,L_2)\) in the current state S; if there is a rule \(L_2 \rightarrow {‹ {a}}L_3 {b› } L_4\), then the pair \((L_3,L_3)\) becomes part of the new state; note there is a context change, since a call symbol is encountered. In addition, the old state together with \({‹ {a}}\) is pushed to the stack.
(3)
\(\delta _{ {b› }}(S,[S_1,{‹ {a}}]) = (S^{\prime },\text{tail})\), where \(\text{tail}\) is a function that returns the tail of the stack, and
\begin{equation*} S^{\prime } = \lbrace (L_1,L_5) \mid (L_1,L_2)\in S_1 \wedge \exists (L_3,L_4)\in S,\ (L_4\rightarrow \epsilon)\in P \wedge (L_2\rightarrow {‹ {a}}L_3 {b› }L_5)\in P\rbrace . \end{equation*}
For a return symbol \({b› }\in \Sigma _{\text{ret}}\) and a stack top \([S_1,{‹ {a}}]\), it checks each pair \((L_1,L_2)\) in the state \(S_1\) of the stack top symbol; if there is a pair \((L_3,L_4)\) in the current state S, \(L_4\) can derive the empty string, and there is a rule \(L_2\rightarrow {‹ {a}}L_3 {b› }\ L_5\), then the pair \((L_1,L_5)\) becomes part of the new state; note that it checks \(L_4\rightarrow \epsilon\) to ensure that the current level is finished before returning to the upper level. In addition, the stack top is popped from the stack. Figure 3 presents a drawing that depicts the situation when a return symbol is encountered.
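To make the three derivative functions concrete, here is a minimal executable sketch under the rule encoding introduced in Section 2: a state is a frozenset of nonterminal pairs, and `plain_rules`, `match_rules`, and `nullable` are as before. This is our own illustration, not the verified Coq code.

```python
# A recognizer state S is a frozenset of nonterminal pairs (L1, L2).
# `plain_rules` holds triples (L2, c, L3) for L2 -> c L3; `match_rules` holds
# tuples (L2, a, L3, b, L5) for L2 -> <a L3 b> L5; `nullable` holds the
# nonterminals with an epsilon rule.

def delta_plain(S, c, plain_rules):
    """delta_c: step on a plain symbol; the stack is left unchanged."""
    return frozenset((L1, L3) for (L1, L2) in S
                     for (h, c2, L3) in plain_rules
                     if h == L2 and c2 == c)

def delta_call(S, a, match_rules):
    """delta_<a: step on a call symbol; the caller pushes [S, <a]."""
    return frozenset((L3, L3) for (L1, L2) in S
                     for (h, a2, L3, b, L5) in match_rules
                     if h == L2 and a2 == a)

def delta_ret(S, b, top, match_rules, nullable):
    """delta_b>: step on a return symbol; `top` is the stack top [S1, <a]."""
    S1, a = top
    return frozenset((L1, L5) for (L1, L2) in S1
                     for (h, a2, L3, b2, L5) in match_rules
                     if h == L2 and a2 == a and b2 == b
                     and any(L3p == L3 and L4 in nullable for (L3p, L4) in S))
```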
We formalize the semantics of PDA configurations as sets of accepted strings:
Fig. 3. An example of state transitions: Assuming there is a rule \(L_2 \rightarrow {‹ {a}}L_3 {b› }L_5\), \((L_1,L_2)\) transitions to \((L_3,L_3)\) on symbol \({‹ {a}}\), and finally to \((L_1,L_5)\) on symbol \({b› }\).
Definition 5.3 (Semantics of PDA Configurations).
We write \((S, T)\leadsto w\) to mean that a terminal string w can be accepted by the configuration \((S, T)\). It is defined as follows:
(1)
\((S, \bot)\leadsto w \mbox{ if } \exists (L_1,L_2)\in S,\ \text{s.t. }L_2\rightarrow ^{*}w\),
The correctness of the derivative functions is stated in the following theorem, whose proof is detailed in Appendix A. Take the case of \(\delta _c\) as an example: The theorem states that \((S,T)\) matches \(cw\) iff the configuration after running \(\delta _c\) matches w (the string after consuming c).
Theorem 5.1 (Derivative Function Correctness).
•
Assume \(\delta _c(S)=(S^{\prime },\lambda T.T)\) for a plain symbol c. Then \((S,T)\leadsto cw\) iff \((S^{\prime },T) \leadsto w\).
•
Assume \(\delta _{{‹ {a}}}(S)=(S^{\prime },\lambda T.[S,{‹ {a}}]\cdot T)\) for a call symbol \({‹ {a}}\). Then \((S,T)\leadsto {‹ {a}}w\) iff \((S^{\prime }, [S,{‹ {a}}]\cdot T)\leadsto w\).
•
Assume \(\delta _{ {b› }}(S, [S_1,{‹ {a}}])=(S^{\prime },\text{tail})\) for a return symbol \({b› }\). Then \((S, [S_1,{‹ {a}}]\cdot T)\leadsto {b› }w\) iff \((S^{\prime },T)\leadsto w\).
With those derivative functions, we can convert a VPG to a PDA, whose set of states is the least solution of the following equation; it makes sure that states are closed under derivatives:
We note that the least solution to the previous equation may include unreachable states, since the last line of the equation considers all \((S, [S^{\prime }, {‹ {a}}])\) without regard for whether such a configuration is possible. This may make the resulting PDA contain more states than necessary and occupy more space. However, unreachable states do not affect the linear-time guarantee of VPG parsing, as unreachable states are not traversed during parsing; further, in our experiments, we did not encounter space issues when representing PDA states and transitions.
Algorithm 1 is an iteration-based method to solve the equation for the least solution, where the returned \(S_0\) is the initial state, A is the set of all states, and \(\mathcal {T}\) is the set of edges between states. In each iteration, N is the set of states on which the algorithm still needs to perform derivatives. Line 7 then performs derivatives using call and plain symbols, and line 10 performs derivatives using return symbols.
At the end of each iteration, the following invariants are maintained: (1) \(N \subseteq A\); (2) for state \(S \in A-N\) and \(i \in \Sigma _{\text{call}}\cup \Sigma _{\text{plain}}\), if \(\delta _i(S)=(S^{\prime },f)\), then \(S^{\prime } \in A\); (3) for states \(S, S^{\prime } \in A-N\), \({‹ {a}}\in \Sigma _{\text{call}}\), and \({b› }\in \Sigma _{\text{ret}}\), if \(\delta _{ {b› }}(S, [S^{\prime },{‹ {a}}])=(S^{\prime \prime },f)\), then \(S^{\prime \prime } \in A\). With these invariants, when N becomes empty, A is closed under derivatives.
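The worklist structure of Algorithm 1 can be sketched as follows; this is our reconstruction from the invariants above (reusing the `delta_*` functions from the previous sketch), not the paper's algorithm verbatim.

```python
def build_recognizer_pda(L0, sigma_plain, sigma_call, sigma_ret,
                         plain_rules, match_rules, nullable):
    """Worklist computation of the least state set closed under derivatives."""
    S0 = frozenset({(L0, L0)})
    A = {S0}                  # all states discovered so far
    N = {S0}                  # worklist: states not yet expanded
    edges = {}
    while N:
        new = set()
        for S in N:           # derivatives on plain and call symbols
            for c in sigma_plain:
                edges[(S, c)] = S2 = delta_plain(S, c, plain_rules)
                new.add(S2)
            for a in sigma_call:
                edges[(S, a)] = S2 = delta_call(S, a, match_rules)
                new.add(S2)
        # derivatives on return symbols: every known state may serve as the
        # stack top, including states discovered in this iteration
        for S in A | new:
            for S1 in A | new:
                for a in sigma_call:
                    for b in sigma_ret:
                        edges[(S, b, S1, a)] = S2 = \
                            delta_ret(S, b, (S1, a), match_rules, nullable)
                        new.add(S2)
        N = new - A           # expand only freshly discovered states
        A |= new
    return S0, A, edges
```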
Once the PDA is constructed from a VPG, it can be run on an input string in a standard way. For completeness, we include the definition here so we can state the correctness theorem formally. Recall that a runtime configuration of a recognizer PDA is a pair \((S,T)\), where S is a state and T is a stack, denoted as \(T=t_1\cdot t_2\cdots t_k\cdot \bot\), where \(t_i=[S_i, {‹ {a}}_i]\) for \(i=1..k\), \({‹ {a}}_i \in \Sigma _{\text{call}}\) is a call symbol, \(t_1\) the top of T, and \(\bot\) the empty stack. The PDA’s initial configuration is \((\lbrace (L_0,L_0)\rbrace ,\bot)\) and its acceptance configurations are defined as follows:
Definition 5.4 (PDA Acceptance Configurations).
Given a VPG \(G=(V,\Sigma ,P,L_0)\), a pair \((S,T)\) is an acceptance configuration if \(T=\bot\), and \(\exists (L,L^{\prime })\in S\ \text{s.t. }(L^{\prime }\rightarrow \epsilon)\in P\).
Definition 5.5 (Recognizer PDA Execution).
The runtime execution \(\mathcal {F}\) of a PDA \((S_0,A,\mathcal {T})\) is defined as follows, where \(S_0\) is the start state, A is the set of states, and \(\mathcal {T}\) is the set of configuration transitions. On an input symbol i, the execution takes a configuration \((S,T)\) to \((S^{\prime },T^{\prime })\), where:
(1)
if \(i\in \Sigma _{\text{call}}\cup \Sigma _{\text{plain}}\), then \((S,S^{\prime })\in \mathcal {T}\) and is marked with \((i,f)\), and \(T^{\prime }=f(T)\);
(2)
if \(i\in \Sigma _{\text{ret}}\) and \(T=t\cdot T^{\prime }\), then \((S,S^{\prime })\in \mathcal {T}\) and is marked with \((i,t,f)\), and \(T^{\prime }=f(T)\).
Given an input string \(w = w_1\ldots w_n\), we say the PDA accepts w if there exists a sequence of configurations \((S_0, T_0), \ldots , (S_n, T_n)\) such that \((S_0,T_0)\) is the initial configuration, the execution \(\mathcal {F}\) takes \((S_{i-1},T_{i-1})\) to \((S_i,T_i)\) on input \(w_i\) for \(i=1..n\), and \((S_n,T_n)\) is an acceptance configuration.
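Under the same encoding, running the recognizer is a straightforward left-to-right fold over the input; the sketch below (ours) implements Definitions 5.4 and 5.5 directly on the `delta_*` functions from above, without precomputing the PDA.

```python
def run_recognizer(word, L0, sigma_call, sigma_ret,
                   plain_rules, match_rules, nullable):
    """Run the recognizer on `word` and test acceptance (Definition 5.4)."""
    S, T = frozenset({(L0, L0)}), []       # the initial configuration
    for i in word:
        if i in sigma_call:
            T.append((S, i))               # push [S, <a]
            S = delta_call(S, i, match_rules)
        elif i in sigma_ret:
            if not T:
                return False               # no call symbol to match
            S = delta_ret(S, i, T.pop(), match_rules, nullable)
        else:
            S = delta_plain(S, i, plain_rules)
        if not S:
            return False                   # the PDA is stuck
    # acceptance: the stack is empty and some (L, L') in S has L' -> epsilon
    return not T and any(L2 in nullable for (_, L2) in S)

# With the example grammar of Section 2, run_recognizer(["c", "<a", "b>"], "L",
# {"<a"}, {"b>"}, plain_rules, match_rules, nullable) returns True.
```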
Then PDA correctness can be stated as follows. We detail the correctness proof in Appendix A.
Theorem 5.2 (PDA Correctness).
For VPG G and its start symbol \(L_0\), a string \(w\in \Sigma ^*\) can be derived from \(L_0\) (i.e., \(L_0\rightarrow ^{*}w\)) iff w is accepted by the corresponding PDA.
For converting general VPGs (i.e., with pending rules) to PDAs, a couple of changes need to be made to the derivative-based approach: (1) The derivative functions need to consider also pending rules; (2) the acceptance stack may be nonempty because of pending call symbols. The construction is discussed in Appendix B.
The time complexity of Algorithm 1. Algorithm 1 terminates when no new state can be produced. Since a PDA state contains pairs of nonterminals, the number of states produced at the end, \(|A|\), is bounded by \(O(2^{|V|^2})\). For each produced state, the derivative functions are called at line 6 and line 9. The number of times they are called across all iterations is no more than \(|A|\times |\Sigma _{\text{call}}\cup \Sigma _{\text{plain}}|+|A|\times |\Sigma _{\text{ret}}|\times |A| \times |\Sigma _{\text{call}}|\), which is bounded by \(O(|A|^2|\Sigma |^2)\). Among the three derivative functions, the one for return symbols has the largest time complexity. Recall that in the definition of \(\delta _{ {b› }}(S, [S_1, {‹ {a}}])\), we consider each pair \((L_1,L_2)\) in \(S_1\), find a matching rule \(L_2\rightarrow {‹ {a}} L_3 {b› }L_5\), and check if \((L_3,L_4)\) is in S for some \(L_4\) that can derive the empty string. Therefore, the cost of one call is bounded by \(|S_1|\times |P|\times |S|\), which is further bounded by \(O(|V|^4|P|)\). At lines 7 and 11, at most \(O(|A|^2)\) edges are added across all iterations. At lines 12 and 13, collecting the states in one iteration takes \(O(|N^{\prime }|+|N_R|)=O(|A|)\), and computing the difference and union of the sets takes \(O(|N|+|A|)\); so the time complexity is \(O(|A|)\). Therefore, the total time complexity is \(O(|A|^2|\Sigma |^2|V|^4|P|) + O(|A|^2) = O(2^{2|V|^2}|V|^4|\Sigma |^2|P|)\).
6 VPG-based Parsing
Parsing is the process of building the parse trees of a given string; it is equivalent to finding the rule sequences that generate the string. To achieve this, our VPG-based parsing framework includes two steps: We first run a parser PDA on the string, which leaves a trace of PDA states that can be treated as a parse forest; we then execute extraction functions on the forest to extract all valid parse trees. We further propose an optimization that turns the second step into an extraction PDA: Instead of extracting all parse trees, the extraction PDA extracts a single parse tree. In the following discussion, we first describe parse trees in the context of VPGs, then discuss the parser PDA in Section 6.1, the extraction functions in Section 6.2, and the optimization of extraction as an extraction PDA in Section 6.3. We then discuss the correctness of our parsing framework in Section 6.4 and its time complexity in Section 6.5. Both the correctness proof and the time-complexity proof have been mechanized in the proof assistant Coq; we provide more details about the Coq mechanization in Section 6.6.
Before proceeding, we note that the parsing approach in this section is different from the approach described in the conference version [22], which presented a parsing method that includes a parser PDA, a pruner PDA, and an extractor. The correctness of that method was verified, but we were not able to formalize and verify its linear-time performance guarantee when extracting a parse tree; the structure of the PDA states of that method made it hard to reason about the performance (and also the correctness). Our new method improves the parsing function and the parser PDA so we can build a verified extraction function and verify the performance: For an input string w that has k valid parse trees, we verify that the time complexity of parsing with the new approach is \(O(k|w|)\). When only one of the parse trees is needed, as described later in this article, an extraction PDA extracts just one parse tree, and we prove that it takes linear time (\(O(|w|)\)).
Given a grammar with a rule set P, we define a notion of dotted rules. A dotted rule is a rule in P with an extra dot on the right-hand side. Intuitively, the dot tells the parsing position: To the right of the dot there is a nonterminal, which tells what should be parsed next.
Definition 6.1 (Dotted Rules \(\dot{P}\))
Given a well-matched VPG \(G=(\Sigma ,V,P,L_0)\), we define the set of dotted rules as
\begin{equation*} \dot{P} = \lbrace L\rightarrow c.L_1 \mid (L\rightarrow c L_1)\in P\rbrace \cup \lbrace L\rightarrow {‹ {a}}.L_1 {b› }L_2,\ L\rightarrow {‹ {a}}L_1 {b› }.L_2 \mid (L\rightarrow {‹ {a}}L_1 {b› }L_2)\in P\rbrace . \end{equation*}
Dotted rules are similar to items in LR parsing [11], but dots in our dotted rules appear only after a terminal, while dots in items do not have this restriction.
Parse trees. Given a grammar \(G=(\Sigma ,V,P,L_0)\) and its dotted rules \(\dot{P}\), we define a parse tree v as a rule sequence; i.e., \(v=[r_1, \ldots , r_n]\), where \(r_i\in \dot{P},1\le i\le n\); when \(n=0\), it is an empty parse tree. For two parse trees \(v_1=[r_1, \ldots , r_{n_1}]\) and \(v_2=[r_1^{\prime }, \ldots , r_{n_2}^{\prime }]\), we use the notation \(v_1+v_2\) to represent their concatenation \([r_1, \ldots , r_{n_1},r_{1}^{\prime }, \ldots , r_{n_2}^{\prime }]\). We use \({L} \Downarrow \left({w},{v}\right)\) for the relation stating that v can derive the string w starting from nonterminal \(L\in V\); its rules are in Figure 4. We call v a valid parse tree of w if \({L_0} \Downarrow \left({w},{v}\right)\). Note that a rule in Figure 4 produces a sequence of dotted rules. Also, the last rule produces a sequence in which \(L\rightarrow {‹ {a}}{L_1} {b› }L_2\) appears twice, but with different positions for the dot: one after the call symbol \({‹ {a}}\) and one after the return symbol \({b› }\); this is because the call and the return symbols are parsed independently, resulting in the need to differentiate how \(L\rightarrow {‹ {a}}{L_1} {b› }L_2\) is used. During parsing, when we have used a dotted rule to parse a call symbol but have not read the matched return symbol, we call the dotted rule unclosed.
Fig. 4. The big-step parse-tree derivation, assuming a well-matched VPG \(G=(\Sigma ,V,P,L_0)\).
We introduce a set of definitions to extract parts of a dotted rule. We define \(\mathrm{headNT}\left(r\right)\) to be the head nonterminal of r and \(\mathrm{nextNT}\left(r\right)\) to be the nonterminal after the dot in r: e.g., for \(r=(L\rightarrow {‹ {a}}.L_1 {b› }L_2)\), we have \(\mathrm{headNT}\left(r\right)=L\) and \(\mathrm{nextNT}\left(r\right)=L_1\), while for \(r=(L\rightarrow {‹ {a}}L_1 {b› }.L_2)\), we have \(\mathrm{nextNT}\left(r\right)=L_2\).
We abuse the notation and for parse tree \(v=[r_1, \ldots , r_n]\) write \(\mathrm{headNT}\left(v\right)\) to be \(\mathrm{headNT}\left(r_1\right)\) and \(\mathrm{nextNT}\left(v\right)\) to be \(\mathrm{nextNT}\left(r_{n}\right)\). We also define \(\operatorname{firstRule}(v)\) to be \(r_1\) and \(\operatorname{lastRule}(v)\) to be \(r_{n}\).
6.1 The Parser PDA
Recall that in the recognizer PDA, a transition between two states for an input symbol is computed based on possible rules that can derive the input symbol. As a result, a state-transition trace that accepts an input string essentially represents the rule sequences that can possibly derive the string. We illustrate this through the following examples, before presenting a formal algorithm for constructing a parser PDA for a VPG:
Example 6.1 (From Recognizer DFAs to Parser DFAs).
We begin with regular grammars. Recall the DFA in Figure 2(b). Instead of taking a state as a set of nonterminals, let us replace each state with a set of rules that can produce the last consumed symbol; we call the new automaton the parser DFA, as shown in Figure 5. The parser DFA has more states than the recognizer DFA. First, we introduce a pseudo-rule \(L\rightarrow .L\) to construct an initial state with that rule. In the pseudo-rule, we use a dot “.” to indicate the parsing position. Second, the recognizer state \(\lbrace D \rbrace\) is split into two states \(\lbrace B\rightarrow d.D \rbrace\) and \(\lbrace A\rightarrow c.D \rbrace\) to capture precisely the parsing rule used when transitioning from \(\lbrace L \rightarrow c.A,\ L \rightarrow c.B\rbrace\).
Fig. 5. The parser DFA for the regular grammar in Figure 2(b). The pseudo-rule \(L\rightarrow .L\) appears only in the initial state, representing a rule in which only its next nonterminal (after the dot) matters.
Compared with the recognizer DFA, a trace of states in the parser DFA represents a parse tree. For example, consider the input string \(s=ccc\). Its trace in the parser DFA is as follows:
This trace essentially represents the following parse tree \(v_s\) of the string \(ccc\):
\begin{equation*} v_s = [L\rightarrow c.A; A \rightarrow c.D; D \rightarrow c.L]. \end{equation*}
In fact, we can extract a parse tree from the trace as follows: We start from the last state and take a rule r whose nonterminal after the dot can derive the empty string \(\epsilon\), and define the initial parse tree \(v_0\) as \([r]\). In this example, the initial parse tree is \(v_0 = [D \rightarrow c.L]\).
Then, we expand the parse tree by processing states in the trace backward. In particular, for each state S, we add a rule \(r \in S\) that is connected to the head rule of the current parse tree v, i.e., \(\mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right)\),
and prepend r to the current parse tree v. For example, for the first step, we add \(A \rightarrow c.D\) to \(v_0\) to have \(v_1\) as follows; similarly, \(v_2\) is built by adding \(L \rightarrow c.A\) to \(v_1\):
\begin{align*} v_1 &= [A \rightarrow c.D; D \rightarrow c.L], \\ v_2 &= [L\rightarrow c.A; A \rightarrow c.D; D \rightarrow c.L]. \end{align*}
Note that \(v_2\) is the final parse tree we desire. In general, there may be multiple parse trees for an input string, and we need to maintain a set of parse trees during the extraction process.
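This backward extraction for the regular case can be written directly. In the sketch below (our illustration), a dotted rule \(L\rightarrow c.L_1\) is encoded as the triple `(L, c, L1)`:

```python
def extract_regular(trace, nullable):
    """Backward parse-tree extraction from a parser-DFA trace.

    `trace` is the list of states [S_1, ..., S_n]; each state is a set of
    dotted rules (L, c, L1), and `nullable` holds the nonterminals that can
    derive the empty string. Returns every parse tree of the input.
    """
    # initial trees: rules in the last state whose next nonterminal is nullable
    trees = [[r] for r in trace[-1] if r[2] in nullable]
    for S in reversed(trace[:-1]):
        # prepend each rule r connected to the head rule of a current tree,
        # i.e., nextNT(r) == headNT(tree): r[2] == tree[0][0]
        trees = [[r] + t for t in trees for r in S if r[2] == t[0][0]]
    return trees

# For the trace of "ccc" in Figure 5 (with L nullable), this returns the
# single parse tree [("L","c","A"), ("A","c","D"), ("D","c","L")].
```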
The above process is essentially the same as how a parser PDA is constructed for a VPG and how a parse tree is extracted from a trace of the parser PDA. We discuss an example next.
Example 6.2 (From Recognizer PDAs to Parser PDAs).
Recall the recognizer PDA shown in Figure 2(e); we first discuss how to extend it to the parser PDA shown in Figure 6 and then illustrate it using a concrete example.
Fig. 6. The parser PDA for the VPG shown in Figure 2(c). Each state is marked with its index; e.g., the initial state is State 0. The stack operation “Push i” means pushing State i to the top of the stack, and “Pop i” means removing the top of the stack when the top is State i.
Similar to the case of regular grammars, for each pair \((L_1,L_2)\) in a recognizer PDA state, nonterminal \(L_2\) is replaced by the rules that can produce the last consumed symbol. The context \(L_1\), however, is replaced with the most recent unclosed dotted rule; when there is no such rule, we use a special context called \(\text{None}\). The stack T could still store the parser PDA states paired with call symbols; however, since the call symbols are already recorded in the rules in the states, the stack needs to store only the PDA states.
Now let us reexamine the VPG shown in Figure 2(c) and the input string \(s=c{‹ {a}}cc {b› }\). After running the parser PDA in Figure 6 on the string s, we have the following trace, where the initial state is omitted:
Similar to the trace in Example 6.1, the preceding trace includes the following parse tree of the string s:
Let us extract the above parse tree from the trace in a way similar to the case of regular grammars. Starting from the last state, we pick a rule r that has the context \(\text{None}\) and whose next nonterminal can derive the empty string \(\epsilon\). We initialize the parse tree as \(v_0=[r]\). In this example, we have
Now, we trace back to the state \(\lbrace (A\rightarrow {‹ {a}}.A {b› }L,D\rightarrow c.L)\rbrace\). Since the last symbol is a return symbol, we begin by identifying rules whose next nonterminal can produce the empty string. In this case, rule \(D\rightarrow c.L\) is a candidate. In addition, recall that the context rule is the most recent unclosed dotted rule, so the return symbol should be derived from the context rule. The context rule is \(A\rightarrow {‹ {a}}.A {b› }L\), which corresponds to the current parse tree’s head rule \(A\rightarrow {‹ {a}}A {b› }.L\). Therefore, we can use the rule \(D\rightarrow c.L\) to extend our parse tree.
Then, we continue to the next state in the trace: \(\lbrace (A\rightarrow {‹ {a}}.A {b› }L,A\rightarrow c.D) \rbrace\). We first observe that the rule \(A\rightarrow c.D\) is connected to the head rule \(D\rightarrow c.L\). In addition, since the symbol to match next is a plain symbol, the context rule should not change and should be \(A\rightarrow {‹ {a}}.A {b› }L\). To determine the context of the head rule efficiently, we introduce a stack T to store the current context, as shown below.
Note that the stack \(T_0\) could alternatively store the rule \(A\rightarrow {‹ {a}}.A {b› }L\); the difference is just about the location of the dot. With the stack, we know that the context is the same and prepend rule \(A\rightarrow c.D\) to the parse tree.
We continue to the next state.
Extracting a matching rule for the call symbol is fairly simple at this point: We just pick the rule that corresponds to the top of the stack T; in this case, it is \(A\rightarrow {‹ {a}}.A {b› }L\). Our final parse tree is as follows:
Again, in general, there may be multiple parse trees for the input string, and we need to maintain a set of parse trees during the extraction.
Now, we formally define parser PDA states and stacks and the derivative functions for well-matched VPGs as follows (the definitions for general VPGs can be found in Appendix C):
Definition 6.2 (Plain, Call, and Return Rules).
Given a VPG \(G=(\Sigma ,V,P,L_0)\), based on the dotted rules in \(\dot{P}\),
(1)
we define a set of plain rules \(\mathcal {P}_{\text{pln}}\) as \(\lbrace L\rightarrow c.L_1 \in \dot{P}\,\mid \,c \in \Sigma _{\text{plain}}\rbrace\);
(2)
we define a set of call rules \(\mathcal {P}_{\text{call}}\) as \(\lbrace L\rightarrow {‹ {a}}.L_1 \in \dot{P} \rbrace \cup \lbrace L \rightarrow {‹ {a}}. L_1 {b› }L_2 \in \dot{P} \rbrace\);
(3)
we define a set of return rules \(\mathcal {P}_{\text{ret}}\) as \(\lbrace L\rightarrow {b› }.L_1 \in \dot{P} \rbrace \cup \lbrace L \rightarrow {‹ {a}}L_1 {b› }. L_2 \in \dot{P} \rbrace\).
Definition 6.3 (Parser PDA States and Stacks).
Given a VPG, we introduce a PDA whose states are subsets of \((\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace) \times (\dot{P} \cup \lbrace L_0\rightarrow .L_0\rbrace)\) and whose stack contains PDA states. Intuitively, the stack remembers a series of past contexts, which are used for matching future return symbols. We call a pair \((m,T)\) a configuration, with m being the state and T being the stack. A rule pair \((r^{\prime },r)\) in state m tells that the current rule is r and the context rule is \(r^{\prime }\). The context rule \(r^{\prime }\) is the rule that generates the last unmatched call symbol; when all call symbols are matched with return symbols, \(r^{\prime }\) is \(\text{None}\). Initially, since the parser PDA has not read any symbol, there is no current rule; so we define the initial PDA state as \(m_0=\lbrace (\text{None}, L_0\rightarrow .L_0)\rbrace\), where \(L_0\rightarrow .L_0\notin \dot{P}\) is a special rule that appears only in \(m_0\).
The initial stack \(T_0\) is empty. When reading a call symbol, the parser PDA pushes the current state to its stack; when reading a return symbol, it removes the top of the stack. For a PDA stack T, the function \(\operatorname{head }(T)\) returns the top of the stack and is defined as \(\operatorname{head }(m\cdot T^{\prime })=m\) and \(\operatorname{head }(\bot)=\text{None}\).
For parser PDA construction, we define three kinds of derivative functions \(p_c\), \(p_{{‹ {a}}}\), and \(p_{ {b› }}\), similar to the case of recognizer construction.
Definition 6.4 (The Derivative Function p for the Parser PDA)
Given a VPG \(G=(V,\Sigma ,P,L_0)\), suppose the current state of the parser PDA is m and the current stack is T. The transition functions \(p_c\), \(p_{{‹ {a}}}\), and \(p_{ {b› }}\) are defined as follows:
(1)
\(p_c(m)=(m^{\prime },\lambda T.T)\), where
\begin{equation*} m^{\prime }=\lbrace (r^{\prime },\mathrm{nextNT}\left(r\right)\rightarrow c. L_1) \mid (r^{\prime },r)\in m \wedge (\mathrm{nextNT}\left(r\right)\rightarrow c. L_1)\in \dot{P}\rbrace . \end{equation*}
For each pair \((r^{\prime },r)\) in m, the new state keeps the context rule \(r^{\prime }\) and updates the current rule to a rule with head \(\mathrm{nextNT}\left(r\right)\) and that derives c.
(2)
\(p_{{‹ {a}}}(m)=(m^{\prime },\lambda T.\ m\cdot T)\), where
\begin{equation*} m^{\prime }=\lbrace (r^{\prime \prime },r^{\prime \prime }) \mid (r^{\prime },r)\in m \wedge r^{\prime \prime }=(\mathrm{nextNT}\left(r\right)\rightarrow {‹ {a}}. L_1 {b› }L_2)\in \dot{P}\rbrace . \end{equation*}
For each pair \((r^{\prime },r)\) in m, the new state changes the context rule and the current rule to a rule with head \(\mathrm{nextNT}\left(r\right)\) and that derives \({‹ {a}}\). We further push the current state m to the stack.
(3)
\(p_{ {b› }}(m,\overline{m_{\text{call}}})=(m^{\prime },\text{tail})\), where \(\text{tail}\) is a function that returns the tail of the stack, \(\overline{m_{\text{call}}}=\operatorname{head }(T)\), and
\begin{equation*} m^{\prime }=\lbrace (r^{\prime }, L\rightarrow {‹ {a}}L_1 {b› }. L_2) \mid (r_1,r_2)\in m \wedge r_1=(L\rightarrow {‹ {a}}. L_1 {b› }L_2) \wedge (\mathrm{nextNT}\left(r_2\right)\rightarrow \epsilon)\in P \wedge (r^{\prime },r)\in \overline{m_{\text{call}}} \wedge \mathrm{nextNT}\left(r\right)=L\rbrace . \end{equation*}
We remove the top \(\overline{m_{\text{call}}}\) of the stack T and, for each pair \((r_1,r_2)\) in m where the context rule is \(r_1=L\rightarrow {‹ {a}}. L_1 {b› }L_2\), we match \({b› }\) with \({‹ {a}}\) and update the current rule to \(L\rightarrow {‹ {a}}L_1 {b› }. L_2\). We also require that \((\mathrm{nextNT}\left(r_2\right)\rightarrow \epsilon)\in P\). Note that we must have \(\overline{m_{\text{call}}}\ne \emptyset\), since there is an unmatched call symbol generated by \(r_1\). For the context rule of a new pair, we select a pair \((r^{\prime },r)\) in \(\overline{m_{\text{call}}}\) where \(\mathrm{nextNT}\left(r\right)=L\) and restore the context to \(r^{\prime }\).
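A minimal sketch of the three transition functions, under the same simplifying assumptions as the earlier recognizer sketch (the namedtuple encoding of dotted rules is ours, not the paper's):

```python
from collections import namedtuple

# Dotted rules of a well-matched VPG, in the three kinds of Definition 6.2:
Pln     = namedtuple("Pln", "L c L1")           # L -> c . L1
CallDot = namedtuple("CallDot", "L a L1 b L2")  # L -> <a . L1 b> L2 (call rule)
RetDot  = namedtuple("RetDot", "L a L1 b L2")   # L -> <a L1 b> . L2 (return rule)

def next_nt(r):
    """The nonterminal after the dot."""
    return r.L2 if isinstance(r, RetDot) else r.L1

def p_plain(m, c, plain_rules):
    """p_c: keep the context rule, advance the current rule with c."""
    return frozenset((ctx, Pln(L, c, L1))
                     for (ctx, r) in m for (L, c2, L1) in plain_rules
                     if L == next_nt(r) and c2 == c)

def p_call(m, a, match_rules):
    """p_<a: the new matching rule becomes both the context and the current
    rule; the caller pushes the old state m onto the stack."""
    return frozenset((r2, r2)
                     for (ctx, r) in m for (L, a2, L1, b, L2) in match_rules
                     if L == next_nt(r) and a2 == a
                     for r2 in [CallDot(L, a2, L1, b, L2)])

def p_ret(m, b, m_call, nullable):
    """p_b>: close the context's matching rule and restore the context
    recorded in the stack top m_call; the caller pops the stack."""
    return frozenset((ctx, RetDot(*r1))
                     for (r1, r2) in m
                     if isinstance(r1, CallDot) and r1.b == b
                     and next_nt(r2) in nullable
                     for (ctx, r) in m_call if next_nt(r) == r1.L)

# The pseudo-rule L0 -> . L0 can be encoded as Pln(L0, None, L0), so the
# initial state is m0 = frozenset({(None, Pln("L", None, "L"))}).
```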
With derivative functions defined, Algorithm 2 constructs the parser PDA by computing the least solution of the following equation; it makes sure that states are closed under derivatives.
Definition 6.5 (Traces of Running the Parser PDA).
Given a string w, we write \(w_i\) for its ith symbol. Suppose the sequence of transitions in a parser PDA for input w is \((m_0,\bot) \stackrel{w_1}{\longrightarrow } (m_1, T_1) \cdots \stackrel{w_{|w|}}{\longrightarrow } (m_{|w|}, T_{|w|})\). We define the trace of running the parser PDA for w, denoted as \(\operatorname{forest}(w)\), as the list \([m_1, m_2, \ldots , m_{|w|}]\).
Since each \(m_i\) in \(\operatorname{forest}(w)\) is a set of possible rules at that step of parsing, \(\operatorname{forest}(w)\) can be viewed as the representation of all parse trees of input w; that is, \(\operatorname{forest}(w)\) is the parse forest of w.
Finally, we define the acceptance configuration for the parser PDA as follows:
Definition 6.6 (Parser PDA Acceptance Configurations).
Given a VPG \(G=(V,\Sigma ,P,L_0)\), a pair \((m,E)\) is an acceptance configuration for the parser PDA if the state m includes a pair \((\text{None},r)\), where \(\mathrm{nextNT}\left(r\right)\) derives the empty string \(\epsilon\), i.e., \(\exists r,\ (\text{None},r)\in m\ \wedge \ (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P\).
Note that the context \(\text{None}\) indicates that the stack E is empty. For general VPGs, however, the contexts in the state of an acceptance configuration could also be pending rules, indicating that the stack E is not empty, as mentioned in Definition C.2.
6.2 Extracting Parse Trees
In this section, we discuss how to extract parse trees for well-matched VPGs and put the discussion for general VPGs in Appendix C. First, we explain our design choice to accomplish efficient extraction, and then we define the extraction functions formally.
After running the parser PDA on input w, we have \(\operatorname{forest}(w)=[m_1, m_2, \ldots , m_{|w|}]\). Intuitively, for each parse tree \(v=[r_1,\ldots ,r_{|w|}]\) of w, we have \(r_i\in m_i\), \(i=1..|w|\). Our next step is to extract the parse trees from \(\operatorname{forest}(w)\). In Example 6.2, we showed the backward way for extraction, where we first extract valid rules from \(m_{|w|}\) then extend them backward with the rules from \(m_{|w|-1}\) to \(m_1\). There is also a forward way, where we first extract valid rules from \(m_1\) and then extend them with the rules extracted from \(m_2\) to \(m_{|w|}\). Although both ways would give the same set of parse trees, we choose the backward way for better performance. The forward extraction may maintain invalid parse trees during extraction; it may in the worst case maintain an exponential number of invalid parse trees and remove them in later steps, which is inefficient. The backward way, however, maintains only valid parse trees. For an unambiguous grammar in particular, the backward way always maintains one parse tree and guarantees linear-time extraction. We give a concrete example as follows:
Example 6.3.
Consider the unambiguous grammar
and the input string \(w={{‹ {a}}}^n { {b› }}^n\). The parse forest of w is “\(m_1^n + m_2^{n-1} + [m_3]\),” where the notation \(m_1^n\) stands for the list that contains n copies of \(m_1\), and
With the forward extraction, we first extract partial parse trees from \(m_1\). There are two possible partial parse trees \([L \rightarrow {‹ {a}}. L {b› }E]\) and \([L \rightarrow {‹ {a}}. L {d› } E]\) (their context rules can be ignored for now); we have to extract both of them, although the second one will become invalid. We then extend each partial parse tree with rules from the next \(m_1\); e.g., a new parse tree is \([L \rightarrow {‹ {a}}. L {b› }E,\ L \rightarrow {‹ {a}}. L {b› }E]\). This process is repeated, and we get \(2^n\) partial parse trees for \({{‹ {a}}}^n\), since there are \(2^n\) ways to parse \({{‹ {a}}}^n\). However, only one of them is valid considering the following input \({ {b› }}^n\); those invalid ones will be removed when considering \(m_2^{n-1} + [m_3]\). The time complexity of this process is clearly non-linear.
In contrast, with the backward extraction, we first extract the only partial parse tree \([L \rightarrow {‹ {a}}. L {b› }E]\) from \(m_3\), then extend it with the rule \(L \rightarrow {‹ {a}}L {b› }. E\) from \(m_2\). We also introduce a stack for each partial parse tree to store the rules of unmatched return symbols, so later we know how to extend the partial parse tree with matching rules. After extracting rules from \({m_2}^{n-1} + [m_3]\), we have only one partial parse tree whose stack is \(\left(L \rightarrow {‹ {a}}L {b› }. E \right)^n\cdot \bot\). Based on this stack and \(m_1\), only rule \(L\rightarrow {‹ {a}}. L {b› }E\) can be used for valid backward extension. In this way, only one valid parse tree is constructed in the end. We will prove that the backward extraction always generates valid parse trees in Lemma D.16.
Now, we define a partial parse tree as a parse tree paired with a stack as follows:
Definition 6.7 (Partial Parse Trees).
A partial parse tree \((v,E)\) is a parse tree \(v=[r_1,\ldots ,r_n]\) paired with a stack \(E=r_{n_1}\cdots r_{n_k}\cdot \bot\), where \({n_1}\lt \cdots \lt {n_k}\) and \(r_{n_1},\ldots ,r_{n_k}\) are the rules that generate unmatched return symbols.
Formally, we use function \({\operatorname{extract}_\mathrm{init}}\) to extract a set of partial parse trees V from \(m_{|w|}\) of \(\operatorname{forest}(w)\) and function \({\operatorname{extract}_{\mathrm{oneStep}}}\) to extend the current set V of partial parse trees.
\begin{align*} V_{|w|} &\leftarrow \operatorname{extract}_\mathrm{init}(m_{|w|});\\ V_i &\leftarrow \operatorname{extract}_{\mathrm{oneStep}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1, \end{align*}
where \(V_1\) is the final parsing result that includes all valid parse trees of w.
Next, we define different kinds of states and define \({\operatorname{extract}_\mathrm{init}}\) and \({\operatorname{extract}_{\mathrm{oneStep}}}\).
Definition 6.8 (Plain, Call, and Return States).
Given a VPG \(G=(\Sigma ,V,P,L_0)\), based on the dotted rules in \(\dot{P}\),
(1)
we define a set of plain states \(\mathcal {M}_{\text{pln}}\) as \(\left(\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace \right)\times \mathcal {P}_{\text{pln}}\);
(2)
we define a set of call states \(\mathcal {M}_{\text{call}}\) as \(\left(\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace \right)\times \mathcal {P}_{\text{call}}\);
(3)
we define a set of return states \(\mathcal {M}_{\text{ret}}\) as \(\left(\mathcal {P}_{\text{call}}\cup \lbrace \text{None}\rbrace \right)\times \mathcal {P}_{\text{ret}}\).
Definition 6.9 (The Extraction Function \(\operatorname{extract}_\mathrm{init}\))
We first define \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\), which extracts all valid rules from m, then extend \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\) to \(\operatorname{extract}_\mathrm{init}\), which pairs each extracted rule with a stack.
\begin{equation*} \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m)=\lbrace r \mid (r^{\prime },r)\in m, (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P \wedge r^{\prime } \text{ is None} \rbrace . \end{equation*}
A pair \((r^{\prime },r)\) in m corresponds to a parse tree that ends with r and whose last unmatched call symbol is generated by \(r^{\prime }\). This parse tree is valid if \(\mathrm{nextNT}\left(r\right)\) can match the empty string and \(r^{\prime }\) is None. The context \(r^{\prime }\) cannot be a matching rule, since a return symbol would be further needed.
If \(m\subseteq \mathcal {M}_{\text{ret}}\), then we also push r to the stack so in a later step during backward extraction, we can match the return symbol of this step using a matching rule; otherwise, the stack is \(\bot\). Formally, \(\operatorname{extract}_\mathrm{init}(m)=\lbrace ([r],E) \mid r\in \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m)\rbrace\), where \(E=r\cdot \bot\) if \(m\subseteq \mathcal {M}_{\text{ret}}\) and \(E=\bot\) otherwise. Note that from the definition of the derivative functions (Definition 6.4), we can easily show that any state in the parser PDA is a subset of \(\mathcal {M}_{\text{pln}}\), \(\mathcal {M}_{\text{call}}\), or \(\mathcal {M}_{\text{ret}}\); it cannot be the case that a parser PDA state has mixed pairs (e.g., some from \(\mathcal {M}_{\text{call}}\) and others from \(\mathcal {M}_{\text{ret}}\)).
Definition 6.10 (The One-step Extraction Function).
We first define \({\operatorname{extendable}}\), which decides whether \((r^{\prime },r)\) can be used to extend v. Based on \(\operatorname{extendable}\), we define \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}\), which extracts rules from m and uses them to extend the partial parse trees in V. Finally, we define \(\operatorname{extract}_{\mathrm{oneStep}}\), which modifies the stacks of the extended trees:
\begin{align*} \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{pre}}(m,V)= &\left\lbrace (r::v,E) \mid (v,E)\in V \ \wedge \ (r^{\prime },r)\in m \ \wedge \ \operatorname{extendable}((r^{\prime },r),{(v,E)})\right\rbrace . \end{align*}
Intuitively, each valid partial parse tree v can be split into \(v_1\) and \(v_2\) such that \(v=v_1+v_2\), \((r^{\prime },\operatorname{lastRule}(v_1))\) is included in m for some \(r^{\prime }\), and \((v_2,E)\) is included in V for some E. Therefore, if \(r^{\prime }\) is None, then E is empty; if \(r^{\prime }\) is a matching rule, then the top of E should be the same matching rule. In the first case, we require that r is connected to v, i.e., \(\mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right)\). In the second case, if \(\operatorname{firstRule}(v)\) is a rule that derives a return symbol, then we require that \(\mathrm{nextNT}\left(r\right)\) can derive the empty string; otherwise, we still require that r is connected to v.
\(\operatorname{extract}_{\mathrm{oneStep}}\) modifies E based on the kind of m. If a rule in m derives a call symbol, then the call symbol matches the return symbol (if it exists) recorded on the top of the stack, and we therefore remove the top of the stack. If a rule in m derives a return symbol, then we push the rule for the return symbol to the stack so it can match a call symbol later during backward extraction.
Now, we define the \(\operatorname{extract}\) function that extracts the set of parse trees of w from a forest.
Definition 6.11 (The Extraction Function).
Given a nonempty \(\operatorname{forest}(w)=[m_1,\ldots , m_{|w|}]\), define the sequence of partial parse trees as
\begin{align*} V_{|w|} &= \operatorname{extract}_\mathrm{init}(m_{|w|});\\ V_i &= \operatorname{extract}_{\mathrm{oneStep}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1. \end{align*}
Then, we define \(\operatorname{extract}(\operatorname{forest}(w)) = V_1\) as the set of parse trees extracted from \(\operatorname{forest}(w)\).
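The extraction functions can be sketched as follows, reusing the `Pln`/`CallDot`/`RetDot` encoding and `next_nt` from the parser PDA sketch; the stack E is a Python list with its top at the end. As before, this is our illustration of the definitions, not the verified code.

```python
def same_matching_rule(ctx, top):
    """ctx (a CallDot) and top (a RetDot) denote the same matching rule,
    differing only in the position of the dot."""
    return tuple(ctx) == tuple(top)

def extendable(ctx, r, v, E, nullable):
    """Can the pair (ctx, r) extend the partial parse tree (v, E)?"""
    if ctx is None:                                # all call symbols matched
        return not E and next_nt(r) == v[0].L      # r connects to headNT(v)
    if isinstance(ctx, CallDot):                   # an unclosed matching rule
        if not (E and same_matching_rule(ctx, E[-1])):
            return False
        if isinstance(v[0], RetDot):               # the next symbol is a return
            return next_nt(r) in nullable
        return next_nt(r) == v[0].L
    return False

def extract_init(m, nullable):
    """Initial partial parse trees, from the last state of the forest."""
    rules = [r for (ctx, r) in m if ctx is None and next_nt(r) in nullable]
    # for a return state, remember the return rule on the stack
    return [([r], [r] if isinstance(r, RetDot) else []) for r in rules]

def extract_one_step(m, V, nullable):
    """Extend every partial parse tree in V backward by one rule from m."""
    out = []
    for (v, E) in V:
        for (ctx, r) in m:
            if extendable(ctx, r, v, E, nullable):
                E2 = list(E)
                if isinstance(r, CallDot) and E2:
                    E2.pop()           # the call matches the return rule on top
                if isinstance(r, RetDot):
                    E2.append(r)       # remember r for a future call symbol
                out.append(([r] + v, E2))
    return out
```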
6.3 Optimizing the Extraction as a PDA
For an unambiguous grammar, there is only one parse tree for any accepted input string. Even for an ambiguous grammar, there may be a preferred parse tree among the possible parse trees for an input string. In this section, we optimize the parse-tree extraction process to an extraction PDA; instead of running the extraction functions, we can run the extraction PDA on the parse forest and get a trace that is a valid parse tree. Running a PDA is much faster, and our evaluation in Section 8 is based on the extraction PDA.
In Lemma D.17, Appendix D, we will show that the extraction functions always extract valid parse trees. Therefore, to extract a single parse tree, we need to pick one partial parse tree \((v,E)\) from \(\operatorname{extract}_\mathrm{init}(m_{|w|})\) according to some criterion; similarly, if \(\operatorname{extract}_{\mathrm{oneStep}}(m,V)\) returns multiple parse trees, then one needs to be picked. To see why this process can be represented as a PDA, note that the extraction functions depend only on the first rule of the current partial parse tree v and the top of E; so we can view \((\operatorname{firstRule}(v),E)\) as a PDA configuration. We define the PDA states and stacks as follows:
Definition 6.12 (Extraction PDA States and Stacks).
Given a VPG, we introduce an extraction PDA whose states are dotted rules in \(\dot{P}\) and whose stack contains rules in \(\mathcal {P}_{\text{ret}}\). Intuitively, the stack remembers a series of contexts in terms of return rules, which are used to match call symbols during backward extraction. We call a pair \((r,E)\) a configuration, with r being the state and E being the stack.
We define extraction PDA transition functions \(q_\text{init}\) and q based on extraction functions \(\operatorname{extract}_\mathrm{init}\) and \(\operatorname{extract}_{\mathrm{oneStep}}\). In the following definition, we write \(\operatorname{pickOne}(V)\) for a function that returns a rule from a list of partial parse trees V; if V is empty, then we define \(\operatorname{pickOne}(V)\) to be \(\text{None}\). In practice, \(\operatorname{pickOne}\) can be defined, e.g., based on priority of rules; we leave it abstract in this section.
Definition 6.13 (The Single-parse-tree Extraction Function \(q_\text{init}\))
Let \(r=\operatorname{pickOne}(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m))\). If \(r=\text{None}\), then we define \(q_\text{init}(m)=\text{None}\). Otherwise, we define
Definition 6.14 (The Transition Function q for the Extraction PDA)
Given a current configuration \((r,E)\), let \(t=\operatorname{head }E\) and \(r^{\prime }=\operatorname{pickOne}(\operatorname{extract}_{\mathrm{oneStep}}(m, \lbrace ([r], {t}\cdot \bot) \rbrace))\). If \(r^{\prime }=\text{None}\), then we define \(q(m,r,t)=\text{None}\). Otherwise, we define
where A is the set of all parser PDA states. Note that we do not need to compute the acceptance state for an extraction PDA: For a forest with the last state m, as long as the initial state \(\operatorname{extract}_\mathrm{init}(m)\) is not empty, the extraction is guaranteed to be successful; if the initial state \(\operatorname{extract}_\mathrm{init}(m)\) is empty, then the input string is invalid.
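The following OCaml sketch shows how the extraction PDA can be driven over a forest to produce a single trace. The types and the transition functions `q_init` and `q` are assumptions standing in for the precomputed tables; in an implementation they would be hash-table lookups.

```ocaml
(* A minimal sketch of running the extraction PDA (Definitions 6.13 and
   6.14) over a parse forest, threading a configuration (r, E) backward. *)
type rule = string
type state = (rule option * rule) list    (* one entry of forest(w) *)
type config = rule * rule list            (* (current rule, stack E) *)

let run_extraction_pda
    ~(q_init : state -> config option)
    ~(q : state -> rule -> rule option -> config option)
    (forest : state list) : rule list option =
  match List.rev forest with
  | [] -> None
  | m_last :: rest ->
    match q_init m_last with
    | None -> None                        (* the input string is invalid *)
    | Some (r0, e0) ->
      let step acc m =
        match acc with
        | None -> None
        | Some ((r, e), trace) ->
          let top = match e with [] -> None | t :: _ -> Some t in
          (match q m r top with
           | None -> None
           | Some (r', e') -> Some ((r', e'), r' :: trace))
      in
      (* Prepending at each backward step leaves the trace in source
         order: [r_1; ...; r_{|w|}]. *)
      (match List.fold_left step (Some ((r0, e0), [ r0 ])) rest with
       | None -> None
       | Some (_, trace) -> Some trace)
```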
6.4 Correctness Proof
In this subsection, we prove the correctness of the parse trees constructed according to Sections 6.1 and 6.2. The correctness of the optimization in Section 6.3 using an extraction PDA follows as a corollary, since the extraction PDA extracts one parse tree out of possibly many parse trees returned by the extraction functions in Section 6.2.
Given a VPG \(G=(V,\Sigma ,P,L_0)\) and an input string w, let \(V=\operatorname{extract}(\operatorname{forest}(w))\). We call V correct if it includes exactly the set of valid parse trees of w according to G, which we state formally as the following theorem:
Theorem 6.1 (Correctness of the Parser Generator).
We provide detailed lemmas and proofs in Appendix D and describe the main steps next. Remember that the parser PDA builds a set of partial parse trees in a forward way and stores the trees in a forest, from which the extraction functions extract a set of partial parse trees in a backward way. Therefore, for the correctness proof, we formalize two small-step relations: one for building forward parse trees and one for building backward parse trees. We then show that both small-step relations are equivalent to the big-step parse-tree derivation relation under certain conditions. For the backward direction of Theorem 6.1 (i.e., soundness), we prove that during backward extraction for an input string w, each extracted backward partial parse tree \(v_2\) has a counterpart forward partial parse tree \(v_1\) in the rest of the forest, and \(v_1+v_2\) is a valid parse tree of w. At the end of extraction, we have \(v_1=[]\) and \(v_2\) is a valid parse tree of w. For the forward direction of Theorem 6.1 (i.e., completeness), we show that each valid parse tree v can be split as \(v=v_1+v_2\), and there is an extraction step where \(v_1\) is in the rest of the forest and \(v_2\) is an extracted partial parse tree. Therefore, at the end of extraction, we have \(v_1=[]\) and \(v=v_2\) is in the extracted parse trees.
6.5 Verification of Time Complexity
Next, we discuss how we verify the performance of our VPG parser. For an input string w with a total of k valid parse trees, we show that the VPG parser takes \(O(k|w|)\) time. When \(k=1\), the time complexity becomes linear (i.e., \(O(|w|)\)). The verification is performed in two steps: (1) it takes \(O(|w|)\) time to run a parser PDA and generate a parse forest; and (2) it takes \(O(k|w|)\) time to run the extraction function on the forest when extracting all of the k parse trees.
We first introduce our method of counting time cost in Coq. Our verification utilizes a Coq monadic library [29], which provides a monad for tracking the time cost of computation. To pair a value a of type A with a time cost c, we utilize a dependent type \(C(A,P)\), defined to be
where C is a type constructor and \(P\ a\ c\) is a property of a and c. In this article, the above dependent type is written using the following notation:
\begin{equation*} \lbrace a : A \mid c \mid P\ a\ c \rbrace . \end{equation*}
For example, given \(k\in \mathbb {N}\), a value \(a^{\prime }\) of type \(\lbrace a : A \mid c \mid c=k \rbrace\) means that it takes k time units to compute the underlying value a of type A.
To correctly count the time cost, programs must be written as a combination of the return and bind monadic operators provided by the library [29]. We briefly introduce them as follows; an illustrative sketch appears after the list:
(1)
The return operator, denoted as “\(\Leftarrow {a}\)”, returns a value a bound with 0 time cost.
(2)
The insertion operator, denoted as “\(\mathrel {{+}{=}}{k};\ {a^{\prime }}\)”, adds k time units to the monadic value \(a^{\prime }\).
(3)
The bind operator, denoted as “\({a}\leftarrow {e_1};\ {e_2}\)”, first evaluates the computation \(e_1\) and binds its value to a, which is used to evaluate \(e_2\); it further pairs the result with the sum of the time costs of \(e_1\) and \(e_2\).
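The following OCaml sketch mimics the three operators. It is our toy rendition, not the Coq library of McCarthy et al. [29]: there, the cost lives in a dependent type and is erased at extraction, whereas here it is a runtime field.

```ocaml
(* A toy cost-counting monad: a computation is a value paired with its
   accumulated time cost. *)
type 'a cost = { value : 'a; cost : int }

let ret (a : 'a) : 'a cost = { value = a; cost = 0 }          (* "<= a"     *)

let tick (k : int) (m : 'a cost) : 'a cost =                  (* "+= k; m"  *)
  { m with cost = m.cost + k }

let bind (m : 'a cost) (f : 'a -> 'b cost) : 'b cost =        (* "a <- m; f" *)
  let r = f m.value in
  { value = r.value; cost = m.cost + r.cost }

(* Example: a length function charging one unit per cons cell visited. *)
let rec length = function
  | [] -> ret 0
  | _ :: tl -> bind (length tl) (fun n -> tick 1 (ret (n + 1)))

let () =
  let r = length [ 'a'; 'b'; 'c' ] in
  Printf.printf "value = %d, cost = %d\n" r.value r.cost  (* value = 3, cost = 3 *)
```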
In our verification, the time cost of a function is counted by accumulating the time cost of its components, except for three kinds of primitive functions with unit cost:
(1)
eq_sym_ID, which compares the IDs of two symbols (nonterminals or terminals).
(2)
andb and orb, which compute the conjunction and disjunction of two Booleans, respectively.
(3)
cons a l, which constructs a list from head a and tail l, and pair a b, which makes a pair.
The insertion function we introduced earlier is used in only two ways: (1) to add the time cost cost_eq_sym_ID, cost_andb, cost_orb, cost_cons, or cost_pair to a corresponding function; e.g., the program
compares the two IDs \(s_1\) and \(s_2\) and returns a Boolean paired with cost_eq_sym_ID; and (2) to add cost_branch, which is the number of variables and functions used in a branch, similar to the counting method of McCarthy et al. [29]. Consider the following example:
In the above example, only one unit of cost is added to the first branch, because only one name (x) is used; three units of cost are added to the second branch, because three names (x, a, and l) are used. These branch costs are inserted manually.
With the aforementioned setup, we implemented two functions \({\operatorname{runPDA}^{\prime }}\) and \({\operatorname{extract}^{\prime }}\), where \({\operatorname{runPDA}^{\prime }}\) runs the parser PDA to generate a forest and \({\operatorname{extract}^{\prime }}\) extracts the set of parse trees from the forest. In particular, \({\operatorname{runPDA}^{\prime }}\) has the following type:
\begin{equation*} \operatorname{runPDA}^{\prime }: \forall \mathcal {T}\ w, \lbrace M : \mathrm{Forest} \mid c \mid c=\operatorname{cost}_{\operatorname{runPDA}}(\mathcal {T},w) \rbrace \end{equation*}
where w is a string, \(\mathrm{Forest}\) is the type of parse forests, and \(\mathcal {T}\), representing the PDA, is a list of mappings in the form of \(((i,S,t),S^{\prime })\). A mapping \(((i,S,t),S^{\prime })\) represents the state transition from configuration \((S,T)\) to \((S^{\prime },T^{\prime })\), where \(t=\operatorname{head }(T)\); the mapping from T to \(T^{\prime }\) is further computed in \({\operatorname{runPDA}^{\prime }}\). We show the upper bound of the cost function \(\operatorname{cost}_{\operatorname{runPDA}}\) in Theorem 6.2.
Theorem 6.2 (The Running-time Upper Bound of the Parser PDA).
Theorem 6.2 shows that it takes \(O(|w|)\) time to run the parser PDA and generate a parse forest. The proof is straightforward, and we omit its discussion.
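As a small illustration of how the mapping list \(((i,S,t),S^{\prime })\) can support constant-time transitions (the OCaml evaluation in Section 8 stores the PDA transitions in hash tables), the following sketch keys a hash table on \((i,S,t)\). The integer encoding of symbols, states, and stack tops is our assumption.

```ocaml
(* A sketch of the parser-PDA transition table, assuming input symbols,
   states, and stack tops are interned as integers (t = 0 encodes None). *)
type key = int * int * int
let table : (key, int) Hashtbl.t = Hashtbl.create 4096

let add_transition (k : key) (s' : int) = Hashtbl.replace table k s'

let step (i : int) (s : int) (t : int) : int option =
  Hashtbl.find_opt table (i, s, t)    (* O(1) expected lookup *)
```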
Now, we turn our attention to the extract function. We previously introduced its definition in Definition 6.11, which maintains a set of partial parse trees. However, it is not efficient to maintain a set-based data structure. Inserting a partial parse tree v into a set, for example, requires us to compare v with trees already in the set to avoid duplicates; comparing each of the \(|w|\) rules in the parse trees would take at least \(|w|\) steps. Even if we hash parse trees before comparison, we would still have to perform hashing for each of the \(|w|\) rules. Therefore, if a partial parse tree is inserted at each of the \(|w|\) steps, then the total time cost is at least quadratic (\(\Omega ({|w|}^2)\)). Obtaining a linear-time algorithm is thus a challenge.
We next describe a linear-time algorithm that takes advantage of the locality of VPG parsing. We replace the set structure with a list structure. But we also have to avoid duplicate partial parse trees in the list. To see why this is a problem, notice that two different elements \((r_1,r)\) and \((r_2,r)\) in a state m can be used to extend the same parse tree v from a list V of partial parse trees, both generating the same tree \(r::v\).
To solve this problem, our observation is that if V includes no duplicates, then for two trees \(v_1,v_2\in V\), we have not only \(v_1\ne v_2\), but also \(r_1::v_1\ne r_2::v_2\) for \(r_1,r_2\in \dot{P}\). Therefore, for each \(v\in V\), we can first compute \(m_v\), the rules in m that can extend v, and remove the duplicates in \(m_v\) to get \(m_v^{\prime }\); we then extend v with \(m_v^{\prime }\) to get \(V_v\) and finally concatenate \(V_v\) for all \(v\in V\). To avoid duplicates in \(m_v\), we compare elements in \(m_v\), which takes \(O(|m|^2)\) time. Since m belongs to \((P\cup \lbrace \text{None}\rbrace) \times P\), its size \(|m|\) is at most \((|P|+1)|P|\); therefore, \(O(|m|^2)\) is independent of the input string size and can be viewed as a constant.
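The following OCaml sketch shows this extend-after-deduplication step; the types and the `extendable` placeholder are simplifying assumptions.

```ocaml
(* For each partial tree v, collect the rules of m that can extend it,
   deduplicate those rules (an O(|m|^2) step, constant in |w|), and then
   extend v with each surviving rule. *)
type rule = string
type state = (rule option * rule) list
type ptree = rule list * rule list

let extendable (_ : rule option * rule) (_ : ptree) = true   (* assumed *)

let rm_dup (rs : rule list) : rule list =
  List.fold_left (fun acc r -> if List.mem r acc then acc else r :: acc)
    [] rs
  |> List.rev

let extend_one_step (m : state) (vs : ptree list) : ptree list =
  List.concat_map
    (fun (v, e) ->
      let m_v = List.filter (fun p -> extendable p (v, e)) m in
      let m_v' = rm_dup (List.map snd m_v) in    (* distinct rules only *)
      List.map (fun r -> (r :: v, e)) m_v')
    vs
(* If vs has no duplicates, the output has none: distinct heads r on the
   same v, or the same r on distinct v's, yield distinct trees r :: v. *)
```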
Definitions 6.15, 6.16, and 6.17 implement this optimized version of the extract function. They utilize the function \({\operatorname{rmDup}}\), which removes duplicates from a list; the function \({\operatorname{list}}\), which converts a set to a list; and the standard functions \(\operatorname{map}\), \(\operatorname{filter}\), and \(\operatorname{concat}\), whose standard definitions we omit.
Definition 6.15 (The Optimized Implementation \({\operatorname{extract}_\mathrm{init}^{\mathrm{op}}}\))
Definition 6.16 (The Optimized Implementation \({\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}}\))
Given a VPG \(G=(\Sigma ,V,P,L_0)\), a set of partial parse trees V and a state m, we first define functions \(\operatorname{extract}^{\mathrm{filt}}\), \(\operatorname{extract}^{\mathrm{map}}\), and \(\operatorname{extract}^{\mathrm{concat}}(m,V)\), and then define \({\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}}\) based on them.
Definition 6.17 (The Optimized Implementation \({\operatorname{extract}^{\mathrm{op}}}\))
Given a \(\operatorname{forest}(w)=[m_1,\ldots , m_{|w|}]\), where \(w\ne \epsilon\), define the sequence of partial parse trees as
\begin{align*} V_{|w|} &= \operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m_{|w|});\\ V_i &= \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1. \end{align*}
Then, we define \(\operatorname{extract}^{\mathrm{op}}(\operatorname{forest}(w)) = V_1\) as the list of parse trees extracted from \(\operatorname{forest}(w)\).
Based on \({\operatorname{extract}^{\mathrm{op}}}\), we implemented a version of the extract function that performs cost counting:
\begin{equation*} \operatorname{extract}^{\prime }: \forall w,\ \lbrace V : \text{ParseTreeList} \mid c \mid V=\operatorname{extract}^{\mathrm{op}}(\operatorname{forest}(w))\ \wedge \ c=\operatorname{cost}_{\operatorname{extract}}(\operatorname{forest}(w)) \rbrace \end{equation*}
We show the upper bound of the cost function of \(\operatorname{extract}^{\prime }\) in Theorem 6.3, which also includes the correctness discussed in Section 6.4.
Theorem 6.3 (The Running-time Upper Bound of Extraction).
There are three properties in Theorem 6.3. The first property, “NoDup(\(V_1\))”, means there are no duplicates in \(V_1\); therefore, \(|V_1|\) is the number of valid parse trees. The second property shows that the time complexity of \({\operatorname{extract}^{\prime }}\) is \(O(|V_1||w|)\). The third property shows correctness, i.e., that \(V_1\) is the set of parse trees of w.
Proof.
During the extraction, we maintain lists \(V_i,i=1..|w|\) of partial parse trees:
\begin{align*} V_{|w|} &\leftarrow \operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m_{|w|});\\ V_i &\leftarrow \operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}({m_i},V_{i+1}),i \mbox{ goes from } {|w|}-1 \mbox{ to } 1. \end{align*}
We will prove in Theorem 6.4 that \(V_i\) has no duplicates for \(i=1..|w|\). When \(i=1\), we get the first property. To prove the second property, we show that the size of \(V_i\) increases during backward extraction in Theorem 6.5, i.e.,
Then, we prove that during extraction, for each \((m,V)\), it takes \(O({|m|^2|}V|)\) time to evaluate \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}(m,V)\). To do this, we implement the monadic function
and prove in Theorem 6.6 that \(\operatorname{cost}_{\mathrm{oneStep}}^{\mathrm{Extract}}(m,V)\sim O(|m|^2|V|)\). We then rewrite this bound to \(O(|V|)\), since \(|m| \sim O(|P|^2)\). Finally, since \(\operatorname{extract}^{\mathrm{op}}\) can be viewed as executing \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}\) for \(|w|\) times, we have the time complexity of \(\sum _{i=1}^{|w|}O(|V_i|)\), which is further bounded by \(O(|V_{1}||w|)\) using Theorem 6.5.□
Proof of Theorem 6.4. Let \(V_{\mathrm{init}}=\operatorname{extract}_\mathrm{init}^{\mathrm{op}}(m)\). We prove by induction on \(M_1\). There are two cases.
When \(M_1=[]\), we have \(\operatorname{extract}^{\mathrm{op}}(M_1{+[m]})=V_{\mathrm{init}}\). Since we apply \({\operatorname{rmDup}}\) to get \(V_{\mathrm{init}}\), clearly, we have \(\operatorname{NoDup}(V_{\mathrm{init}})\).
When \(M_1=[m_1]+M_1^{\prime }\) for some \(m_1\), we have \(\operatorname{extract}^{\mathrm{op}}(M_1+[m])=\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}(m_1,V^{\prime })\), where \(V^{\prime }=\operatorname{extract}^{\mathrm{op}}(M_1^{\prime }+[m])\). By the induction hypothesis, we have \(\operatorname{NoDup}(V^{\prime })\). Therefore, from the definition of \(\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}\), we have \(\operatorname{NoDup}(\operatorname{extract}^{\mathrm{op}}(M_1+[m]))\).□
Proof of Theorem 6.5. From the invariant in Lemma D.17, we know that each \({v\in } \operatorname{extract}^{\mathrm{op}}(M_1+[m])\) is a valid partial parse tree; i.e., \(\exists v_1\) s.t. \(v_1+v\) is a parse tree of w. Let \(v_1=v_1^{\prime }+[r]\) for some r; since \([r]+v\) is also a valid partial parse tree, we have \([r]+v\in \operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])\). This gives an injection from \(\operatorname{extract}^{\mathrm{op}}(M_1+[m])\) to \(\operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])\). Therefore, we have \(|\operatorname{extract}^{\mathrm{op}}(M_1+[m])|\le |\operatorname{extract}^{\mathrm{op}}([m^{\prime }]+M_1+[m])|\).□
Theorem 6.6 (The Running-Time Upper Bound of \({\operatorname{extract}_{\mathrm{oneStep}}^{\mathrm{op}}}\))
Proof. For each v in V, it takes \(O(|m|^2)\) time to remove the duplicates in a list of rules; therefore, the total time is \(O(|m|^2|V|)\).□
Note that one benefit of the monadic library is that, since the running times are embedded in propositions, we can extract OCaml code from the Coq implementation; after extraction, the propositions that track running times are erased.
6.6 Coq Mechanization
All definitions and theorems presented in Sections 6.4 and 6.5 are mechanized in the proof assistant Coq. Our proof artifact consists of 19 Coq files and \(\sim\)43k lines of code. The complete verification took \(\sim\) 5 minutes on an Intel® Core™ i7-9700 Processor and 16 GB of memory.
Our Coq implementation includes two hypotheses, namely, A_VPG_Linear and A_VPG_Match, corresponding to the two constraints for general VPGs we discussed in Section 2: A_VPG_Linear requires that for each rule \(L\rightarrow i L^{\prime }\) where \(i\in \Sigma\), if the nonterminal L belongs to \(V^0\), then the nonterminal \(L^{\prime }\) must also belong to \(V^0\); and A_VPG_Match requires that for each matching rule \(L\rightarrow {‹ {a}}L_1 {b› }L_2\), the nonterminal \(L_1\) belongs to \(V^0\), and if the nonterminal L belongs to \(V^0\), then so does the nonterminal \(L_2\). These two hypotheses guarantee that our Coq implementation accepts valid general VPGs. In Appendix E, we summarize the correspondence between theorems discussed in this section and our open-sourced Coq formalization and proofs.
We discuss some key differences between the definitions and theorems presented in Sections 6.4 and 6.5 and their Coq implementation: First, the big-step parse-tree derivation in Figure 4 includes three rules; its Coq mechanization includes two additional rules for deriving pending call and return symbols, respectively. Second, we formalized the semantics of running the parser PDA in Coq as the relation “Forest \(M\ T\ w\),” where M is the trace for the string w and T is the terminal stack. Recall that in Definition 6.5, we denote the trace as \(\operatorname{forest}(w)\). In Coq, the function \(\operatorname{forest}\) is verified to generate a trace that satisfies the relation Forest. Further, in the theorems for extraction PDAs, such as Theorems 6.3 and 6.4, the trace \(\operatorname{forest}(w)\) is replaced with a trace M that satisfies the Forest relation.
7 Designing A Surface Grammar
The format of rules allowed in VPGs is designed to ease the study of their meta-theory but is inconvenient for expressing practical grammars. First, no user-defined semantic actions are allowed. Second, each VPG rule allows at most four terminals/nonterminals on the right-hand side. In this section, we present a surface grammar that is more convenient for writing practical grammars. We first discuss embedding semantic actions. Then, we introduce tagged CFGs, which are CFGs paired with information about how to partition terminals into plain, call, and return symbols. We then describe a translator from tagged CFGs to VPGs. During the conversion, the translator also generates semantic actions that convert the parse trees of VPGs back to those of tagged CFGs.
7.1 Embedding Semantic Actions
Semantic actions transform parsing results into user-preferred formats. In a rule \(L\rightarrow s_1\cdots s_k\), where \(s_i\in \Sigma \cup V\), we treat L as a default action that takes k arguments, which are the semantic values returned by \(s_1\) to \(s_k\), and returns a tree with a root node and \(s_1\) to \(s_k\) as children. The prefix notation of a parse tree gives \(L\ v_{s_1}\cdots v_{s_k}\), where \(v_{s_i}\) is the semantic value for \(s_i\). This notation can be naturally viewed as a stack machine, where L is an action and the \(v_{s_i}\) are the values that get pushed onto the stack before the action. The VPG parse tree can be converted to the prefix notation in a straightforward way. If we then replace each nonterminal in the tree with its semantic action, then the parse tree becomes a stack machine.
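The following OCaml sketch shows such a stack machine in miniature, assuming integer semantic values; the instruction type and evaluator are illustrative assumptions, not the implementation.

```ocaml
(* A toy stack machine for the prefix notation: Push places a semantic
   value v_{s_i} on the stack, and Act (k, f) pops k arguments and pushes
   the action's result. *)
type instr =
  | Push of int
  | Act of int * (int list -> int)

let eval (prog : instr list) : int =
  let step stack = function
    | Push v -> v :: stack
    | Act (k, f) ->
        (* Pop k values; popping restores the original argument order. *)
        let rec pop n acc st =
          if n = 0 then (acc, st)
          else
            match st with
            | x :: st' -> pop (n - 1) (x :: acc) st'
            | [] -> failwith "stack underflow"
        in
        let args, rest = pop k [] stack in
        f args :: rest
  in
  match List.fold_left step [] prog with
  | [ v ] -> v
  | _ -> failwith "ill-formed program"

(* eval [ Push 1; Push 2; Act (2, List.fold_left ( + ) 0) ] = 3 *)
```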
The default action for a nonterminal can be replaced by a user-defined action appended to each rule in the grammar. For example, consider the grammar \(L = c L\mid {‹ {a}}L {b› } L\mid \epsilon\). Suppose we want to count the number of the symbol c in an input string; we can specify semantic actions in the grammar as follows:
In the above example, a semantic action is specified after each rule, e.g., “@{let \(f_1\ v_1\ v_2 = 1+v_2\)}.” In the actions, \(v_1\), \(v_2\), \(v_3\), and \(v_4\) represent the semantic values returned by the right-hand-side symbols of the rule. For example, the first semantic action \(f_1\ v_1\ v_2\ =\ 1+v_2\) accepts two semantic values \(v_1\) and \(v_2\), where \(v_1\) is returned by c and \(v_2\) is returned by L. In particular, the definition “@{let \(f_3\ () = 0\)}” defines a function that takes no arguments.
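As plain OCaml, the actions of this example look as follows. \(f_1\) and \(f_3\) are taken from the text; \(f_2\), the action for the matching rule, is our assumption (it sums the counts of the two subtrees and ignores the values of the call and return symbols).

```ocaml
(* Semantic actions for the counting example. *)
let f1 _v1 v2 = 1 + v2            (* L -> c L : one more c *)
let f2 _v1 v2 _v3 v4 = v2 + v4    (* L -> <a L b> L : assumed action *)
let f3 () = 0                     (* L -> epsilon *)

(* Counting c's in "c <a c b>" by evaluating the actions bottom-up: *)
let n = f1 0 (f2 0 (f1 0 (f3 ())) 0 (f3 ()))   (* n = 2 *)
```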
As an application, the next subsection shows how to use semantic actions to convert the parse trees of a VPG to the parse trees of its original tagged CFG.
7.2 Translating from Tagged CFGs to VPGs
Grammar writers are already familiar with CFGs, the basis of many parsing libraries. We define tagged CFGs to be CFGs paired with information about how to partition terminals into plain, call, and return symbols (\(\Sigma =\Sigma _{\text{plain}}\cup \Sigma _{\text{call}}\cup \Sigma _{\text{ret}}\)); that is, in a tagged CFG, a terminal is tagged with information about what kind of symbol it is. Compared to a regular CFG, the only additional information in a tagged CFG is the tagging information; therefore, tagged CFGs provide a convenient mechanism for reusing existing CFGs and developing new grammars in a notation that grammar writers are familiar with.
Not all tagged CFGs can be converted to VPGs. We use a conservative validator to determine whether a tagged CFG can be converted to a VPG and, if the validator passes, translate the tagged CFG to a VPG. Without loss of generality, we assume that in each rule of the input tagged CFG every call symbol is matched with a return symbol. In the general case, we need two additional steps: (1) we first ensure that pending call and return symbols are not nested within well-matched call/return symbols; (2) we then change the tags of the pending call and return symbols to plain, making the resulting grammar ready for the validator.
At a high level, a tagged CFG is first translated to a simple form, upon which validation is performed. If validation passes, then the simple-form CFG is translated to a linear-form CFG, which is finally translated to a VPG. We next detail these steps and illustrate them with a running example in Figure 7. In the example, the left-most column lists the original tagged CFG; note that in the grammar, semantic actions appear after the \(@\) symbol (e.g., \(L^6\)); they are discussed later in this section.
Fig. 7. An example of converting a tagged CFG to a VPG.
Definition 7.1 (Simple Forms).
A rule is in the simple form if it is of the form \(L\rightarrow \epsilon\) or of the form \(L\rightarrow q_1\cdots q_k\), where \(k\ge 1\) and, for \(i=1..k\), either \(q_i\in \Sigma _{\text{plain}}\cup V\) or \(q_i={‹ {a}}L_i {b› }\) for some \({‹ {a}}\), \({b› }\), and nonterminal \(L_i\). A tagged CFG \(G=(V,\Sigma ,P,L_0)\) is in the simple form if every rule in P is in the simple form.
Compared to a tagged CFG, a simple-form CFG requires that there must be a nonterminal between a call symbol and its matching return symbol. The conversion from a tagged CFG to a simple-form CFG is straightforward: For each rule, we replace every string \({‹ {a}}s {b› }\), where \({‹ {a}}\) is matched with \({b› }\) and \(s\in (\Sigma \cup V)^*\), with \({‹ {a}}L_s {b› }\) and generate a new nonterminal \(L_s\) and a new rule \(L_s\rightarrow s\). After this conversion, a string in the form of \({‹ {a}}L {b› }\) can be viewed as a “plain symbol”; this is a key intuition for the following steps. We call \({‹ {a}}L {b› }\) a matched token in the following discussion. For the running example, this step of translation extracts \(AE\) from the first rule and assigns it to a new nonterminal \(L_{AE}\).
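A sketch of this rewriting step in OCaml is shown below. The symbol representation, the fresh-name supply, and the choice to always introduce a new nonterminal (even when s is a single nonterminal) are simplifying assumptions.

```ocaml
(* Tagged-CFG -> simple-form step: each well-matched segment <a s b> on a
   right-hand side is replaced by <a L_s b> for a fresh nonterminal L_s
   with a new rule L_s -> s. *)
type sym = Plain of string | Call of string | Ret of string | Nt of string

let fresh =
  let n = ref 0 in
  fun () -> incr n; Nt (Printf.sprintf "L%d" !n)

(* Rewrite one right-hand side; returns the new rhs plus generated rules. *)
let rec simplify (rhs : sym list) : sym list * (sym * sym list) list =
  match rhs with
  | [] -> ([], [])
  | Call a :: rest ->
      (* Find the matching return symbol at nesting depth 0. *)
      let rec split depth acc = function
        | Ret b :: tl when depth = 0 -> (List.rev acc, Ret b, tl)
        | (Call _ as s) :: tl -> split (depth + 1) (s :: acc) tl
        | (Ret _ as s) :: tl -> split (depth - 1) (s :: acc) tl
        | s :: tl -> split depth (s :: acc) tl
        | [] -> failwith "unmatched call symbol"
      in
      let inner, ret, tl = split 0 [] rest in
      let ls = fresh () in
      let inner', rules1 = simplify inner in
      let tl', rules2 = simplify tl in
      (Call a :: ls :: ret :: tl', (ls, inner') :: rules1 @ rules2)
  | s :: tl ->
      let tl', rules = simplify tl in
      (s :: tl', rules)
```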
The validation can then be performed on the simple form, using its dependency graph.
Definition 7.2 (Dependency Graph).
The dependency graph of a grammar \(G=(V,\Sigma ,P,L_0)\) is \((V,E_G)\), where
Note that an edge from L to \(L^{\prime }\) is labeled with a pair of strings.
Definition 7.3 (Dependency Loop).
Let \(G=(V,\Sigma ,P,L_0)\) be a grammar with its associated dependency graph \((V,E_G)\). A dependency loop is defined as a loop present in the dependency graph such that at least one edge in the loop is labeled with \((s_1,s_2)\), where \(s_2\) can derive a nonempty string (i.e., \(s_2\rightarrow ^* w\) for some nonempty string w).
The validator’s main task is to make sure each dependency loop is well founded. Additionally, for non-dependency loops, the validator ensures that they are not solely labeled with \((\epsilon ,\epsilon)\), to prevent infinite loops. Specifically, the validator verifies that for each loop in the dependency graph, either (1) the loop contains an edge \((L,L^{\prime })\) labeled with \((s_1{‹ {a}}, {b› }s_2)\), where \({‹ {a}}\) is matched with \({b› }\) in a rule \(L\rightarrow s_1{‹ {a}}L^{\prime } {b› }s_2\); or (2) every edge \((L,L^{\prime })\) in the loop is labeled with \((s,\epsilon)\) for some string \(s\in (\Sigma \cup V)^*\), and at least one such s satisfies \(s\not\rightarrow ^*\epsilon\). For the running example, the dependency graph of the simple form of the grammar has no loops, so validation passes trivially.
Once the validation passes, the translation converts a simple-form CFG to a linear-form CFG.
Definition 7.4 (Linear Forms).
A rule is in the linear form if it is in one of the following forms: (1) \(L\rightarrow \epsilon\); (2) \(L\rightarrow t_1\cdots t_k\); (3) \(L\rightarrow t_1\cdots t_kL^{\prime }\); where \(t_i\in \Sigma _{\text{plain}}\) or \(\exists {‹ {a}}, {b› },L_i,\text{s.t. }\ t_i={‹ {a}}L_i {b› }\), \(i=1..k\), \(k\ge 1\). A tagged CFG \(G=(V,\Sigma ,P,L_0)\) is in the linear form if every rule in P is in the linear form.
Note that in a linear-form rule, \(t_i\) cannot be a nonterminal, while in a simple-form rule \(q_i\) can be a nonterminal. Further, the linear form allows rules of the form \(L\rightarrow t_1\cdots t_kL^{\prime }\), where \(t_i\) is a terminal or a matched token. The main job of the translator is to convert simple-form rules to linear-form rules by strategically replacing nonterminals with the right-hand sides of their rules. Appendix G shows the translation algorithm and its termination proof. Compared to the conference version where no termination proof was provided, this version of the translation algorithm has a new reformulation, which enables the termination proof to go through. Figure 7 shows the conversion result of the running example. For this example, the conversion is simple: Nonterminal A is replaced by \(cE\) and eventually replaced by c in the first two rules.
The translation from a linear-form CFG to a VPG is simple. E.g., a rule of the form \(L\rightarrow t_1\cdots t_k\) is translated to \(L \rightarrow t_1L_1;\ L_1 \rightarrow t_2 L_2;\ \ldots ;\ L_{k-1} \rightarrow t_kL_k;\ L_k \rightarrow \epsilon\), where \(L_1\) to \(L_k\) are new nonterminals. For the running example, the first rule in the linear form of the grammar is split into two rules, as shown in the last column of Figure 7.
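The following OCaml sketch performs this splitting; the representation of tokens and VPG rules is an assumption.

```ocaml
(* Linear-form -> VPG step: a rule L -> t1 ... tk is split into a chain
   of rules with fresh nonterminals, so each VPG rule emits exactly one
   terminal (or matched token). *)
type tok = string                     (* a terminal or a matched token *)
type vpg_rule = { lhs : string; tok : tok option; next : string option }

let fresh =
  let n = ref 0 in
  fun base -> incr n; Printf.sprintf "%s_%d" base !n

let to_vpg (l : string) (ts : tok list) : vpg_rule list =
  let rec go lhs = function
    | [] -> [ { lhs; tok = None; next = None } ]        (* L_k -> eps *)
    | t :: tl ->
        let l' = fresh l in
        { lhs; tok = Some t; next = Some l' } :: go l' tl
  in
  go l ts

(* to_vpg "L" ["t1"; "t2"] yields
   L -> t1 L_1;  L_1 -> t2 L_2;  L_2 -> eps *)
```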
All transformations are local rewriting of rules, and as a result it is easy to show that each transformation step preserves the set of strings the grammar accepts; further, for an input string there should be a one-to-one correspondence between parse trees produced by the grammar before transformation and parse trees produced by the grammar after transformation. We further note that not all tagged CFGs can be converted to VPGs. For example, grammar “\(L\rightarrow cLc|\epsilon\)” cannot be converted, since its terminals cannot be suitably tagged: Intuitively, c has to be both a call and a return symbol. Further, since our validation algorithm is conservative, it rejects some tagged CFGs that have VPG counterparts. For example, grammar “\(L\rightarrow Lc|\epsilon\)” is rejected by the validator, since it is left recursive. However, it can be first refactored to “\(L\rightarrow cL|\epsilon\)”, which is accepted by our validator.
Generating semantic actions. During the conversion, each time the translator rewrites a rule, a corresponding semantic action is attached to the rule. Initially, every rule is attached with one default semantic action. For example, the rule \(L\rightarrow AbCd\) is attached with \(L^4\), written as \(L\rightarrow AbCd \ @L^4.\) As mentioned in Section 7.1, \(L^4\) is the default semantic action for constructing a tree with a root node and children nodes that are constructed from semantic values from the right-hand side of the rule. The superscript 4 is its arity. During conversion, every time we rewrite a nonterminal L in a rule R with the right-hand side of rule \(L \rightarrow s\), the semantic values for s are first combined to produce a semantic value for L, which is then used to produce the semantic value for the left-hand nonterminal of R. If a helper nonterminal \(L_s\) is introduced during conversion and a rule \(L_s \rightarrow s\) is generated, then we do not generate a semantic value for \(L_s\) but leave the semantic values for s on the stack so any rule that uses \(L_s\) can use those semantic values directly. In this way, we can convert a parse tree of a VPG to the parse tree of its corresponding tagged CFG.
Back to the example in Figure 7, during the conversion to the simple form, the semantic action of the first rule does not change, since the rule for \(L_{AE}\) has no semantic actions; the semantic values of A and E are left on the stack. Thus, \(L^6\) still expects six values on the stack. The translation step to the linear form expands A in the first rule with \(A\rightarrow cE @A^2\). Then, \(cE{‹ {a}} L_{AE} {b› }E\) is simplified to \(c{‹ {a}} L_{AE} {b› }E\), and that is why \(A^1\) is applied instead of \(A^2\): \(A^1\) accepts the value for c. The same transformation is applied to the second rule. The translation step to the VPG is straightforward.
As a concrete example of parsing using the VPG in Figure 7, for the input string \(c{‹ {a}}c {b› }\), the parser will generate the following parse tree:
Each rule in the above parse tree is then replaced with its attached action and the semantic values of the terminals.
And the evaluation result of the above stack machine is the following parse tree of the tagged CFG:
The above tree can be visualized as follows:
8 Evaluation
As discussed earlier, we implemented our VPG parsing library in Coq and mechanized its proofs; we also introduced two optimizations so parsing is performed by running two PDAs (the parser and the extraction PDAs). To evaluate performance, we extracted OCaml code from the Coq library, used it to construct the parser and extraction PDAs with Algorithms 2 and 3, and stored the PDA transitions in hash tables for constant-time lookups. We evaluated our implementation on the following questions: (1) How applicable is VPG parsing in practice? (2) What is the performance of VPG parsing compared with other parsing approaches?
We performed a preliminary analysis of a set of ANTLR4 grammars in a grammar repository [3]. Among all 239 grammars, 136 (56.9%) could be converted to VPGs by our tagged-CFG-to-VPG translation after we manually marked the call and return symbols. Note that this does not mean the rest cannot be converted; e.g., 34 grammars failed because they contain left recursion, and conversion may become possible once the left recursion is removed. We leave a further analysis to future work.
For performance evaluation, we compared our VPG parsers with two representative LL and LR parsers, namely, ANTLR 4 [35] and GNU Bison [18], using their Java backends. ANTLR 4 is a popular parser generator that implements an efficient parsing algorithm called ALL(*) [36]. The ALL(*) algorithm can perform an unlimited number of lookaheads to resolve ambiguity, and it has a worst-case complexity of \(O(n^4)\); however, it exhibits linear behavior on many practical grammars. GNU Bison is a well-known parser generator that implements a class of linear-time LR parsing algorithms [11]. We utilized GNU Bison’s default configuration to build LALR(1) parsers [10] in conjunction with JFlex [41], a Java lexical analyzer generator. We also compared the VPG parsers with a few hand-crafted parsers specialized for parsing JSON and XML documents, including four mainstream JavaScript engines and four popular XML parsers. Before presenting the performance evaluation, we list some general setups:
(1)
During evaluation, we adapted the grammars for JSON, XML, and HTML from ANTLR4 [3] to tagged CFGs, from which we generated VPG parsers. Appendix F shows the tagged CFGs for JSON, XML, and HTML. We also adapted these grammars to LR grammars and generated LR parsers using Bison. We then compared VPG parsers with ANTLR and Bison parsers in terms of performance.
(2)
When comparing our VPG parsers with ANTLR parsers, we omitted the lexing time. This is because we used ANTLR’s lexers to generate the tokens for both VPG parsers and ANTLR parsers.
(3)
Since ANTLR generates CFG parse trees, for an end-to-end comparison, we used semantic actions to convert the parsing results of VPG parsers and Bison parsers to CFG parse trees and measured the conversion time.
We note that the performance of the parsing algorithms presented in this article is very similar to the performance of the algorithms in the conference version, since both utilize two PDAs and the difference in lookup time between PDA transitions is negligible. We adopted the new algorithms primarily to simplify the verification of both correctness and performance.
8.1 Parsing JSON Files
The JSON format allows objects to be nested within objects and arrays; therefore, a JSON object has a hierarchically nesting structure, which can be naturally captured by a VPG. In particular, since in JSON an object is enclosed within “{” and “}” and arrays within “[” and “]”, its VPG grammar treats “{” and “[” as call symbols and “}” and “]” as return symbols.
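To make the tagging concrete, a minimal illustrative JSON fragment (not the exact tagged CFG of Appendix F) might be written as
\begin{align*} J &\rightarrow \texttt{str} \mid \texttt{num} \mid \texttt{\{}\ M\ \texttt{\}} \mid \texttt{[}\ A\ \texttt{]}\\ M &\rightarrow \epsilon \mid \texttt{str}\ \texttt{:}\ J \mid \texttt{str}\ \texttt{:}\ J\ \texttt{,}\ M\\ A &\rightarrow \epsilon \mid J \mid J\ \texttt{,}\ A \end{align*}
with “{” and “[” tagged as call symbols and “}” and “]” as return symbols.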
When building a VPG parser for JSON, we reused ANTLR’s lexer. Therefore, the evaluation steps are as follows:
A downstream application that uses ANTLR’s JSON parser may wish to keep working with the same parsing result produced by ANTLR’s parser. Therefore, we implemented a converter from the parse forest produced by our VPG parser to ANTLR’s parse tree for the input files. When the grammar is unambiguous, which is the case for the JSON and XML grammars, the parse forest is really the encoding of a single parse tree. The algorithm for converting a VPG parse tree to a stack machine and evaluating the stack machine was discussed in Section 7. The result of the evaluation is a structure that can be directly printed out and compared; the same applies to the parse trees produced by the ANTLR and Bison parsers. The conversion steps are summarized as follows:
Note that in practice this conversion may not be necessary. A downstream application can directly work on the VPG parse tree or associate semantic actions with the VPG parse tree to convert the tree to desired semantic values. We include the conversion time for our VPG parser so the parsing result can be compatible with legacy downstream applications.
For evaluation, we collected 23 real-world JSON files from the Awesome JSON Datasets [12], the Native JSON Benchmark [43], and the JSON.parse benchmark [19]; the sizes of the files range from 14 KB to 7 MB. Figure 8 shows the parsing time of the ANTLR and Bison parsers, as well as the VPG parser’s parsing plus extraction time, conversion time, and the total (parsing plus extraction plus conversion) time. Note that both the x-axis and the y-axis of the figure (and of the other figures in this section) are in log scale for better visualization. Table 1 shows the parsing time for the five largest files in our test set, as well as the lexing time; Appendix H shows the results for the full test set. All parsers exhibit linear-time performance when parsing JSON files. As can be seen, the VPG parsing time (“VPG Parse + Extract”) is less than the time taken by the ANTLR and Bison parsers; the total time is in general also less than that of the ANTLR parser and comparable to that of the Bison parser. For larger files, we observe that the conversion time dominates; we believe this time is largely influenced by the OCaml garbage collector and defer optimizing the conversion to future work.
Fig. 8. Parsing times of JSON files (in log scale). “Guide Line” indicates the slope for linear time complexity.
Table 1. Parsing and Lexing Times of JSON Files

| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| catalog | 135,991 | 54 ms | 20 ms | 7 ms | 17 ms | 24 ms | 40 ms | 39 ms |
| canada | 334,374 | 104 ms | 50 ms | 17 ms | 52 ms | 69 ms | 47 ms | 39 ms |
| educativos | 426,398 | 92 ms | 41 ms | 20 ms | 48 ms | 69 ms | 109 ms | 75 ms |
| airlines | 555,410 | 107 ms | 58 ms | 30 ms | 68 ms | 98 ms | 88 ms | 91 ms |
| JSON.parse | 1,288,350 | 225 ms | 128 ms | 72 ms | 162 ms | 234 ms | 142 ms | 141 ms |
8.2 Parsing XML Files
XML also has a well-matched nesting structure with explicit start-tags such as <p> and matching end-tags such as </p>. However, compared to JSON, there is an additional complexity for the XML grammar, which makes it necessary to adapt the XML grammar provided by ANTLR. In particular, the XML lexer in ANTLR treats an XML tag as separate tokens; e.g., <p> is converted into three tokens: <, p, and >. Those tokens then appear in the ANTLR XML grammar. Part of the reason for this design is that the XML format allows additional attributes within a tag; e.g., <p id=1> is a start-tag with an attribute with name id and value 1. Below is a snippet of the related XML grammar in ANTLR.
Note that the attribute information can be specified by regular expressions and therefore be provided by lexers. To compare VPG parsers with ANTLR and Bison parsers and to expose the nesting structure within XML, we modified the grammar by grouping tag components. The following shows a snippet of our adapted XML grammar:
The above snippet introduces three new tokens, as declared below.
Note that the above grammar is used for ANTLR, Bison, and our VPG parser generator, and we do not need the VPG lexer used in the conference version of this article.
For evaluation, we used the real-world XML files provided by the VTD-XML benchmarks [42], which consist of a wide selection of 23 files ranging from 1 KB to 73 MB. The parsing times are presented in Figure 9; the times for the five largest files are presented in Table 2; the conversion times are shown in the “VPG Conv” column. Appendix H shows the results for the full test set. As with JSON, VPG parsing (“VPG Parse + Extract”) on XML files is in general faster than ANTLR and Bison parsing; the total time is also less for smaller files.
Fig. 9. Parsing times of XML files (in log scale). “Guide Line” indicates the slope for linear time complexity.
Table 2. Parsing and Lexing Times of XML Files

| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ORTCA | 39,072 | 30 ms | 8 ms | 3 ms | 6 ms | 10 ms | 54 ms | 71 ms |
| SUAS | 118,446 | 54 ms | 31 ms | 6 ms | 21 ms | 27 ms | 88 ms | 131 ms |
| address | 1,437,142 | 254 ms | 232 ms | 74 ms | 283 ms | 357 ms | 201 ms | 250 ms |
| cd | 4,198,694 | 721 ms | 735 ms | 204 ms | 704 ms | 908 ms | 416 ms | 509 ms |
| po | 9,266,526 | 1,360 ms | 1,835 ms | 461 ms | 1,528 ms | 1,989 ms | 854 ms | 1,617 ms |
8.3 Parsing HTML Files
A snippet of the HTML grammar in ANTLR is listed below:
Similar to the XML grammar, the HTML grammar allows self-closing tags such as <br/>. However, the HTML grammar in addition allows optional end tags, which is not allowed in XML. For example, the HTML tag <input type=”submit” value=”Ok”> cannot have a matching end tag according to the HTML standard. Although this kind of tag is also “self-closing,” we will use the terminology of optional end tags, since that is how the official HTML5 standard describes it.
In the conference version of this article, we used the HTML grammar included in ANTLR. To adapt the grammar for VPGs, we combined the components of an HTML tag into a single token and tagged it as a call or return symbol for our VPG parser. The performance comparison between the ANTLR parser and the VPG parser is presented in Figure 10. As can be seen, our VPG parser outperformed the ANTLR parser by more than four orders of magnitude.
Fig. 10. Parsing times of HTML files (in log scale).
We performed an investigation and identified the reason for the huge performance gap. The HTML grammar included in ANTLR does not distinguish different kinds of tags at the grammar level; all tags are modeled by a single nonterminal. The grammar by itself is not practical and causes nondeterminism at the grammar level. As a result, the ANTLR parser cannot determine how tags are matched until all tags have been read, which significantly slows down its parsing. For the same reason, this HTML grammar cannot be refactored to an LR grammar, which is required by Bison.
To address this issue, we changed the grammar to reduce nondeterminism by explicitly modeling the 10 most common tags at the grammar level and requiring their correct matching. The most essential part of the updated HTML grammar is as follows:
For example, we require that table and paragraph tags be correctly matched; other tags are captured by OpenTag and CloseTag. The new tokens are declared as follows:
Similar to XML, the above grammar is used for all parser generators.
Another point is that the HTML grammar is ambiguous. For example, the first rule
can parse a string of scriptletOrSeaWs in different ways. ANTLR’s HTML parser returns the parse tree that prioritizes earlier rules in the grammar; our VPG parser returns the same parse tree as ANTLR’s with a special pickOne function that always picks the first element in the list of partial parse trees.
For evaluation, we used the 19 real-world HTML files provided in ANTLR’s repository [3]; these files are smaller than the JSON and XML documents we used. The parsing times are presented in Figure 11 and Table 3. The conversion times of the parse trees are shown in the “VPG Conv” column. As can be seen, our VPG parser outperforms ANTLR and Bison, which is particularly evident for smaller files. Furthermore, the performance gap between the ANTLR parser and the VPG parser is much narrower after the reduction of grammar nondeterminism, confirming our hypothesis that ANTLR’s parser struggled with the nondeterminism in the original HTML grammar.
Fig. 11. Parsing times of HTML files (in log scale). “Guide Line” indicates the slope for linear time complexity.
Table 3. Parsing and Lexing Times of HTML Files

| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cnn1 | 4,974 | 90 ms | 7 ms | 0.29 ms | 1.02 ms | 1.31 ms | 42 ms | 34 ms |
| reddit2 | 4,976 | 32 ms | 7 ms | 0.28 ms | 1.11 ms | 1.39 ms | 54 ms | 42 ms |
| reddit | 4,989 | 31 ms | 7 ms | 0.27 ms | 1.10 ms | 1.37 ms | 53 ms | 38 ms |
| digg | 6,250 | 127 ms | 9 ms | 0.51 ms | 1.34 ms | 1.85 ms | 52 ms | 38 ms |
| youtube | 16,316 | 49 ms | 12 ms | 1.05 ms | 5.92 ms | 6.97 ms | 55 ms | 59 ms |
Summary of comparison with ANTLR and Bison. Our performance evaluation shows that our VPG parsing library generates parsers that in general run faster than those generated by ANTLR and have performance comparable to the parsers generated by Bison, on grammars that can be converted to VPGs, such as JSON, XML, and HTML.
8.4 Comparison with Hand-crafted Parsers
We also compared VPG parsers with hand-crafted parsers for JSON and XML documents. For JSON, we compared with four mainstream JavaScript engines (V8, Chakra, JavaScriptCore, and SpiderMonkey) and evaluated them on the JSON files discussed in Section 8.1. The four JavaScript engines invoke the JavaScript built-in method JSON.parse to convert a JSON string to a JSON object. For XML, we compared with four popular XML parsers (fast-xml-parser [32], libxmljs [28], sax-js [40], and htmlparser2 [16]) and evaluated them with the XML files discussed in Section 8.2. One caveat of this comparison is that these parsers generate different parsing results. Fast-xml-parser transforms an XML document into a JSON object; libxmljs transforms an XML document into a custom object that represents the tree structure of XML documents. Sax-js and htmlparser2 parse an XML document and execute user-supplied semantic actions; no semantic actions were used in our evaluation. Additionally, for XML documents, the time required to analyze tag components is not included in the VPG parsing time but is included in the ANTLR lexing time; for better comparison, we reported the combined lexing and parsing times.
The evaluation results for the largest files are shown in Tables 4 and 5; the full results are in Appendix H. Note that the hand-crafted parsers can process raw text directly, while our VPG parsers process the tokens generated by ANTLR’s lexers. Therefore, we show separately the lexing time of ANTLR (column “ANTLR Lex”), the parsing time of VPG parsing (column “VPG Parse + Extract”), the total time (column “VPG Total”), and the combined time (column “ANTLR Lex + VPG Total”). From the results, we can see that although the total time of VPG parsing is not the shortest among all parsers, the parsing time (“VPG Parse + Extract”) alone is the shortest. Thus, VPG parsers show promising potential in performance, in addition to offering verified correctness over hand-crafted parsers. The total parsing time can be reduced by replacing ANTLR’s lexer with a faster, custom lexer, since the parsing time of VPG parsers is shorter than the lexing time.
Table 4. Parsing Times of Five Largest JSON Files

| Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | SpiderM | JSCore | V8 | Chakra |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| catalog | 135,991 | 40 ms | 7 ms | 24 ms | 64 ms | 34 ms | 71 ms | 28 ms | 25 ms |
| canada | 334,374 | 47 ms | 17 ms | 69 ms | 115 ms | 57 ms | 68 ms | 34 ms | 44 ms |
| educativos | 426,398 | 109 ms | 20 ms | 69 ms | 177 ms | 71 ms | 421 ms | 45 ms | 49 ms |
| airlines | 555,410 | 88 ms | 30 ms | 98 ms | 186 ms | 74 ms | 95 ms | 42 ms | 56 ms |
| JSON.parse | 1,288,350 | 142 ms | 72 ms | 234 ms | 376 ms | 118 ms | 139 ms | 76 ms | 88 ms |
Table 5. Parsing Times of Five Largest XML Files

| Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | Fast-XML | Libxmljs | SAX-JS | HTMLP2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ORTCA | 39,072 | 54 ms | 3 ms | 10 ms | 64 ms | 138 ms | 91 ms | 665 ms | 89 ms |
| SUAS | 118,446 | 88 ms | 6 ms | 27 ms | 115 ms | 254 ms | 182 ms | 1,214 ms | 169 ms |
| address | 1,437,142 | 201 ms | 74 ms | 357 ms | 558 ms | 584 ms | 196 ms | 1,012 ms | 331 ms |
| cd | 4,198,694 | 416 ms | 204 ms | 908 ms | 1,324 ms | 1,298 ms | 419 ms | 2,103 ms | 735 ms |
| po | 9,266,526 | 854 ms | 461 ms | 1,989 ms | 2,843 ms | 3,278 ms | 897 ms | 6,618 ms | 1,827 ms |
9 Future Work
As noted earlier, the correctness of our VPG-based parser generator is verified in Coq. Correctness means that if the generated parser constructs a parse tree, then it must be a valid parse tree according to the input VPG, and vice versa. However, there are gaps between the Coq formalization of our VPG parser generator and its implementation in OCaml. First, the implementation takes tagged CFGs as input and translates them to VPGs; the translation algorithm’s correctness and termination have not been formally modeled and verified in Coq. This verification would be challenging, as the algorithm depends on a global dependency graph; we leave this substantial effort to future work. Second, the implementation uses efficient data structures for performance, while their Coq models use equivalent data structures that are slower but easier to reason about. For example, the OCaml implementation stores the transition tables of the two PDAs in hash tables for efficient search (with \(O(1)\) search complexity), while the Coq counterpart uses a balanced tree (with \(O(\log (n))\) search complexity) provided by a Coq library. Therefore, we extracted OCaml code from the Coq implementation of the parser generator and used it to generate the parser and extraction PDAs offline with Algorithms 2 and 3.
Our parsing algorithm requires a VPG as the input grammar. Compared to a CFG, a VPG requires partitioning terminals into plain, call, and return symbols. Some CFGs may not admit such a partitioning; the same terminal may require different stack actions for different input strings. In particular, all languages recognized by VPGs belong to the set of deterministic context-free languages, which is a strict subset of context-free languages (the classic example that separates CFL from DCFL is \(\lbrace a^ib^jc^k\,\mid \,i \ne j \vee j \ne k\rbrace\)). We plan to extend our preliminary study on ANTLR grammars to understand how much of the syntax of practical computer languages (e.g., programming languages and file formats) can be described by VPGs.
Our framework requires two kinds of refactoring. First, as discussed in Section 8, we refactored the XML and HTML grammars by grouping the components of a tag as a single token. In general, given a grammar, we can combine tokens in the grammar as long as the combined token follows a regular expression pattern. This refactoring can be easily achieved by a lexer generator such as JFlex [41]. To reduce user burden, we intend to provide an additional operator in tagged CFGs for grouping tokens; the user can further mark a combined token as a call or return symbol. Second, we must eliminate left recursion because it is disallowed by our validator for tagged CFGs. Eliminating the commonly used direct left recursion is simple [36] and should be supported by tagged CFGs. In general, we believe that the refactoring effort required by our framework is less than the LR(k) refactoring effort required to resolve shift-reduce and reduce-reduce conflicts. We plan to compare refactoring efforts in future work.
Error information is essential for assisting users in identifying errors in grammars and input strings. When a parsing error occurs, our VPG parser outputs the current configuration, which includes the parser PDA’s state and stack. The user is able to view the most recently used rules as well as their context rules, which we frequently found helpful in locating the issue. Furthermore, with the default “pickOne” function, the extraction PDA is verified to run without error. For custom “pickOne” functions, users can provide their own error information. In future work, we plan to evaluate the error reporting of VPG parsers and compare it with that of LL and LR parsers.
The translation algorithm from tagged CFGs to VPGs is sound but not complete. In general, it is an open problem to determine whether a CFG can be translated to a VPG, and to infer the call and return symbols automatically.
10 Conclusions
In this article, we present a recognizer and a parser generator for visibly pushdown grammars with formally verified correctness and time complexity, where the parsing algorithm is largely inspired by the recognizer. We also provide a surface grammar called tagged CFGs and a translator from tagged CFGs to VPGs. We show that when a format can be modeled by a VPG and its call and return symbols can be identified, VPG parsing provides competitive performance and sometimes a significant speedup.
Acknowledgments
The authors would like to thank anonymous reviewers for their detailed, constructive comments.
Footnotes
•
For instance, the XML grammar is a VPG if a whole XML tag is treated as a terminal symbol; this requires a lexer that returns XML tags as tokens.
•
We note that our implementation of tagged CFGs additionally supports regular operators in the rules; these regular operators can be easily desugared and we omit their discussion.
Lemma A.3. If \((S,T)\leadsto w\), then \(\exists (L_1,L_2)\in S\), s.t. \((\lbrace (L_1,L_2)\rbrace , T)\leadsto w\).
Proof.
By definition, if \(T=\bot\), then
Then, by definition, again we have the lemma.
Otherwise, \(T=[S^{\prime },{‹ {a}}]\cdot T^{\prime }\) and \(w=w_1 {b› }w_2\). By definition, again we have the lemma.□
Theorem A.4.
Assume \(\delta _c(S)=(S^{\prime },\lambda T.T)\) for a plain symbol c. Then \((S,T)\leadsto cw\) iff \((S^{\prime },T) \leadsto w\).
Proof.
\(\Rightarrow\). By case over \((S,T)\leadsto cw\).
(1)
\(T=\bot\) and \((S,\bot)\leadsto cw\). By definition, \(\exists (L_1,L_2)\in S \wedge L_2\rightarrow ^{*}cw\). By the forms of VPG rules, we must have
Since \((L_1,L_2)\in S\) and \(L_2\rightarrow cL_3\), by the definition of \(\delta _c\) we have \((L_1,L_3)\in S^{\prime }\); thus, since \(L_3\rightarrow ^{*}w\), we have \((S^{\prime },\bot)\leadsto w\).
(2)
\(T=[S_1,{‹ {a}}]\cdot T^{\prime }\) and \((S, [S_1,{‹ {a}}]\cdot T^{\prime })\leadsto cw\). By definition, \(w=w_1 {b› }w_2\), and \(\exists (L_3,L_4)\in S\) s.t.
Thus, by the definition of derivatives, \(\exists L_2, (L_1,L_2)\in S \wedge L_2\rightarrow cL_3\). By Lemma A.2, \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto cw\). By Lemma A.1, \((S,T)\leadsto cw\).□
Theorem A.5.
Assume \(\delta _{{‹ {a}}}S=(S^{\prime },\lambda T.[S,{‹ {a}}]\cdot T)\) for a call symbol \({‹ {a}}\). Then \((S,T)\leadsto {‹ {a}}w\) iff \((S^{\prime }, [S,{‹ {a}}]\cdot T)\leadsto w\).
Proof.
\(\Rightarrow\). By case over \((S,T)\leadsto {‹ {a}}w\).
(1)
\(T=\bot\) and \((S,\bot)\leadsto {‹ {a}}w\). By definition, \(\exists (L_1,L_2)\in S \wedge L_2\rightarrow ^{*}{‹ {a}}w\). Thus, from the forms of VPG rules, we have
thus \((\lbrace (L_3,L_3)\rbrace ,[S,{‹ {a}}]\cdot \bot)\leadsto w_1 {b› }w_2\). Since \((L_3,L_3)\in \delta _{{‹ {a}}}S\), we have \((\delta _{{‹ {a}}}S, [S,{‹ {a}}]\cdot \bot)\leadsto w_1 {b› }w_2=w\).
(2)
\(T=[S_1,{‹ {c}}]\cdot T^{\prime }\) and \((S,[S_1,{‹ {c}}]\cdot T^{\prime })\leadsto {‹ {a}}w\). Then, by definition, we have \(w=w_1 {d› } w_2\), and
Thus, \(\exists L_6,L_7\) s.t. \(L_4\rightarrow {‹ {a}}L_6 {b› }L_7\), \(L_6\rightarrow ^{*}w_{11}\), \(L_7\rightarrow ^{*}w_{12}\) and \(w_1=w_{11} {b› }w_{12}\). Thus, by definition, \((L_6,L_6)\in \delta _{{‹ {a}}}S\), so
where \(w^{\prime }\) satisfies \((\lbrace (L_3,L_7)\rbrace ,[S_1, {‹ {c}}] \cdot T^{\prime })\leadsto w^{\prime }\). From \((\lbrace (L_1,L_5)\rbrace ,T^{\prime })\leadsto w_2\), we know \(w^{\prime }\) can be \(w_{12} {d› } w_2\), so we have
\(\Leftarrow\). By definition, \(\exists (L_3,L_3)\in S^{\prime }\), s.t. \(w=w_1 {b› }w_2\), and
Since \(L_2\rightarrow {‹ {a}} L_3 {b› }L_5 \wedge (\lbrace (L_1,L_5)\rbrace ,T)\leadsto w_2\), by Lemma A.2, we have \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto {‹ {a}} w_1 {b› }w_2={‹ {a}}w\); thus \((S,T)\leadsto {‹ {a}}w\).□
Theorem A.6.
Assume \(\delta _{ {b› }}(S, [S_1,{‹ {a}}])=(S^{\prime },\text{tail})\) for a return symbol \({b› }\). Then \((S, [S_1,{‹ {a}}]\cdot T)\leadsto {b› }w\) iff \((S^{\prime },T)\leadsto w\).
Proof.
\(\Rightarrow\). By definition, \(\exists (L_3,L_4)\in S\) and \((L_1,L_2)\in S_1\), and we must have \(L_4\rightarrow ^{*}\epsilon\), \(L_2\rightarrow {‹ {a}} L_3 {b› }L_5\) and \((\lbrace (L_1,L_5)\rbrace ,T)\leadsto w\). From \(L_4\rightarrow ^{*}\epsilon\) and the forms of allowed VPG rules, we must have \(L_4\rightarrow \epsilon\); therefore, \((L_1,L_5)\in S^{\prime }\). By Lemma A.1, we have \((S^{\prime },T)\leadsto w\).
\(\Leftarrow\). By Lemma A.3, we have \((L_1,L_5)\in S^{\prime } \wedge (\lbrace (L_1,L_5)\rbrace ,T)\leadsto w\), thus \(\exists (L_1,L_2)\in S_1 \wedge (L_3,L_4)\in S \wedge L_4\rightarrow \epsilon \wedge L_2\rightarrow {‹ {a}} L_3 {b› }L_5\), by definition, we have \((S,[S_1,{‹ {a}}]\cdot T)\leadsto {b› }w\).□
Lemma A.7.
Given a VPG \(G=(V,\Sigma ,P,L_0)\), suppose a PDA is generated according to Algorithm 1. Then, for a string \(w\in \Sigma ^*\), \((\lbrace (L_0,L_0)\rbrace ,\bot)\leadsto w\) iff w is accepted by the PDA.
Proof.
\(\Leftarrow .\) If w of length k is accepted by the PDA, then there exists a sequence of configurations \((S_i,T_i), i\in [0..k]\), s.t. (1) \(S_0=\lbrace (L_0,L_0)\rbrace\), \(T_0=\bot\); (2) \((S_{i},T_{i})=\mathcal {F}(w_{i},S_{i-1},T_{i-1}),\ i \in [1..k]\), where \(w_i\) is the ith symbol in w and \(\mathcal {F}\) is the PDA transition function; and (3) \((S_k,T_k)\) is an acceptance configuration.
For each i, perform case analysis over \(w_i\). Suppose \(w_i\) is a plain symbol, denoted as c. By \((S_{i},T_{i})=\mathcal {F}(w_{i},S_{i-1},T_{i-1})\) and the PDA construction, we must have \(\delta _c(S_{i-1}) = (S_i, \lambda T. T)\). By Theorem A.4, we get \((S_{i-1},T_{i-1})\leadsto w_i \ldots w_k\) iff \((S_i,T_i) \leadsto w_{i+1} \ldots w_k\). The cases for when \(w_i\) is \({‹ {a}}\) or \({b› }\) are similar, with the help of Theorems A.5 and A.6.
Combining all steps, we have \((S_0,T_0)\leadsto w\) iff \((S_k,T_k) \leadsto \epsilon\). Since \((S_k,T_k)\) is an acceptance configuration, we have \((S_k,T_k) \leadsto \epsilon\). Therefore, we get \((S_0,T_0)\leadsto w\).
\(\Rightarrow .\) We prove a more general lemma: If \((S_0,T_0)\) is a PDA runtime configuration and \((S_0,T_0) \leadsto w\), then w is accepted by the PDA. We prove it by induction on the length of w.
When the length is zero, we must have \(T_0=\bot\), and there exists \((L_1,L_2) \in S_0\) such that \(L_2 \rightarrow \epsilon\). Therefore, \((S_0,T_0)\) is an acceptance configuration of the PDA.
For the inductive case, suppose \(w=w_1 \ldots w_{k+1}\). Perform case analysis over \(w_1\), and first show that there exists \((S_1,T_1)\) s.t. \((S_1,T_1)=\mathcal {F}(w_1,S_0,T_0)\).
(1)
Suppose \(w_1\) is a plain symbol c and \(\delta _c(S_0)=(S_1,\lambda T.T)\). Since Algorithm 1 is closed under derivatives, we have that \(S_1\) is a PDA state. Let \(T_1=T_0\). Thus, \(\mathcal {F}(w_1,S_0,T_0) = (S_1,T_1)\) by the definition of \(\mathcal {F}\).
(2)
The case of \(w_1\) being a call symbol is similar to the previous case.
(3)
Suppose \(w_1\) is a return symbol \({b› }\). By \((S_0,T_0) \leadsto {b› }w_2 \ldots w_{k+1}\), we have \(T_0\) is not the empty stack and has a top symbol \([S,{‹ {a}}]\). Suppose \(\delta _{ {b› }}(S_0, [S,{‹ {a}}])=(S_1,\text{tail})\). Since Algorithm 1 is closed under derivatives, we have that \(S_1\) is a PDA state. Let \(T_1=\text{tail}(T_0)\). Thus, \(\mathcal {F}(w_1,S_0,T_0) = (S_1,T_1)\) by the definition of \(\mathcal {F}\).
By Theorems A.4, A.5, and A.6, we get \((S_1,T_1) \leadsto w_2 \ldots w_{k+1}\). By the induction hypothesis, \(w_2 \ldots w_{k+1}\) is accepted by the PDA. Therefore, the original string \(w_1 \ldots w_{k+1}\) is also accepted.□
Theorem A.8.
For VPG G and its start symbol \(L_0\), a string \(w\in \Sigma ^*\) is derived from \(L_0\), i.e., \(L_0\rightarrow ^{*}w\), iff w is accepted by the corresponding PDA.
Proof.
By Lemma A.7, w is accepted by the PDA iff \((\lbrace (L_0,L_0)\rbrace , \bot)\leadsto w\), and by definition, we have \((\lbrace (L_0,L_0)\rbrace , \bot)\leadsto w\) iff \(L_0\rightarrow ^{*}w\).□
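Lemma A.7 and Theorem A.8 suggest a simple driver loop. The following sketch assumes the transition tables of Algorithm 1 are given as functions; the names delta_plain, delta_call, delta_ret, and accept are hypothetical stand-ins for the generated tables.

```ocaml
type sym = Plain of string | Call of string | Ret of string

(* Configurations are (S, T); the stack T holds [S, <a] entries. *)
let run
    ~(delta_plain : string -> 'st -> 'st)
    ~(delta_call : string -> 'st -> 'st)
    ~(delta_ret : string -> 'st -> 'st * string -> 'st)
    ~(accept : 'st * ('st * string) list -> bool)
    (s0 : 'st) (w : sym list) : bool =
  let step (s, t) = function
    | Plain c -> (delta_plain c s, t)               (* stack unchanged *)
    | Call a -> (delta_call a s, (s, a) :: t)       (* push [S, <a]    *)
    | Ret b ->
      (match t with
       | top :: rest -> (delta_ret b s top, rest)   (* pop and combine *)
       | [] -> failwith "unmatched return")         (* reject w        *)
  in
  try accept (List.fold_left step (s0, []) w) with Failure _ -> false
```

Each input symbol costs one table lookup and at most one stack operation, which is where the linear running time comes from.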
B Recognizing Strings with Pending Calls/returns
In this section, we extend the work in Section 5 to build PDAs for VPGs that allow pending call or return symbols. In general VPGs, nonterminals are classified into two categories: \(V^0\) for matching well-matched strings and \(V^1\) for strings with pending calls/returns. We write \(V=V^0\cup V^1\) for the set of all nonterminals; \(V^0\) must be disjoint from \(V^1\). The definition also imposes constraints on how \(V^0\) and \(V^1\) nonterminals can be used. E.g., in \(L\rightarrow {‹ {a}}L_1 {b› }L_2\), \(L_1\) must be a well-matched nonterminal (i.e., in \(V^0\)). This constraint excludes a grammar like \(L_1 \rightarrow {‹ {a}}L_2 {b› }L_3; L_2 \rightarrow {‹ {c}} L_4\).
Another major difference is that in \(L\rightarrow a L_1\), the symbol a can be a call/return symbol in addition to being a plain symbol. This makes matching calls and returns more complicated. For example, suppose we have rules: \(L_1 \rightarrow {‹ {a}}L_2\); \(L_2 \rightarrow {b› }L_3 | \epsilon\); \(L_3 \rightarrow \epsilon\). Then the string \({‹ {a}} {b› }\) is accepted, in which case \({b› }\) from \(L_2 \rightarrow {b› }L_3\) matches \({‹ {a}}\) from \(L_1 \rightarrow {‹ {a}}L_2\). String \({‹ {a}}\) is also accepted, in which case \({‹ {a}}\) is a pending call. So, depending on the input string, \({‹ {a}}\) from \(L_1 \rightarrow {‹ {a}}L_2\) may be a matching call or a pending call.
Here, \(L_1, L_3 \in V^1\) and \(L_2, L_4 \in V^0\). For example, \({‹ {a}} {b› }{‹ {a}}{‹ {a}}{‹ {a}} {b› } {b› }\) is in the language recognized by the grammar.
General VPGs to PDAs. The PDA states and stack symbols are the same as before. We generalize the notion of the top of the stack to return the top stack symbol when the stack is non-empty and return None when the stack is empty.
A derivative-based transition function takes the current state and the top of the stack (which can be None) and returns a new state and a stack action. As before, since \(\delta _c\) and \(\delta _{{‹ {a}}}\) do not use the top of the stack, we omit it from their parameters.
Definition B.1 (Derivative Functions for General VPGs).
Given a general VPG \(G=(V,\Sigma ,P,L_0)\), the transition functions \(\delta\) are defined as follows: For \(c\in \Sigma _{\text{plain}}\), \({‹ {a}}\in \Sigma _{\text{call}}\) and \({b› }\in \Sigma _{\text{ret}}\),
(1)
\(\delta _c\) is the same as the well-matched case.
\(\delta _c(S) = (S^{\prime },\lambda T.T)\), where
(2)
For call symbols, we have \(\delta _{{‹ {a}}}(S)= (S^{\prime }\cup S_p,\lambda T.[S,{‹ {a}}]\cdot T)\), where
Compared to the well-matched case, an additional \(S_p\) is introduced for the case when \({‹ {a}}\) appears in a rule like \(L_2\rightarrow {‹ {a}}{L_3}\).
(3)
For a return symbol \({b› }\), if t is the top of the stack, then
where
\(S^{\prime }\) is as before and deals with the case when there is a rule \(L_2\rightarrow {‹ {a}} L_3 {b› }L_5\) with a proper top stack symbol. \(S_{p1}\) deals with the case when there are rules \(L_2\rightarrow {‹ {a}}L_3\) and \(L_4\rightarrow {b› }L_5\); in this case, we match \({b› }\) with \({‹ {a}}\). Finally, \(S_{p2}\) deals with the case when the stack is empty; then \({b› }\) is treated as a pending return symbol (not matched with a call symbol).
For the well-matched case, the stack should be empty after all input symbols are consumed; in the case with pending calls/returns, however, the stack is not necessarily empty at the end. For example, with the grammar \(L\rightarrow {‹ {a}}L\mid \epsilon\) and the valid input string \({‹ {a}}\), the terminal stack is \([\lbrace (L,L)\rbrace ,{‹ {a}}]\cdot \bot\).
Definition B.2 (The Acceptance Configuration for Words with Pending Calls/returns).
Given a general VPG \(G=(V,\Sigma ,P,L_0)\), the pair \((S,T)\) is called an acceptance configuration if the following are satisfied:
either (i) \(T=\bot\) or (ii) \(T= [S^{\prime }, {‹ {a}}] \cdot T^{\prime }\) and \(\exists (L_3,L_4) \in S^{\prime } \wedge (L_4 \rightarrow {‹ {a}}L_1) \in P\) for some \(L_1\).
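Definition B.2 reads directly as a boolean check. The following sketch uses assumed encodings: general-VPG rules \(L \rightarrow s L_1\) are represented by a single constructor, and stacks hold \([S,{‹ {a}}]\) entries.

```ocaml
type nt = string
type rule =
  | Eps of nt                 (* L -> eps                            *)
  | Lin of nt * string * nt   (* L -> s L1, s plain, call, or return *)

type state = (nt * nt) list
type stack = (state * string) list   (* entries [S, <a] *)

let is_acceptance (p : rule list) ((s, t) : state * stack) : bool =
  (* the state must contain a pair whose current nonterminal derives
     eps, as in the well-matched case (this clause is assumed here) *)
  List.exists (fun (_, l2) -> List.mem (Eps l2) p) s
  && (match t with
      | [] -> true                              (* (i)  T = bot         *)
      | (s', a) :: _ ->                         (* (ii) top is [S', <a] *)
        List.exists
          (fun (_, l4) ->
             List.exists
               (function Lin (h, a', _) -> h = l4 && a' = a | _ -> false)
               p)
          s')
```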
In the following correctness proof, we use predicate \(\operatorname{well-matched}(w)\) to mean that w, a string of terminals, is a well-matched string; that is, every call/return symbol is matched with a corresponding return/call symbol. We use predicate \(\operatorname{matched-rets}(w)\) to mean that any return symbol in w is matched with a call symbol; however, a call symbol may not be matched with a return symbol. E.g., we have \(\operatorname{matched-rets}({‹ {a}}{‹ {a}} {b› })\), but not \(\operatorname{well-matched}({‹ {a}}{‹ {a}} {b› })\).
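Both predicates can be decided by a single left-to-right scan that tracks the number of currently pending calls; a sketch follows (recall that in a visibly pushdown alphabet any call can match any return, so a counter suffices).

```ocaml
type sym = Plain of string | Call of string | Ret of string

(* matched_rets w: every return in w is matched by an earlier call. *)
let matched_rets (w : sym list) : bool =
  let depth =
    List.fold_left
      (fun d s ->
         match (d, s) with
         | None, _ -> None                    (* already failed    *)
         | Some d, Call _ -> Some (d + 1)
         | Some 0, Ret _ -> None              (* unmatched return  *)
         | Some d, Ret _ -> Some (d - 1)
         | Some d, Plain _ -> Some d)
      (Some 0) w
  in
  depth <> None

(* well_matched w: additionally, no call is left pending at the end. *)
let well_matched (w : sym list) : bool =
  let rec go d = function
    | [] -> d = 0                             (* no pending calls left *)
    | Plain _ :: rest -> go d rest
    | Call _ :: rest -> go (d + 1) rest
    | Ret _ :: rest -> d > 0 && go (d - 1) rest
  in
  go 0 w
```

On the example above, matched_rets accepts \({‹ {a}}{‹ {a}} {b› }\) (the scan ends with one pending call), while well_matched rejects the same string.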
Definition B.3 (Semantics of PDA Configurations).
We will write \((S, T)\leadsto w\) to mean that w can be accepted by the configuration \((S, T)\). It is defined as follows:
(1)
\((S, \bot)\leadsto w \mbox{ if } \exists (L_1,L_2)\in S,\ \text{s.t. }L_2\rightarrow ^{*}w\),
In the above definition, the third case handles when the call symbol \({‹ {a}}\) in rule \(L_2\rightarrow {‹ {a}}L_3\) matches \({b› }\) in \(w_1 {b› }L_5\) produced by \(L_4\). The last case handles when \({‹ {a}}\) in rule \(L_2\rightarrow {‹ {a}}L_3\) does not have a matched return; that is, it is a pending call.
The following three lemmas and their proofs are the same as before (except that Lemma B.2 requires well-matched strings):
Lemma B.1.
If \((S_1,T)\leadsto w \wedge S_1\subseteq S_2\), then \((S_2,T)\leadsto w\).
Lemma B.2.
If \(L_2\rightarrow ^{*}w_1L_3\), \(\operatorname{well-matched}(w_1)\), and \((\lbrace (L_1,L_3)\rbrace ,T)\leadsto w\), then \((\lbrace (L_1,L_2)\rbrace ,T)\leadsto w_1w\).
Lemma B.3.
If \((S,T)\leadsto w\), then \(\exists (L_1,L_2)\in S\), s.t. \((\lbrace (L_1,L_2)\rbrace , T)\leadsto w\).
In addition, we need the following lemma:
Lemma B.4.
If \(L \rightarrow ^{*}w \delta\), where \(\delta\) is a string of terminals or nonterminals, then either (1) \(\operatorname{matched-rets}(w)\), or (2) there exist \(w_1\), \({b› }\), and \(w_2\) such that \(w=w_1 {b› }w_2\) and \(\operatorname{well-matched}(w_1)\), and there exists \(L_1\) such that \(L \rightarrow ^{*}w_1 {b› }L_1\) and \(L_1 \rightarrow ^{*}w_2 \delta\).
Proof.
Sketch: If \(w=\epsilon\), then \(\operatorname{matched-rets}(\epsilon)\). Otherwise, prove it by induction over the length of the derivation of \(L \rightarrow ^{*}w \delta\) and then perform case analysis over the first derivation step.□
Theorem B.5.
For a plain symbol c, suppose \(\delta _c(S)=(S^{\prime },f)\). Then \((S,T)\leadsto cw\) iff \((S^{\prime },f(T)) \leadsto w\).
The proof is similar to the proof before, except the \(\Rightarrow\) direction has more cases to consider.
Theorem B.6.
For \({‹ {a}}\in \Sigma _{\text{call}}\), suppose \(\delta _{{‹ {a}}}(S)=(S^{\prime },f)\). Then \((S,T)\leadsto {‹ {a}}w\) iff \((S^{\prime }, f(T))\leadsto w\).
The proof is similar to the proof before, except with more cases to consider. The \(\Rightarrow\) direction requires the use of Lemma B.4.
Theorem B.7.
(1)
If \(\delta _{ {b› }}(S, [S_1,{‹ {a}}])=(S^{\prime },\text{tail})\), then \((S, [S_1,{‹ {a}}]\cdot T)\leadsto {b› }w\) iff \((S^{\prime },T)\leadsto w\).
(2)
If \(\delta _{ {b› }}(S,\text{None})=(S^{\prime },\lambda T. T)\), then \((S, \bot)\leadsto {b› }w\) iff \((S^{\prime },\bot)\leadsto w\).
Part (1)’s proof is similar to the one before, except with more cases, some of which need Lemma B.4. Part (2)’s proof is straightforward.
Lemma B.8.
Given a VPG \(G=(V,\Sigma ,P,L_0)\), suppose a PDA is generated according to Algorithm 1. Then, for a string \(w\in \Sigma ^*\), \((\lbrace (L_0,L_0)\rbrace ,\bot)\leadsto w\) iff w is accepted by the PDA.
The lemma can be proved as before, except with more cases.
Theorem B.9.
For VPG G and its start nonterminal \(L_0\), a string \(w\in \Sigma ^*\) is derived from \(L_0\), i.e., \(L_0\rightarrow ^{*}w\), iff w is accepted by the corresponding PDA.
The proof is as before.
C Parser and Extraction PDAs for General VPGs
In this section, we discuss the parser PDA and the extraction PDA for general VPGs. For general VPGs, the definitions of \(\mathrm{headNT}\left(\right)\), \(\mathrm{nextNT}\left(\right)\), and dotted rules of well-matched VPGs also apply; the definitions for plain symbols generalize naturally to pending call and return symbols. We define the derivative functions of the parser PDA as follows:
Definition C.1 (The Derivative Function p for the Parser PDA).
Given a VPG \(G=(V,\Sigma ,P,L_0)\), suppose the current state of the parser PDA is m and the current stack is T. The transition functions \(p_c\), \(p_{{‹ {a}}}\), and \(p_{ {b› }}\) are defined as follows:
(1)
For plain symbols, the derivative function of general VPGs is the same as that for well-matched VPGs; we restate the function here: \(p_c(m)=(m^{\prime },\lambda T.T)\), where
\begin{equation*} m^{\prime }=\lbrace (r^{\prime },\mathrm{nextNT}\left(r\right)\rightarrow c. L_1) \mid (r^{\prime },r)\in m \wedge (\mathrm{nextNT}\left(r\right)\rightarrow c. L_1)\in \dot{P}\rbrace . \end{equation*}
For each pair \((r^{\prime },r)\) in m, the new state keeps the context rule \(r^{\prime }\) and updates the current rule to a rule whose head is \(\mathrm{nextNT}\left(r\right)\) and that derives c (see the sketch after this definition).
(2)
For call symbols, the derivative function of general VPGs is similar to that for well-matched VPGs; we only need to introduce a new condition for pending rules: \(p_{{‹ {a}}}(m)=(m^{\prime },\lambda T.\ m\cdot T)\), where
(3)
For return symbols, we have two more cases to consider: \(p_{ {b› }}(m,\overline{m_{\text{call}}})=(m^{\prime },\text{tail})\), where \(\text{tail}\) is the function that removes the top of a stack and returns the rest of the stack (for the empty stack, \(\text{tail}\) returns the empty stack), and \(\overline{m_{\text{call}}}=\operatorname{head }(T)\) if \(T\ne \bot\), and \(\overline{m_{\text{call}}}=\emptyset\) if \(T=\bot\), and
We remove the top of the stack and construct the new state in three cases, depending on the context rule \(r_1\) for a pair \((r_1,r_2)\) in m. We have already discussed the case where \(r_1\) is a matching rule in Definition 6.4. Below are the two new cases for pending rules.
If the context \(r_1\) is a pending rule, then the new state updates the current rule to a rule that rewrites \(\mathrm{nextNT}\left(r_2\right)\) and generates \({b› }\). Similar to the first case, we have \(\overline{m_{\text{call}}}\ne \emptyset\); the update of the context rule in the new state is similar.
Otherwise, if there is no context rule, then no call symbol is pending, and the \({b› }\) symbol is an unmatched return symbol. In this case, we must have \(\overline{m_{\text{call}}}=\emptyset\). We construct the new state in a way similar to the second case, except that the context rule is None.
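To make Definition C.1 concrete, the following sketch transcribes the plain-symbol case \(p_c\) from the set comprehension displayed in case (1). The dotted-rule representation is our own assumption, and only linear dotted rules are modeled here.

```ocaml
type nt = string
type drule = Dot of nt * string * nt        (* L -> c . L1 *)
type pstate = (drule option * drule) list   (* (context, current) pairs *)

(* nextNT of a dotted rule L -> c . L1 is the nonterminal after the dot *)
let next_nt (Dot (_, _, l1)) = l1

let p_plain (dotted_p : drule list) (c : string) (m : pstate) : pstate =
  List.concat_map
    (fun (r_ctx, r) ->
       List.filter_map
         (function
           | Dot (h, c', _) as r_new when h = next_nt r && c' = c ->
               Some (r_ctx, r_new)       (* keep context, advance rule *)
           | _ -> None)
         dotted_p)
    m
```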
The acceptance configurations for general VPGs also allow the contexts to be pending rules, as shown below.
Definition C.2 (The Acceptance Configuration for the Parser PDA).
Given a VPG \(G=(V,\Sigma ,P,L_0)\), a pair \((m,E)\) is an acceptance configuration for the parser PDA if the state m includes a pair \((r^{\prime },r)\), where \(r^{\prime }\) is either \(\text{None}\) or a pending rule and \(\mathrm{nextNT}\left(r\right)\) derives the empty string \(\epsilon\); i.e., \(\exists (r^{\prime },r)\in m,\ (r^{\prime }=\text{None}\ \vee \ r^{\prime }\text{ is a pending rule})\ \wedge \ (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P\).
Definition C.3 (The Extraction Function \(\operatorname{extract}_\mathrm{init}\)).
We only need to add one more condition to the function \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\) to allow a pending rule as the context; the rest is the same as for well-matched VPGs.
\begin{equation*} \operatorname{extract}^{\mathrm{pre}}_\mathrm{init}(m)=\lbrace r \mid (r^{\prime },r)\in m, (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P \wedge r^{\prime } \text{ is None or a pending rule} \rbrace . \end{equation*}
Note that this also implies that the stack of an acceptance configuration for general VPGs is not necessarily empty.
Definition C.4 (The One-step Extraction Function).
Similar to the case of the function \(\operatorname{extract}^{\mathrm{pre}}_\mathrm{init}\), we only need to add some additional conditions to the function \(\operatorname{extendable}\); the rest is the same as for well-matched VPGs.
\begin{align*} \operatorname{extendable}((r^{\prime },r),{(v,E)})= {\left\lbrace \begin{array}{ll} \text{true}, & \text{if }r^{\prime } \text{ is None or a pending rule}\ \wedge \\ &\qquad \operatorname{head }(E) \text{ is None or a pending rule}\ \wedge \\ & \qquad \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right) \ \vee \\ &r^{\prime } \text{ and } \operatorname{head }(E)\text{ are the same matching rule} \\ & \quad \text{(but with different {dot} positions)}\ \wedge \\ &\qquad ((\operatorname{firstRule}(v) \in \mathcal {P}_{\text{ret}}\ \wedge \ \\ & \qquad \quad (\mathrm{nextNT}\left(r\right)\rightarrow \epsilon)\in P) \ \vee \\ &\qquad \ \ (\operatorname{firstRule}(v) \in \mathcal {P}_{\text{pln}}\cup \mathcal {P}_{\text{call}}\ \wedge \\ & \qquad \qquad \mathrm{nextNT}\left(r\right)=\mathrm{headNT}\left(v\right))),\\ \text{false}, & \text{otherwise}. \\ \end{array}\right.} \end{align*}
Compared with the case of well-matched VPGs, the only difference is that the context \(r^{\prime }\) can be not only \(\text{None}\) but also a pending rule; correspondingly, the stack should be either empty or have a pending rule as its top.
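The \(\operatorname{extendable}\) predicate above transcribes to a boolean function. In the sketch below, the helpers passed as arguments (next_nt_derives_eps, next_matches_head, first_rule) abstract the grammar-dependent checks and are assumptions of the sketch.

```ocaml
type kind = PlainR | CallR | RetR   (* which symbol class a rule generates *)
type rule = { id : int; kind : kind; matching : bool }
(* dot positions are omitted: same_matching_rule ignores them anyway *)

let is_pending r = not r.matching

let same_matching_rule r1 r2 =
  r1.matching && r2.matching && r1.id = r2.id

let extendable
    ~(next_nt_derives_eps : rule -> bool)  (* (nextNT r -> eps) in P ? *)
    ~(next_matches_head : rule -> bool)    (* nextNT r = headNT v ?    *)
    ~(first_rule : rule)                   (* firstRule v              *)
    (r_ctx : rule option) (r : rule) (head_e : rule option) : bool =
  let none_or_pending = function None -> true | Some r' -> is_pending r' in
  (* first disjunct: both contexts are None or pending rules *)
  (none_or_pending r_ctx && none_or_pending head_e && next_matches_head r)
  (* second disjunct: both contexts are the same matching rule *)
  || (match r_ctx, head_e with
      | Some r1, Some r2 when same_matching_rule r1 r2 ->
        (first_rule.kind = RetR && next_nt_derives_eps r)
        || ((first_rule.kind = PlainR || first_rule.kind = CallR)
            && next_matches_head r)
      | _ -> false)
```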
D Correctness Proof of VPG-based Parsing
Given a VPG \(G=(V,\Sigma ,P,L_0)\) and an input string w, let \(V=\operatorname{extract}(\operatorname{forest}(w))\). We call V correct if it includes exactly the set of valid parse trees of w according to G; i.e.,
To prove the above, we first state and prove the invariants of p and \(\operatorname{extract}_{\mathrm{oneStep}}\); the correctness follows as a corollary.
We first discuss the invariants of the parser derivative function p. Intuitively, a rule sequence \([r_1,\ldots ,r_{|w|}]\) that derives w should be included in \(\operatorname{forest}(w)= [m_1, m_2, \ldots , m_{|w|}]\); i.e., there exists \(r_i^{\prime }\) such that \((r_i^{\prime },r_i)\in m_i\) for \(i=1..|w|\). To formalize a rule sequence, we introduce a forward small-step relation in Figure 12. Judgment \(({v},{E}) \stackrel{i}{\rightarrow } ({v+[r]},{E^{\prime }})\) means that the rule sequence v can be extended with a rule r, which generates the input symbol i. Here, “forward” means that the relation generates the input from left to right and “small-step” means that the relation generates one symbol at a time, as opposed to the big-step relation in Figure 4. Figure 13 further defines the transitive closure of the small-step relation; consequently, \(({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v},{E})\) means that v generates w in multiple steps.
Fig. 12.
Fig. 12. The forward small-step parse-tree derivation relation, given a VPG \(G=(\Sigma ,V,P,L_0)\).
Fig. 13.
Fig. 13. The transitive closure of the forward small-step parse-tree derivation relation.
The forward small-step relation is flexible in that it allows combining and splitting parse trees. We introduce a series of lemmas to formalize those operations: Lemma D.1 shows we can concatenate two forward small-step relations, assuming the first one has an empty stack; the cases of nonempty stacks are covered in Lemmas D.2 and D.3. Lemma D.4 shows that we can split a forward small-step relation into two parts and replace the first part with another small-step relation. Lemma D.5 shows we can split a forward small-step relation into two parts, where the first part is also a small-step relation. The proofs of these lemmas proceed directly by induction over the length of w; we omit them here.
The forward small-step relation is equivalent to the big-step relation under certain conditions; we state the equivalence in Lemma D.7. To prove the equivalence, we first introduce the invariant of the forward small-step relation in Lemma D.6, which splits a forward small-step relation into two parts based on the last unmatched rule and considers them separately.
Lemma D.6 (Invariant of the Forward Small-step Relation).
Proof.
Prove it by induction over the length of w. There are two cases of w: If \(w=i\in \Sigma\), then the lemma is straightforward to show; so assume \(w=w^{\prime }i\) and \(w^{\prime }\ne \epsilon\). By the definition of the small-step relation, we have \(\exists v^{\prime }\ r,\ v=v^{\prime }+[r]\) and \(\exists E^{\prime },({[]},{\bot }) \stackrel{w^{\prime }}{\longrightarrow ^*} ({v^{\prime }},{E^{\prime }})\); we apply the induction hypothesis to the small-step relation of \(w^{\prime }\) and have two cases of \(E^{\prime }\).
In the first case, \(E^{\prime }=\bot\) or \(\operatorname{head }(E^{\prime })\) is a pending rule. There are two cases of i.
(1)
\(i\in \Sigma _{\text{plain}}\cup \Sigma _{\text{ret}}\). We have \(E=E^{\prime }\) when \(i\in \Sigma _{\text{plain}}\), and \(E=\operatorname{tail }(E^{\prime })\) when \(i\in \Sigma _{\text{ret}}\). In both cases, either \(E=\bot\) or \(\operatorname{head }(E)\) is a pending rule. For any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\), we extend it to \({\mathrm{nextNT}\left(v^{\prime }\right)} \Downarrow \left({i\hat{w}},{[r]+\hat{v}}\right)\) by definition, and by the induction hypothesis, we have \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w^{\prime }i\hat{w}},{v^{\prime }+[r]+\hat{v}}\right)\).
(2)
\(i\in \Sigma _{\text{call}}\). Then, we have \(E=r\cdot E^{\prime }\). There are two cases of r.
(a)
If r is a pending rule, then for any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\), we first extend it to \({\mathrm{nextNT}\left(v^{\prime }\right)} \Downarrow \left({i\hat{w}},{[r]+\hat{v}}\right)\) by definition, then we have \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w^{\prime }i\hat{w}},{v^{\prime }+[r]+\hat{v}}\right)\) by the induction hypothesis.
(b)
If r is a matching rule, then we construct \(v_1=v^{\prime }\), \(r_{{‹ {a}}}=r\), \(v_2=[]\) and \(w_1=w^{\prime }\), \({‹ {a}}=i\), \(w_2=\epsilon\). We already have \(({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{E^{\prime }})\). Since both \(v_2\) and \(w_2\) are empty, \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right) \Rightarrow {\mathrm{nextNT}\left(r_{{‹ {a}}}\right)} \Downarrow \left({w_2\hat{w}},{v_2+\hat{v}}\right)\) is trivial to show.
In the second case, \(\operatorname{head }(E^{\prime })\) is a matching rule, and we have
There are three cases of i.
(1)
\(i\in \Sigma _{\text{plain}}\). Then, we have
We construct \(v_1=v_1^{\prime }\), \(r_{{‹ {a}}}=r_{{‹ {a}}^{\prime }}\), \(v_2=v_2^{\prime }+[r]\), and \(w_1=w_1^{\prime }\), \({‹ {a}}={‹ {a}}^{\prime }\), \(w_2=w_2^{\prime }i\). We only need to prove the last clause. For any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\), we extend it to \({\mathrm{nextNT}\left(v^{\prime }\right)} \Downarrow \left({i\hat{w}},{[r]+\hat{v}}\right)\) by definition, and by the induction hypothesis, we have \({\mathrm{nextNT}\left(r_{{‹ {a}}^{\prime }}\right)} \Downarrow \left({w_2^{\prime }i\hat{w}},{v_2^{\prime }+[r]+v^{\prime }}\right)\).
(2)
\(i\in \Sigma _{\text{call}}\). We construct \(v_1=v^{\prime }\), \(r_{{‹ {a}}}=r\), \(v_2=[]\), and \(w_1=w^{\prime }\), \({‹ {a}}=i\), \(w_2=\epsilon\). The rest of the proof is straightforward.
(3)
\(i\in \Sigma _{\text{ret}}\). Then, we have \(E^{\prime }=r_{{‹ {a}}^{\prime }}\cdot E\). For any big-step \({\mathrm{nextNT}\left(v\right)} \Downarrow \left({\hat{w}},{\hat{v}}\right)\), we first construct the big-step \({\mathrm{headNT}\left(r_{{‹ {a}}^{\prime }}\right)} \Downarrow \left({{‹ {a}}^{\prime }w_2^{\prime }i\hat{w}},{[r_{{‹ {a}}^{\prime }}]+v_2^{\prime }+[r]+\hat{v}}\right)\) by definition. Now, there are two cases of E.
(a)
\(E=\bot\) or \(\operatorname{head }(E)\) is a pending rule. If \(w_1^{\prime }=\epsilon\), then we are done. Otherwise, we apply the induction hypothesis to the small-step relation of \(w_1^{\prime }\) and use it to extend the big-step relation to
(b)
\(\operatorname{head }(E)\) is a matching rule. We apply the induction hypothesis to the small-step relation of \(w_1^{\prime }\), and have
So, we have
By the induction hypothesis, we extend the big-step relation to
□
Lemma D.7 (Relating Small-step and Big-step Relations).
Proof.
\(\Rightarrow\). Prove it by induction over the big-step relation \({L} \Downarrow \left({w},{v}\right)\).
(1)
\(w=i w^{\prime }\), \(v=[r]+v^{\prime }\), and \({\mathrm{nextNT}\left(r\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\). The case of \(w^{\prime }=\epsilon\) is trivial, so assume \(w^{\prime }\ne \epsilon\). Apply the induction hypothesis to \(w^{\prime }\) and we have
(a)
\(i\in \Sigma _{\text{call}}\). We construct the small-step relation \(({[]},{\bot }) \stackrel{i}{\longrightarrow ^*} ({[r]},{r\cdot \bot })\) by definition, and by Lemma D.2, we append \(v^{\prime }\) to \([r]\) and have \(({[]},{\bot }) \stackrel{i w^{\prime }}{\longrightarrow ^*} ({[r]+v^{\prime }},{E^{\prime }\cdot [r]})\) or \(({[]},{\bot }) \stackrel{i w^{\prime }}{\longrightarrow ^*} ({[r]+v^{\prime }},{E^{\prime }})\).
(b)
\(i\in \Sigma _{\text{plain}}\cup \Sigma _{\text{ret}}\). We construct the small-step relation \(({[]},{\bot }) \stackrel{i}{\longrightarrow ^*} ({[r]},{\bot })\) by definition, and by Lemma D.1, we append \(v^{\prime }\) to \([r]\) and have \(({[]},{\bot }) \stackrel{i w^{\prime }}{\longrightarrow ^*} ({[r]+v^{\prime }},{E^{\prime }})\).
(2)
\(w={‹ {a}}w_1 {b› }w_2\), \(v=[r_1]+v_1+[r_2]+v_2\), and \({\mathrm{nextNT}\left(r_i\right)} \Downarrow \left({w_i},{v_i}\right)\), \(i=1,2\). We first construct the small-step relation \(({[]},{\bot }) \stackrel{{‹ {a}}}{\longrightarrow ^*} ({[r_1]},{r_1\cdot \bot })\). If \(w_1\ne \epsilon\), then we apply the induction hypothesis to \(w_1\) and have the small-step relation \(({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{\bot })\) (note that \(\mathrm{headNT}\left(v_1\right)\in V^0\)). Then, by Lemma D.3, we have \(({[]},{\bot }) \stackrel{{‹ {a}}w_1}{\longrightarrow ^*} ({[r_1]+v_1},{r_1\cdot \bot })\); we extend it to \(({[]},{\bot }) \stackrel{{‹ {a}}w_1 {b› }}{\longrightarrow ^*} ({[r_1]+v_1+[r_2]},{\bot })\) by definition. Finally, if \(w_2\ne \epsilon ,\) then we apply the induction hypothesis to \(w_2\) and have the small-step relation of \(w_2\). By Lemma D.4, we have \(({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({[r_1]+v_1+[r_2]+v_2},{\bot })\).
\(\impliedby\). This is a corollary of Lemma D.6.□
We describe the specification of the parser derivative functions in Lemma D.8.
Lemma D.8 (Specification of Parser Derivative Functions).
Now, we give the invariants of the parser PDA. Lemma D.9 states that the forward small-step relation for input \(w_{1} \ldots w_{i}\) is represented in \(m_i\), for all \(i \in [1..|w|]\), where \(m_i\) is the parser PDA state after processing \(w_i\). Lemma D.10 states that each pair in \(m_i\) represents a forward small-step relation. In these lemmas, we write \(\mathcal {P}\) for the transitive closure of the parser transition function (Definition D.1) and use a helper function \(\operatorname{split}(v,E,w)=(v_1,v_2,w_1,w_2)\), which splits v into \(v=v_1+[r_{{‹ {a}}}]+v_2\) and w into \(w=w_1{‹ {a}}w_2\), where \(r_{{‹ {a}}}\) is the rule that generates the last unmatched call symbol \({‹ {a}}\). In particular, if w does not include unmatched call symbols, then \(\operatorname{split}(v,E,w)=([],v,[],w)\).
Definition D.1 (Transition Closure of the Parser PDA).
Given a string w, we define the relation \(\mathcal {P}({m_0},{\bot },{w}) = ({m},{T})\), meaning that starting from \((m_0,\bot)\), running the parser PDA on w ends in the configuration \((m,T)\).
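Definition D.1 is simply a fold of the one-symbol transition over the input; a sketch follows, with the three derivative functions of Definition C.1 passed in as (assumed) parameters.

```ocaml
type sym = Plain of string | Call of string | Ret of string

let run_parser
    ~(p_plain : string -> 'm -> 'm)
    ~(p_call : string -> 'm -> 'm)
    ~(p_ret : string -> 'm -> 'm option -> 'm)   (* gets head of T *)
    (m0 : 'm) (w : sym list) : 'm * 'm list =
  let step (m, t) = function
    | Plain c -> (p_plain c m, t)
    | Call a -> (p_call a m, m :: t)        (* push the current state   *)
    | Ret b ->
      (match t with
       | top :: rest -> (p_ret b m (Some top), rest)   (* pop the top  *)
       | [] -> (p_ret b m None, []))        (* pending return: T = bot *)
  in
  List.fold_left step (m0, []) w
```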
Lemma D.9 (Invariants of the Parser PDA, Part 1).
Proof.
There are two cases of w: If \(w=i\in \Sigma\), then we can check the lemma straightforwardly. Next, assume \(w=w^{\prime }i\) and \(w^{\prime }\ne \epsilon\). Then, we have \(v=v^{\prime }+[r]\). We invert \(({[]},{\bot }) \stackrel{w}{\rightarrow } ({v},{E})\) and have \(({[]},{\bot }) \stackrel{w^{\prime }}{\rightarrow } ({v^{\prime }},{E^{\prime }})\). There are three cases of i.
(1)
\(i \in \Sigma _{\text{plain}}\). Then \(E=E^{\prime }\). By the induction hypothesis, we have \((\operatorname{head }(E^{\prime }),\operatorname{lastRule}(v^{\prime }))\in m^{\prime }\), where \((m^{\prime },T)=p_{w^{\prime }}(m_0,\bot)\). By Lemma D.8, we have \((\operatorname{head }(E),\operatorname{lastRule}(v))\in m\). If \(\operatorname{head }(E)=\text{None}\), i.e., \(E=\bot\), we are done. Otherwise, let \(\operatorname{split}({v^{\prime }},{E^{\prime }},{w^{\prime }})=(v_1^{\prime },v_2^{\prime },w_1^{\prime },w_2^{\prime })\). Clearly, we have \(v_1=v_1^{\prime }\) and \(w_1=w_1^{\prime }\), so we have \(\exists t\ T^{\prime },\ T=t::T^{\prime }\wedge \mathcal {P}({m_0},{\bot },{w_1}) = ({t},{T^{\prime }})\).
(2)
\(i \in \Sigma _{\text{call}}\). Then \(E=r\cdot E^{\prime }\), \(w_1=w^{\prime }\) and \(v_1=v^{\prime }\). By Lemma D.8, \((r,r)\in m\) and \(\exists T^{\prime },\ T=m^{\prime }::T^{\prime }\wedge \mathcal {P}({m_0},{\bot },{w_1}) = ({m^{\prime }},{T^{\prime }})\).
(3)
\(i \in \Sigma _{\text{ret}}\). Let \(\operatorname{split}({v^{\prime }},{E^{\prime }},{w^{\prime }})=(v_1^{\prime },v_2^{\prime },w_1^{\prime },w_2^{\prime })\). There are two cases of \(E^{\prime }\).
(1)
\(E^{\prime }=\bot\). Then, we have \(E=\bot\), and by Lemma D.8 and the induction hypothesis, we have \((\text{None}, \operatorname{lastRule}(v))\in m\).
(2)
\(\exists r_{{‹ {a}}},E^{\prime }=r_{{‹ {a}}}\cdot E\). Then, we have \(r_{{‹ {a}}}\in E^{\prime }\), and by Lemma D.8 and the induction hypothesis, we have \((\operatorname{head }(E), \operatorname{lastRule}(v))\in m\). If \(\operatorname{head }(E)=\text{None}\), then we are done. Otherwise, we have \(w_1^{\prime }\ne \epsilon\). We apply the induction hypothesis to \(w_1^{\prime }\) and finish the proof.□
Lemma D.10 (Invariants of the Parser PDA, Part 2).
Proof.
There are two cases of w: If \(w=i\in \Sigma\), then we can check the lemma straightforwardly. Next, assume \(w=w^{\prime }i\) and \(w^{\prime }\ne \epsilon\). Let \(\mathcal {P}({m_0},{\bot },{w^{\prime }}) = ({m^{\prime }},{T^{\prime }})\).
There are three cases of i.
(1)
\(i \in \Sigma _{\text{plain}}\). For each \((r_1,r)\) in m, we have \(\exists r^{\prime },({r_1},r^{\prime })\in m^{\prime }\) and \(\mathrm{nextNT}\left(r^{\prime }\right)=\mathrm{headNT}\left(r\right)\). The lemma is then directly from the definition and the induction hypothesis applied to \(w^{\prime }\).
(2)
\(i \in \Sigma _{\text{call}}\). For each \((r_1,r)\) in m, we have \(\exists r^{\prime },(r_2,r^{\prime })\in m^{\prime }\) and \(\mathrm{nextNT}\left(r^{\prime }\right)=\mathrm{headNT}\left(r\right)\). By Lemma D.8, \(r_1=r\) and \((r_1,r_1)\in m\); by the induction hypothesis applied to \(w^{\prime }\), \(\exists H,({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v^{\prime }+[r_1]},{r_1::H})\).
(3)
\(i \in \Sigma _{\text{ret}}\). There are two cases of \(r_1\).
(a)
\(r_1=\text{None}\). Then \(T=\bot\) and by Lemma D.8 and the induction hypothesis, we have \(({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v+[r]},{\bot })\).
(b)
\(r_1\ne \text{None}\). Then \(\exists t,\ T=t\cdot T^{\prime }\), \(\exists r_3,(r_1,r_3)\in t\), \(\exists r_4\ r_5,(r_4,r_5)\in m^{\prime }\), and \(\mathrm{nextNT}\left(r_3\right)=\mathrm{headNT}\left(r_4\right)\). By Lemma D.8, we have \(\exists H,({[]},{\bot }) \stackrel{w}{\longrightarrow ^*} ({v^{\prime }+[r]},{r_1::H})\).
□
Our next step is to introduce the invariants of extraction functions, which build parse trees in a backward way. To formalize the parse trees built by the backward extraction process, we introduce the backward small-step relation in Figure 14. This relation is similar to the forward one, only in the reverse direction. The corresponding transitive closure is defined in Figure 15.
Fig. 14.
Fig. 14. The backward small-step parse-tree derivation, given a VPG \(G=(\Sigma ,V,P,L_0)\).
Fig. 15.
Fig. 15. The transitive closure of the backward small-step relation.
Similar to the forward small-step relation, the backward small-step relation is equivalent to the big-step relation under some conditions. We state the equivalence in Lemma D.13. To prove the lemma, we first give the invariants of the backward small-step relation in Lemma D.11.
Lemma D.11 (Invariants of the Backward Small-step Relation).
Proof.
Prove it by induction over the length of w.
There are two cases of w: If \(w=i\in \Sigma\), then we can prove the lemma straightforwardly; so assume \(w=iw^{\prime }\) and \(w^{\prime }\ne \epsilon\) and we have \(v=[e]+v^{\prime }\). From \(({[]},{\bot }) \stackrel{w}{\rightsquigarrow ^*} ({v},{E})\), we have \(\exists E^{\prime },\ ({[]},{\bot }) \stackrel{w^{\prime }}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\). There are three cases of i.
(1)
\(i\in \Sigma _{\text{plain}}\). Then, we have \(E=E^{\prime }\). If \(\operatorname{head }(E)\) is None or a pending rule, then we prove the lemma by extending \({\mathrm{headNT}\left(v^{\prime }\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw^{\prime }},{[e]+v^{\prime }}\right)\). So, assume \(\operatorname{head }(E)\) is a matching rule. Let \(v^{\prime }=v_1+[r]+v_2\), we prove the lemma by extending \({\mathrm{headNT}\left(v_1\right)} \Downarrow \left({w_1},{v_1}\right)\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw_1},{[r]+v_1}\right)\).
(2)
\(i\in \Sigma _{\text{call}}\). If \(\operatorname{head }(E)\) is None or a pending rule, then we prove the lemma by extending the big-step relation of \(v^{\prime }\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw^{\prime }},{[e]+v^{\prime }}\right)\). So, assume \(\operatorname{head }(E)\) is a matching rule. Let \(v^{\prime }=v_1^{\prime }+[r^{\prime }]+v_2^{\prime }\) and \(w^{\prime }=w_1^{\prime } {b^{\prime }› }w_2^{\prime }\). By induction, we have \(\exists E_2,\ ({[]},{\bot }) \stackrel{w_2^{\prime }}{\rightsquigarrow } ({v_2^{\prime }},{E_2})\). If \(\operatorname{head }(E)\) is None or a pending rule, then we can directly construct \({\mathrm{headNT}\left(v\right)} \Downarrow \left({w},{[e]+v_1^{\prime }+[r^{\prime }]+v_2^{\prime }}\right)\). If \(\operatorname{head }(E)\) is a matching rule, then we have \(v_2^{\prime }=v_1^{\prime \prime }+[r^{\prime \prime }]+v_2^{\prime \prime }\) and \(w_2^{\prime }=w_1^{\prime \prime } {b^{\prime \prime }› }w_2^{\prime \prime }\). We prove the lemma by combining the big-step of \(v_1^{\prime }\) and \(v_1^{\prime \prime }\) to \({\mathrm{headNT}\left(v\right)} \Downarrow \left({iw_1^{\prime } {b^{\prime }› }w_1^{\prime \prime }},{[e]+v_1^{\prime }+[r^{\prime }]+v_1^{\prime \prime }}\right)\).
(3)
\(i\in \Sigma _{\text{ret}}\). This case is trivial.□
Just as for the forward small-step relation, we can combine two backward small-step relations; we state this in Lemma D.12.
Lemma D.12.
Proof.
Prove it by induction over the big-step relation \({L} \Downarrow \left({w},{v}\right)\). Since \(L\in V^0\), there are two cases.
(1)
\(w=cw^{\prime }\) and \(v=[r]+v^{\prime }\). If \(w^{\prime }=\epsilon\), then we can prove the lemma straightforwardly; so, assume \(w^{\prime }\ne \epsilon\) and apply the induction hypothesis to \({\mathrm{headNT}\left(v^{\prime }\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\). We can directly extend the backward small-step relation of \(v^{\prime }\) with \([r]\). For any \(({[]},{\bot }) \stackrel{w_2}{\rightsquigarrow ^*} ({[L_1\rightarrow {‹ {a}}L_2 {b› }. L_3]+v_2},{E_2})\), we first extend it with the small-step relation of \(v^{\prime }\) by the induction hypothesis, then further extend the small-step relation with \([r]\) by definition.
(2)
\(w={‹ {a}}^{\prime } w_1^{\prime } {b^{\prime }› } w_2^{\prime }\) and \(v=[r_{{‹ {a}}^{\prime }}]+v_1^{\prime }+[r_{ {b^{\prime }› }}]+v_2^{\prime }\). Note that, since \(L\in V^0\), we have \(\mathrm{headNT}\left({v_2^{\prime }}\right)\in V^0\). For any \(({[]},{\bot }) \stackrel{w_2}{\rightsquigarrow ^*} ({[L_1\rightarrow {‹ {a}}L_2 {b› }. L_3]+v_2},{E_2})\), by the induction hypothesis of \(v_2^{\prime }\) and the definition of the backward small-step, we extend it with \(v_2^{\prime }\), \(r_{ {b^{\prime }› }}\), \(v_1^{\prime }\), and \(r_{{‹ {a}}^{\prime }}\) in order.□
Lemma D.13 (Correctness of the Backward Small-step Relation).
Proof.
\(\Rightarrow\). Prove it by induction over the big-step relation \({{L}} \Downarrow \left({w},{v}\right)\).
(1)
\(w=i w^{\prime }\), \(v=[r]+v^{\prime }\), and \({\mathrm{nextNT}\left(r\right)} \Downarrow \left({w^{\prime }},{v^{\prime }}\right)\), where r is not a matching rule. The case of \(w^{\prime }=\epsilon\) is trivial; so, assume \(w^{\prime }\ne \epsilon\). Apply the induction hypothesis to \(w^{\prime }\) and we have \(\exists E^{\prime },\ ({[]},{\bot }) \stackrel{w^{\prime }}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\). We can directly extend this backward small-step relation with \([r]\).
(2)
\(w={‹ {a}}w_1 {b› }w_2\), \(v=[r_1]+v_1+[r_2]+v_2\), and \({\mathrm{nextNT}\left(r_i\right)} \Downarrow \left({w_i},{v_i}\right)\), \(i=1,2\). If \(w_2\ne \epsilon\), then we apply the induction hypothesis to \({\mathrm{headNT}\left(v_2\right)} \Downarrow \left({w_2},{v_2}\right)\) and have the small-step relation of \(w_2\). We extend it to \(({[]},{\bot }) \stackrel{ {b› }w_2}{\rightsquigarrow ^*} ({[r_2]+v_2},{r_2\cdot \bot })\) by definition. If \(w_2=\epsilon\), then we have \(v_2=[]\) and this small-step also holds. Then, if \(w_1\ne \epsilon\), then we apply the induction hypothesis to \({\mathrm{headNT}\left(v_1\right)} \Downarrow \left({w_1},{v_1}\right)\) and have the small-step relation of \(w_1\). By Lemma D.12, we construct \(({[]},{\bot }) \stackrel{w_1 {b› }w_2}{\rightsquigarrow ^*} ({v_1+[r_2]+v_2},{\bot })\). If \(w_1=\epsilon\), then this small-step also holds. Finally, we extend the small-step relation to \(({[]},{\bot }) \stackrel{w}{\rightsquigarrow ^*} ({[r_1]+v_1+[r_2]+v_2},{\bot })\) by definition.
\(\impliedby\). This is a corollary of Lemma D.11.□
We give the specification of \(\operatorname{extract}_\mathrm{init}\) and \(\operatorname{extract}_{\mathrm{oneStep}}\) in Lemmas D.14, D.15, and D.16. The proofs are directly based on the definitions, and we omit them here.
Lemma D.14 (The Specification of \(\operatorname{extract}_\mathrm{init}\)).
Lemma D.15 shows that if m is a state and V includes partial parse trees in backward small-step relations, then \(\operatorname{extract}_{\mathrm{oneStep}}\) extends the relations; Lemma D.16 further shows that the relations are extended by rules in m and trees in V.
Lemma D.15 (The Specification of \(\operatorname{extract}_{\mathrm{oneStep}}\), Part I).
Lemma D.16 (The Specification of \(\operatorname{extract}_{\mathrm{oneStep}}\), Part II).
Now, we introduce the invariants of the extraction function.
Lemma D.17 (Invariants of the Extraction Function).
Given a VPG \(G=(\Sigma ,V,P,L_0)\), we have
Proof.
Prove it by induction over \(w_2\).
When \(w_2=\epsilon\), we have \(V=\operatorname{extract}_\mathrm{init}(m)\).
\(\Rightarrow\). For each \((v,E)\in \operatorname{extract}_\mathrm{init}(m)\), by Lemma D.14, we have \(({[]},{\bot }) \stackrel{i}{\rightsquigarrow } ({v},{E})\) and \(\exists r, v=[r]\). By Lemma D.10, we have
By Lemmas D.7 and D.14, the above small-step relation can be converted to a big-step relation; i.e., \({L_0} \Downarrow \left({w_1w_2i},{v_1+v}\right)\). If \(w_1=\epsilon\), then \(v_1=[]\) and \(w_1w_2i=i\), so \({L_0} \Downarrow \left({i},{v}\right)\). If \(w_1\ne \epsilon\), then we can invert the above small-step relation to get \(\exists E_1,\ ({[]},{\bot }) \stackrel{w_1}{\longrightarrow ^*} ({v_1},{E_1})\).
\(\impliedby\). For \((v,E)\) that satisfy \(({[]},{\bot }) \stackrel{w_1w_2i}{\rightsquigarrow ^*} ({v},{E})\), if \(w_1=\epsilon\) and \({L_0} \Downarrow \left({w_2i},{v}\right)\), then we first convert the big-step relation to a small-step relation by Lemma D.7, then, by Lemma D.9, we have \((\operatorname{head }(E),v)\in m\). Then, by Lemma D.14, we have \((v,E)\in V\). If \(w_1\ne \epsilon\) and \({L_0} \Downarrow \left({w_1i},{v_1+v}\right)\), then we can similarly show that \((\operatorname{head }(E),v)\in m\) and \((v,E)\in V\).
When \(w_2\ne \epsilon\), let \(w_2=i^{\prime }w_3\) and \(M_2=[m_2^{\prime }] + M_2^{\prime }\); then, \(V=\operatorname{extract}_{\mathrm{oneStep}}(m_2^{\prime },\operatorname{extract}(M_2^{\prime } + [m]))\).
\(\Rightarrow\). For each \((v,E)\in V\), by Lemma D.16, we have
As an overview, we will try to use \(v_r\) and \(v_1\) to construct the counterpart of v. In the simplest case, \(r_1=\text{None}\) and \(v_r\) can be directly used. In a more complex case, \(r_1\) is a matching rule, and we need to combine \(v_r\) and \(v_1\) to construct the counterpart. There are two cases of \(r_1\).
(1)
\(r_1=\text{None}\) or a pending rule. We will prove \({L_0} \Downarrow \left({w},{v_r+v}\right)\). We first convert the backward small-step \(({[]},{\bot }) \stackrel{w_3i}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) to a forward small-step. There are two cases of \(E^{\prime }\).
(a)
\(E^{\prime }=\bot\) or \(\operatorname{head }(E^{\prime })\) is a pending rule. We convert \(({[]},{\bot }) \stackrel{w_3i}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) to a big-step by Lemma D.13 and then convert the big-step to a forward small-step by Lemma D.7.
(b)
\(\operatorname{head }(E^{\prime })\) is a matching rule. This is not possible by Lemma D.16.
Then, we combine \(v_r+[r]\) and \(v^{\prime }\) to \(v_r+[r]+v^{\prime }\) by Lemma D.1. Finally, we convert \(v_r+[r]+v^{\prime }\) to a big-step by Lemma D.7 and finish the proof.
(2)
\(r_1\) is a matching rule. There are two cases of \(E^{\prime }\).
(a)
\(E^{\prime }=\bot\) or \(\operatorname{head }(E^{\prime })\) is a pending rule. This is not possible by Lemma D.16.
(b)
\(\operatorname{head }(E^{\prime })\) is a matching rule. We use the split function to split \(v_1\) into \(v_{1,1}+[r_{1,{‹ {a}}}]+v_{1,2}\) and split \(v_r+[r]\) into \(v_{r,1}+[r_{r,{‹ {a}}}]+v_{r,2}\). Then, we construct a new forward small-step \(({[]},{\bot }) \stackrel{w_1w_2i}{\rightarrow } ({v_{1,1}+[r_{1,{‹ {a}}}]+v_{r,2}},{E^{\prime \prime }})\) for some \(E^{\prime \prime }\). Then, we extend this small-step with \(\operatorname{firstRule}(v^{\prime })\) and use Lemma D.4 to further extend the small-step relation to \(v_{1,1}+[r_{1,{‹ {a}}}]+v_{r,2}+v^{\prime }\). Finally, we convert the small-step relation to a big-step by Lemma D.7.
\(\impliedby\). For each \((v,E)\) that satisfies \(({[]},{\bot }) \stackrel{w_1w_2i}{\rightsquigarrow ^*} ({v},{E})\), let \(v=[r]+v^{\prime }\) for some \(v^{\prime }\) and r, and \(w_2=i^{\prime }w_3\) for some \(i^{\prime }\) and \(w_3\). We invert the backward small-step of \(w_2i\) and have \(({[]},{\bot }) \stackrel{w_3i}{\rightsquigarrow ^*} ({v^{\prime }},{E^{\prime }})\) for some \(E^{\prime }\). Then, we convert \({L_0} \Downarrow \left({w_1w_2i},{v_1+v}\right)\) to a forward small-step \(({[]},{\bot }) \stackrel{w_1w_2i}{\longrightarrow ^*} ({v_1+v},{E})\) for some E, and by Lemma D.5, we have the forward small-step \(({[]},{\bot }) \stackrel{w_1i^{\prime }}{\longrightarrow ^*} ({v_1+[r]},{E_r})\) for some \(E_r\). With these small-steps, we apply the induction hypothesis and have \((v^{\prime },E^{\prime })\in \operatorname{extract}(M_2^{\prime } + [m])\), where \(M_2=[m_2^{\prime }]+ M_2^{\prime }\) for some \(m_2^{\prime }\). Then, by the definition of \(\operatorname{extract}_{\mathrm{oneStep}}\), we have \((v,E)\in V\).□
By Lemma D.17, when \(w_1=\epsilon\), we have the correctness of the parser generator.
Theorem D.18 (Correctness of the Parser Generator).
E Correspondence Between the Theorems and the Coq Proofs
Table 6 shows the theorems discussed in this article, the files that include the corresponding Coq theorems and their proofs, and the names of the Coq theorems.
Table 6. The Correspondence between the Theorems and the Mechanized Proofs
F Examples of Tagged CFGs
The following is a tagged CFG for JSON [3], where nonterminals start with lowercase characters, such as json, and terminals start with uppercase characters, such as “STRING.” Also, call and return symbols are tagged with “<” or “>”, respectively. The declarations of terminals are omitted.
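For illustration, a fragment of such a grammar (with hypothetical token names; not the exact grammar used in the evaluation) can look as follows, where "<LBRACE" tags LBRACE as a call symbol and "RBRACE>" tags RBRACE as a return symbol:

json : value ;
value : obj | arr | STRING | NUMBER | TRUE | FALSE | NULL ;
obj : <LBRACE pair (COMMA pair)* RBRACE> | <LBRACE RBRACE> ;
pair : STRING COLON value ;
arr : <LBRACK value (COMMA value)* RBRACK> | <LBRACK RBRACK> ;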
The following is a tagged CFG for HTML, which is adapted from the HTML grammar from the repository of ANTLR [3]:
The following is a tagged CFG for XML adapted from ANTLR [3]:
G The Translation Algorithm
Translating simple forms to linear forms. The translation algorithm is based on the dependency graph G. We start by removing from G any dependency edge \((L,L^{\prime })\) that is labeled with \((s_1, s_2)\); intuitively, a matched token \({‹ {a}}L^{\prime } {b› }\) can be conceptually viewed as a “plain symbol” and its presence does not affect the following translation. We use the symbol t for either a plain symbol or a matched token. Further, a rule whose head is L is called a rule of L.
We find all strongly connected components (SCCs) in the dependency graph G and collapse each SCC into a single node; the resulting condensation graph is written as \(G^{\mathit {SCC}}\), which is a directed acyclic graph (DAG). For a nonterminal L, we write \(G^{\mathit {SCC}}[L]\) for the nonterminals in its SCC. Because G passes the validator, we know that any edge between nonterminals \(L_1\) and \(L_2\) in \(G^{\mathit {SCC}}[L]\) must be labeled with \((s,\epsilon)\).
We write \(\mathrm{allDep}(G^{\mathit {SCC}},L)\) for all nonterminals that L may (transitively) depend on according to \(G^{\mathit {SCC}}\); that is,
The translation algorithm maintains a set of processed nonterminals \(V_\mathrm{done}\), a map T, and a set P of the current set of rules. Set \(V_\mathrm{done}\) is used to track the set of nonterminals whose rules have been processed, meaning those rules are in the linear form; it is initialized to be the empty set. The algorithm uses T to keep track of new nonterminals created during translation. It is initialized to the empty map. During translation, a new nonterminal \(L_1\) may be created for a string of the form \(Ls\), where L is a nonterminal and \(s \in (\Sigma \cup V)^+\); the algorithm then adds the mapping \((Ls, L_1)\) to T. Finally, P is initialized to be the set of rules in the input grammar; during translation, new rules may be added to P and some rules may be removed.
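Operationally, \(\mathrm{allDep}\) is reachability in the condensation DAG \(G^{\mathit {SCC}}\), excluding the nonterminals of L's own SCC. A sketch, assuming adjacency-list representations of our own choosing:

```ocaml
module S = Set.Make (String)

(* scc_of maps a nonterminal to its SCC id, members lists an SCC's
   nonterminals, and succs gives the DAG edges between SCC ids; all
   three are assumed inputs. *)
let all_dep
    ~(scc_of : string -> string)
    ~(members : string -> S.t)
    ~(succs : string -> string list)
    (l : string) : S.t =
  let rec reach seen c =
    if S.mem c seen then seen
    else List.fold_left reach (S.add c seen) (succs c)
  in
  let reachable = reach S.empty (scc_of l) in
  (* exclude L's own SCC: the iteration below must be able to pick an
     L with allDep(L) already processed while L itself is not *)
  S.fold
    (fun c acc -> S.union (members c) acc)
    (S.remove (scc_of l) reachable)
    S.empty
```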
At each iteration, if \(V_\mathrm{done}\) contains all nonterminals in G, then the algorithm terminates (which implies that all rules are already in linear forms). Otherwise, pick a nonterminal \(L \notin V_\mathrm{done}\) so \(\mathrm{allDep}(G^{\mathit {SCC}},L) \subseteq V_\mathrm{done}\); that is, nonterminals that \(G^{\mathit {SCC}}[L]\) depend on (transitively) have only linear-form rules. Since \(G^{\mathit {SCC}}\) is a DAG, we can always find such an L.
If all rules of L are in the linear form, then add L to \(V_\mathrm{done}\) and move to the next iteration. Otherwise, pick a non-linear form rule of \(L \rightarrow t_1 \ldots t_k L_1 s\), where each \(t_i\) is either a plain symbol or a matched token, \(L_1\) is the left-most nonterminal (outside any matched token) in the rule, and \(s \in (\Sigma \cup V)^+\). We must have that \(L_1 \notin G^{\mathit {SCC}}[L]\), since otherwise it would violate the conditions imposed by the validator. So the rules of \(L_1\) (as well as rules of those nonterminals that \(L_1\) transitively depends on) must already be in the linear form.
With the non-linear form rule \(L \rightarrow t_1 \ldots t_k L_1 s\), the algorithm performs multiple rounds to transform it to a set of linear-form rules.
•
The goal of the first round is to invoke \(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\), defined in Algorithm 5, to remove the dependency from L to \(L_1\). When \(k\gt 0\), the procedure introduces a new nonterminal \(L_1^{\prime }\) (if \(L_1s\) is not in the translation table already), changes the rule to \(L \rightarrow t_1 \ldots t_k L_1^{\prime }\), and adds the mapping \((L_1s, L_1^{\prime })\) to the translation table T; it then recursively invokes \(\mathrm{linearizeFirstNT(L_1^{\prime } \rightarrow L_1s)}\) to remove the dependency from \(L_1^{\prime }\) to \(L_1\). When \(k=0\), the rule becomes \(L \rightarrow L_1 s\) and the procedure in Algorithm 5 uses the rules of \(L_1\) to rewrite \(L_1s\); note that in this case \(L_1\) can have only linear-form rules. After \(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\) is finished, this round removes rule \(L \rightarrow t_1 \ldots t_k L_1 s\) from P and updates the dependency graph G (as new nonterminals are added and rules are changed).
•
It is easy to see from Algorithm 5 that all non-linear form rules added in the first round have the form of \(L_u \rightarrow \vec{t} s\), where \(L_u\) can be L or a newly generated nonterminal, \(\vec{t}\) is a sequence of plain symbols or matched tokens, and s is the string in the original rule \(L \rightarrow t_1 \ldots t_k L_1 s\). Compared to the original rule, progress has been made, since the size of s is less than the size of \(L_1s\).
In the next round, the algorithm then takes rules of form \(L_u \rightarrow \vec{t} s\) and feeds those non-linear ones to \(\mathrm{linearizeFirstNT(-)}\) to remove the dependency from \(L_u\) to the first nonterminal in s. We continue these rounds until the original rule \(L \rightarrow t_1 \ldots t_k L_1 s\) is completely linearized.
After \(L \rightarrow t_1 \ldots t_k L_1 s\) is completely linearized, the algorithm continues to the next iteration.
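The following sketch reflects our reading of \(\mathrm{linearizeFirstNT}\) as described in the rounds above; the representations (and the absence of unit rules) are assumptions, and the sketch is not Algorithm 5 verbatim.

```ocaml
(* A rule body is a list of items: T for a plain symbol or matched
   token, N for a nonterminal. *)
type item = T of string | N of string
type rule = string * item list   (* head, body *)

let fresh =
  let n = ref 0 in
  fun base -> incr n; Printf.sprintf "%s_%d" base !n

(* split the body at the first (outermost) nonterminal *)
let split_first_nt body =
  let rec go pre = function
    | N l :: s -> Some (List.rev pre, l, s)
    | t :: rest -> go (t :: pre) rest
    | [] -> None
  in
  go [] body

let rec linearize_first_nt
    ~(rules_of : string -> rule list)   (* linear-form rules of an NT *)
    (table : (item list, string) Hashtbl.t)
    ((l, body) : rule) : rule list =
  match split_first_nt body with
  | None -> [ (l, body) ]               (* already in linear form *)
  | Some ([], l1, s) ->
    (* k = 0: rewrite L1 by its linear-form rules and keep going;
       assumes no unit rules, so each step makes progress *)
    List.concat_map
      (fun ((_, beta) : rule) ->
         linearize_first_nt ~rules_of table (l, beta @ s))
      (rules_of l1)
  | Some (ts, l1, s) ->
    (* k > 0: name L1 s with a (possibly fresh) nonterminal L1' *)
    let key = N l1 :: s in
    let l1', is_new =
      match Hashtbl.find_opt table key with
      | Some l1' -> (l1', false)
      | None ->
        let l1' = fresh l1 in
        Hashtbl.add table key l1';
        (l1', true)
    in
    (l, ts @ [ N l1' ])
    :: (if is_new then linearize_first_nt ~rules_of table (l1', key)
        else [])
```

The memo table plays the role of T in Lemma G.1's termination argument: a recursive call is made only when \(L_1s\) is not yet in the table.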
We have now finished the discussion of the translation algorithm. For its correctness, we can prove that the algorithm produces an equivalent grammar as the input grammar by showing that each rewriting step creates an equivalent grammar.
For termination, the following lemma shows that \(\mathrm{linearizeFirstNT(-)}\) terminates:
Lemma G.1.
\(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\) terminates, if \(L_1\) and the nonterminals in \(\mathrm{allDep} (G^{\mathit {SCC}},L_1)\) have only linear-form rules.
Proof.
Inspecting Algorithm 5, we know that a recursive call to \(\mathrm{linearizeFirstNT(L_u^{\prime } \rightarrow L_u s)}\) is made only when \(L_u \in \lbrace L_1\rbrace \cup \mathrm{allDep}(G^{\mathit {SCC}},L_1)\) and \(L_us\) is not in T. Since the set \(\lbrace L_1\rbrace \cup \mathrm{allDep}(G^{\mathit {SCC}},L_1)\) is finite and \(L_us\) is added to T when a recursive call \(\mathrm{linearizeFirstNT(L_u^{\prime } \rightarrow L_u s)}\) is made, only a finite number of recursive calls can be made. Therefore, the procedure terminates.□
Further, as mentioned earlier, all rules generated when calling \(\mathrm{linearizeFirstNT(L \rightarrow t_1 \ldots t_k L_1 s)}\) have the form \(L_u \rightarrow \vec{t} s\), from which we know that the distance from the first nonterminal in \(\vec{t} s\) to the end of the rule is shorter than the distance from \(L_1\) to the end in \(t_1 \ldots t_k L_1 s\). Moreover, it is easy to show that only a finite number of rules of the form \(L_u \rightarrow \vec{t} s\) are generated. In summary, at each round of linearizing the rule \(L \rightarrow t_1 \ldots t_k L_1 s\), the number of invocations of \(\mathrm{linearizeFirstNT(-)}\) is finite, and round \(j+1\) is simpler than round j in the sense that a rule in round \(j+1\) has a shorter distance from the first nonterminal to the end of the rule. Therefore, the process of linearizing rule \(L \rightarrow t_1 \ldots t_k L_1 s\) terminates.
Finally, since the number of rules in the input grammar is finite, there is only a finite number of iterations. Therefore, the translation algorithm terminates.
H The Full Evaluation
The full evaluation results are shown in Tables 7–11.
Table 7. The Parsing Times (ms) of JSON

| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| blog_entries | 1,402 | 11.2 | 0.8 | 0.08 | 0.1 | 0.2 | 11.0 | 5.5 |
| gists | 6,302 | 15.1 | 1.9 | 0.36 | 0.5 | 0.9 | 12.3 | 3.9 |
| emojis | 7,198 | 14.7 | 2.0 | 0.40 | 0.5 | 0.9 | 12.8 | 4.6 |
| github | 11,680 | 17.2 | 2.7 | 0.77 | 1.4 | 2.2 | 15.9 | 6.8 |
| poked | 12,916 | 18.4 | 3.4 | 0.88 | 1.5 | 2.4 | 14.6 | 5.6 |
| Members | 13,226 | 17.8 | 4.3 | 0.87 | 1.3 | 2.2 | 13.0 | 3.9 |
| senator | 17,932 | 19.4 | 4.2 | 1.19 | 1.8 | 3.0 | 17.4 | 6.4 |
| AskReddit | 18,975 | 20.1 | 3.7 | 1.33 | 2.0 | 3.4 | 16.8 | 8.1 |
| parliament | 27,182 | 21.6 | 4.8 | 1.98 | 3.3 | 5.3 | 16.5 | 12.0 |
| prize | 30,210 | 22.1 | 5.8 | 2.24 | 3.2 | 5.4 | 17.2 | 10.6 |
| municipis | 39,292 | 27.2 | 5.6 | 3.06 | 3.8 | 6.8 | 23.8 | 11.6 |
| reddit_all | 52,941 | 26.5 | 8.0 | 4.14 | 7.0 | 11.1 | 26.1 | 15.1 |
| y77d_th95 | 54,666 | 27.2 | 8.3 | 4.04 | 5.6 | 9.6 | 23.8 | 12.5 |
| twitter | 55,264 | 23.9 | 6.8 | 4.37 | 5.5 | 9.9 | 104.9 | 24.3 |
| representative | 70,750 | 28.7 | 7.7 | 4.92 | 7.6 | 12.5 | 29.2 | 24.8 |
| transactions | 71,744 | 30.5 | 9.5 | 5.54 | 8.1 | 13.6 | 24.2 | 19.5 |
| laureate | 79,953 | 31.6 | 11.0 | 4.90 | 11.3 | 16.2 | 25.2 | 12.9 |
| catalog | 135,991 | 53.5 | 20.2 | 7.33 | 17.0 | 24.3 | 39.8 | 39.1 |
| canada | 334,374 | 103.7 | 49.9 | 17.02 | 51.5 | 68.6 | 46.7 | 39.5 |
| educativos | 426,398 | 92.3 | 40.5 | 20.47 | 48.2 | 68.7 | 108.5 | 75.3 |
| airlines | 555,410 | 106.5 | 58.0 | 30.01 | 68.2 | 98.2 | 87.8 | 91.3 |
| movies | 751,917 | 171.2 | 81.9 | 39.40 | 125.1 | 164.5 | 87.9 | 97.9 |
| js | 1,288,350 | 224.5 | 127.5 | 72.14 | 161.9 | 234.0 | 141.6 | 141.4 |
Table 8. The Parsing Time (ms) of XML

| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nav_63_0 | 351 | 10.18 | 1.40 | 0.02 | 0.01 | 0.03 | 10.56 | 0.76 |
| nav_78_0 | 354 | 8.90 | 1.32 | 0.02 | 0.02 | 0.04 | 10.44 | 0.74 |
| nav_50_0 | 523 | 9.87 | 1.74 | 0.03 | 0.02 | 0.05 | 11.06 | 0.98 |
| cd_catalog | 733 | 9.71 | 1.58 | 0.04 | 0.03 | 0.07 | 9.41 | 0.67 |
| form | 740 | 10.50 | 1.68 | 0.04 | 0.03 | 0.07 | 14.47 | 1.35 |
| OfficeOrder | 898 | 10.58 | 1.89 | 0.05 | 0.04 | 0.09 | 13.40 | 1.17 |
| book-order | 2,723 | 12.73 | 2.73 | 0.14 | 0.28 | 0.42 | 12.69 | 1.80 |
| book | 2,723 | 13.38 | 2.96 | 0.14 | 0.28 | 0.41 | 12.69 | 1.76 |
| bioinfo | 3,165 | 13.51 | 2.80 | 0.17 | 0.27 | 0.44 | 15.93 | 2.36 |
| soap_small | 3,415 | 13.24 | 3.27 | 0.17 | 0.37 | 0.54 | 14.40 | 2.04 |
| cd_big | 4,829 | 14.69 | 3.24 | 0.26 | 0.60 | 0.87 | 13.36 | 2.49 |
| soap_mid | 17,015 | 20.26 | 7.34 | 1.17 | 2.92 | 4.09 | 18.89 | 10.89 |
| ORTCA | 39,072 | 30.02 | 8.30 | 3.22 | 6.31 | 9.53 | 54.09 | 71.00 |
| blog | 43,000 | 24.52 | 11.35 | 3.38 | 6.92 | 10.30 | 55.60 | 35.01 |
| SUAS | 118,446 | 54.32 | 30.51 | 6.37 | 20.56 | 26.93 | 88.16 | 130.68 |
| po1m | 128,742 | 65.65 | 28.18 | 6.54 | 20.74 | 27.28 | 48.21 | 30.79 |
| soap | 340,015 | 125.32 | 58.69 | 16.15 | 55.71 | 71.86 | 67.39 | 52.75 |
| bioinfo_big | 406,493 | 107.50 | 67.30 | 20.85 | 72.30 | 93.15 | 89.64 | 74.09 |
| address | 1,437,142 | 253.83 | 231.90 | 74.04 | 283.35 | 357.40 | 200.98 | 250.01 |
| cd | 4,198,694 | 720.94 | 734.56 | 204.18 | 704.03 | 908.21 | 415.73 | 508.57 |
| po | 9,266,526 | 1,359.55 | 1,834.59 | 461.16 | 1,527.52 | 1,988.69 | 854.08 | 1,617.16 |
Table 9. The Parsing Time (ms) of HTML

| Name | #Token | ANTLR Parse | Bison Parse | VPG Parse + Extract | VPG Conv | VPG Total | ANTLR Lex | JFlex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uglylink | 13 | 14.28 | 2.03 | 0.00 | 0.00 | 0.00 | 9.77 | 0.14 |
| script1 | 26 | 14.88 | 2.20 | 0.00 | 0.00 | 0.00 | 8.18 | 0.20 |
| style1 | 26 | 15.42 | 2.20 | 0.00 | 0.00 | 0.00 | 7.89 | 0.22 |
| attvalues | 27 | 35.70 | 2.14 | 0.00 | 0.00 | 0.01 | 11.50 | 0.21 |
| html4 | 38 | 16.42 | 2.21 | 0.00 | 0.00 | 0.01 | 10.85 | 0.31 |
| antlr | 545 | 37.14 | 4.75 | 0.04 | 0.23 | 0.26 | 22.49 | 2.63 |
| google | 659 | 24.08 | 3.77 | 0.04 | 0.27 | 0.30 | 31.78 | 33.82 |
| gnu | 1,013 | 38.58 | 5.40 | 0.06 | 0.10 | 0.16 | 19.86 | 2.95 |
| freebsd | 1,521 | 49.51 | 5.74 | 0.09 | 0.21 | 0.29 | 19.29 | 2.54 |
| abc.com | 1,524 | 43.20 | 4.83 | 0.09 | 0.21 | 0.30 | 30.64 | 15.06 |
| github | 2,547 | 46.82 | 6.83 | 0.21 | 0.26 | 0.47 | 26.12 | 4.71 |
| metafilter | 2,615 | 50.94 | 5.49 | 0.21 | 0.45 | 0.66 | 33.13 | 30.18 |
| wikipedia | 2,750 | 37.49 | 6.46 | 0.24 | 0.37 | 0.61 | 33.93 | 10.71 |
| nbc.com | 3,821 | 94.82 | 7.37 | 0.32 | 0.59 | 0.92 | 46.21 | 14.79 |
| bbc | 3,883 | 88.48 | 5.62 | 0.32 | 0.80 | 1.12 | 95.48 | 51.55 |
| cnn1 | 4,974 | 89.63 | 6.84 | 0.29 | 1.02 | 1.31 | 42.38 | 34.00 |
| reddit2 | 4,976 | 31.97 | 6.85 | 0.28 | 1.11 | 1.39 | 53.65 | 41.84 |
| reddit | 4,989 | 31.37 | 7.41 | 0.27 | 1.10 | 1.37 | 53.29 | 38.30 |
| digg | 6,250 | 126.95 | 8.53 | 0.51 | 1.34 | 1.85 | 52.07 | 38.00 |
| youtube | 16,316 | 49.35 | 11.56 | 1.05 | 5.92 | 6.97 | 54.83 | 58.53 |
Table 10. Comparison with Hand-crafted Parsers (ms) of Parsing JSON

| Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | SpiderM | JSCore | V8 | Chakra |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Members | 1,402 | 13.03 | 0.87 | 2.24 | 15.27 | 15.06 | 45.33 | 8.80 | 13.05 |
| poked | 6,302 | 14.61 | 0.88 | 2.40 | 17.00 | 14.82 | 50.83 | 8.72 | 12.68 |
| gists | 7,198 | 12.38 | 0.36 | 0.90 | 13.28 | 16.29 | 85.23 | 9.21 | 13.72 |
| senator | 11,680 | 17.49 | 1.19 | 3.07 | 20.57 | 16.01 | 55.43 | 9.98 | 13.87 |
| AskReddit | 12,916 | 16.81 | 1.33 | 3.42 | 20.24 | 14.67 | 51.59 | 9.28 | 13.42 |
| blog_entries | 13,226 | 11.03 | 0.08 | 0.26 | 11.28 | 14.51 | 51.90 | 8.85 | 13.21 |
| github | 17,932 | 15.99 | 0.77 | 2.21 | 18.20 | 15.78 | 66.17 | 9.99 | 12.43 |
| emojis | 18,975 | 12.83 | 0.40 | 0.98 | 13.82 | 15.40 | 46.53 | 9.19 | 14.14 |
| parliament | 27,182 | 16.53 | 1.98 | 5.35 | 21.88 | 16.36 | 70.35 | 10.38 | 13.69 |
| prize | 30,210 | 17.25 | 2.24 | 5.45 | 22.71 | 16.76 | 134.30 | 9.98 | 13.15 |
| y77d_th95 | 39,292 | 23.87 | 4.04 | 9.63 | 33.50 | 20.72 | 178.03 | 12.53 | 16.35 |
| municipis | 52,941 | 23.89 | 3.06 | 6.89 | 30.78 | 20.02 | 60.20 | 11.48 | 16.54 |
| laureate | 54,666 | 25.26 | 4.90 | 16.22 | 41.47 | 17.41 | 105.22 | 11.96 | 15.37 |
| reddit_all | 55,264 | 26.15 | 4.14 | 11.15 | 37.30 | 22.00 | 92.54 | 15.61 | 17.44 |
| transactions | 70,750 | 24.23 | 5.54 | 13.65 | 37.87 | 22.27 | 132.86 | 15.21 | 16.19 |
| twitter | 71,744 | 104.96 | 4.37 | 9.91 | 114.88 | 21.09 | 78.90 | 11.78 | 18.35 |
| representative | 79,953 | 29.22 | 4.92 | 12.56 | 41.78 | 20.72 | 178.03 | 12.53 | 16.35 |
| catalog | 135,991 | 39.83 | 7.33 | 24.36 | 64.19 | 33.81 | 70.64 | 27.59 | 24.77 |
| canada | 334,374 | 46.74 | 17.02 | 68.61 | 115.35 | 57.14 | 67.51 | 33.95 | 43.83 |
| educativos | 426,398 | 108.58 | 20.47 | 68.75 | 177.34 | 70.78 | 420.99 | 44.53 | 48.64 |
| airlines | 555,410 | 87.82 | 30.01 | 98.26 | 186.08 | 73.86 | 95.12 | 41.93 | 55.52 |
| movies | 751,917 | 87.92 | 39.40 | 164.58 | 252.51 | 70.93 | 106.41 | 68.00 | 50.85 |
| JSON.parse | 1,288,350 | 141.64 | 72.14 | 234.07 | 375.71 | 118.11 | 139.21 | 76.36 | 87.81 |
Table 11. Comparison with Hand-crafted Parsers (ms) of Parsing XML

| Name | #Token | ANTLR Lex | VPG Parse + Extract | VPG Total | ANTLR Lex + VPG Total | Fast-XML | Libxmljs | SAX-JS | HTMLP2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nav_63_0 | 351 | 10.56 | 0.02 | 0.03 | 10.59 | 0.14 | 0.13 | 0.67 | 0.20 |
| nav_78_0 | 354 | 10.44 | 0.02 | 0.04 | 10.47 | 0.12 | 0.12 | 0.68 | 0.20 |
| nav_50_0 | 523 | 11.06 | 0.03 | 0.05 | 11.11 | 0.21 | 0.17 | 0.99 | 0.30 |
| cd_catalog | 733 | 9.41 | 0.04 | 0.07 | 9.48 | 0.22 | 0.09 | 0.36 | 0.16 |
| form | 740 | 14.47 | 0.04 | 0.07 | 14.54 | 0.38 | 0.17 | 0.93 | 0.27 |
| OfficeOrder | 898 | 13.40 | 0.05 | 0.09 | 13.49 | 0.37 | 0.14 | 0.88 | 0.35 |
| book-order | 2,723 | 12.69 | 0.14 | 0.42 | 13.11 | 0.69 | 0.26 | 1.41 | 0.54 |
| book | 2,723 | 12.69 | 0.14 | 0.41 | 13.10 | 0.64 | 0.26 | 1.41 | 0.54 |
| bioinfo | 3,165 | 15.93 | 0.17 | 0.44 | 16.37 | 0.99 | 0.38 | 2.16 | 0.90 |
| soap_small | 3,415 | 14.40 | 0.17 | 0.54 | 14.94 | 0.85 | 0.40 | 2.42 | 0.78 |
| cd_big | 4,829 | 13.36 | 0.26 | 0.87 | 14.23 | 1.17 | 0.44 | 2.20 | 0.88 |
| soap_mid | 17,015 | 18.89 | 1.17 | 4.09 | 22.98 | 4.11 | 1.95 | 11.74 | 3.81 |
| ORTCA | 39,072 | 54.09 | 3.22 | 9.53 | 63.62 | 138.49 | 91.25 | 664.62 | 88.79 |
| blog | 43,000 | 55.60 | 3.38 | 10.30 | 65.91 | 29.57 | 6.56 | 48.07 | 17.83 |
| SUAS | 118,446 | 88.16 | 6.37 | 26.93 | 115.09 | 254.39 | 182.43 | 1,213.67 | 168.74 |
| po1m | 128,742 | 48.21 | 6.54 | 27.28 | 75.49 | 33.51 | 11.56 | 95.22 | 27.11 |
| soap | 340,015 | 67.39 | 16.15 | 71.86 | 139.25 | 110.14 | 38.13 | 260.88 | 76.24 |
| bioinfo_big | 406,493 | 89.64 | 20.85 | 93.15 | 182.79 | 116.93 | 42.74 | 276.96 | 82.29 |
| address | 1,437,142 | 200.98 | 74.04 | 357.40 | 558.37 | 584.40 | 196.12 | 1,012.01 | 330.51 |
| cd | 4,198,694 | 415.73 | 204.18 | 908.21 | 1,323.94 | 1,298.48 | 419.07 | 2,102.58 | 734.97 |
| po | 9,266,526 | 854.08 | 461.16 | 1,988.69 | 2,842.77 | 3,278.30 | 897.30 | 6,617.60 | 1,826.90 |
References
Aditi Barthwal and Michael Norrish. 2009. Verified, executable parsing. In Programming Languages and Systems, Giuseppe Castagna (Ed.). Springer Berlin, 160–174.
Nils Anders Danielsson. 2010. Total parser combinators. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming (ICFP’10). Association for Computing Machinery, New York, NY, 285–296.
Romain Edelmann, Jad Hamza, and Viktor Kunčak. 2020. Zippy LL(1) parsing with derivatives. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’20). Association for Computing Machinery, New York, NY, 1036–1051.
Ian Henriksen, Gianfranco Bilardi, and Keshav Pingali. 2019. Derivative grammars: A symbolic approach to parsing with derivatives. Proc. ACM Program. Lang. 3, OOPSLA (2019), 1–28.
Xiaodong Jia, Ashish Kumar, and Gang Tan. 2021. A derivative-based parser generator for visibly pushdown grammars. Proc. ACM Program. Lang. 5, OOPSLA (2021), 1–24.
Jacques-Henri Jourdan, François Pottier, and Xavier Leroy. 2012. Validating LR(1) parsers. In Programming Languages and Systems, Helmut Seidl (Ed.). Springer Berlin, 397–416.
Tadao Kasami. 1965. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Technical Report. Air Force Cambridge Research Laboratory.
Adam Koprowski and Henri Binsztok. 2010. TRX: A formally verified parser interpreter. In Programming Languages and Systems, Andrew D. Gordon (Ed.). Springer Berlin, 345–365.
Sam Lasser, Chris Casinghino, Kathleen Fisher, and Cody Roux. 2019. A verified LL(1) parser generator. In 10th International Conference on Interactive Theorem Proving (ITP 2019) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 141), John Harrison, John O’Leary, and Andrew Tolmach (Eds.). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 24:1–24:18.
Sam Lasser, Chris Casinghino, Kathleen Fisher, and Cody Roux. 2021. CoStar: A verified ALL(*) parser. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI’21). Association for Computing Machinery, New York, NY, 420–434.
Jay McCarthy, Burke Fetscher, Max New, Daniel Feltey, and Robert Bruce Findler. 2016. A Coq library for internal verification of running-times. In Functional and Logic Programming, Oleg Kiselyov and Andy King (Eds.). Springer International Publishing, Cham, 144–162.
Matthew Might, David Darais, and Daniel Spiewak. 2011. Parsing with derivatives: A functional pearl. In Proceedings of the 16th ACM SIGPLAN International Conference on Functional Programming (ICFP’11). Association for Computing Machinery, New York, NY, 189–195.
Greg Morrisett, Gang Tan, Joseph Tassarotti, Jean-Baptiste Tristan, and Edward Gan. 2012. RockSalt: Better, faster, stronger SFI for the x86. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’12). Association for Computing Machinery, New York, NY, 395–404.
Terence Parr, Sam Harwell, and Kathleen Fisher. 2014. Adaptive LL(*) parsing: The power of dynamic analysis. SIGPLAN Not. 49, 10 (Oct. 2014), 579–598.
Theofilos Petsios, Adrian Tang, Salvatore Stolfo, Angelos D. Keromytis, and Suman Jana. 2017. NEZHA: Efficient domain-independent differential testing. In Proceedings of the IEEE Symposium on Security and Privacy (SP’17). IEEE Computer Society, 615–632.
Tahina Ramananandro, Antoine Delignat-Lavaud, Cédric Fournet, Nikhil Swamy, Tej Chajed, Nadim Kobeissi, and Jonathan Protzenko. 2019. EverParse: Verified secure zero-copy parsers for authenticated message formats. In Proceedings of the 28th USENIX Security Symposium (USENIX Security’19). USENIX Association, 1465–1482. Retrieved from https://www.usenix.org/conference/usenixsecurity19/presentation/delignat-lavaud.
Tom Ridge. 2011. Simple, functional, sound and complete parsing for all context-free grammars. In Certified Programs and Proofs, Jean-Pierre Jouannaud and Zhong Shao (Eds.). Springer Berlin, 103–118.