ReDoS

Last updated

A regular expression denial of service (ReDoS) [1] is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many [2] regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive. [3] [4]

Contents

Description

Regular expression ("regex") matching can be done by building a finite-state automaton. Regex can be easily converted to nondeterministic automata (NFAs), in which for each state and input symbol, there may be several possible next states. After building the automaton, several possibilities exist:

Of the above algorithms, the first two are problematic. The first is problematic because a deterministic automaton could have up to states where is the number of states in the nondeterministic automaton; thus, the conversion from NFA to DFA may take exponential time. The second is problematic because a nondeterministic automaton could have an exponential number of paths of length , so that walking through an input of length will also take exponential time. [7] The last two algorithms, however, do not exhibit pathological behavior.

Note that for non-pathological regular expressions, the problematic algorithms are usually fast, and in practice, one can expect them to "compile" a regex in O(m) time and match it in O(n) time; instead, simulation of an NFA and lazy computation of the DFA have O(m2n) worst-case complexity. [lower-alpha 1] Regex denial of service occurs when these expectations are applied to a regex provided by the user, and malicious regular expressions provided by the user trigger the worst-case complexity of the regex matcher.

While regex algorithms can be written in an efficient way, most regex engines in existence extend the regex languages with additional constructs that cannot always be solved efficiently. Such extended patterns essentially force the implementation of regex in most programming languages to use backtracking.

Examples

Exponential backtracking

The most severe type of problem happens with backtracking regular expression matches, where some patterns have a runtime that is exponential in the length of the input string. [8] For strings of characters, the runtime is . This happens when a regular expression has three properties:

The second condition is best explained with two examples:

In both of these examples we used $ to match the end of the string, satisfying the third condition, but it is also possible to use another character for this. For example (a|aa)*c has the same problematic structure.

All three of the above regular expressions will exhibit exponential runtime when applied to strings of the form . For example, if you try to match them against aaaaaaaaaaaaaaaaaaaaaaaa! on a backtracking expression engine, it will take a significantly long time to complete, and the runtime will approximately double for each extra a before the !.

It is also possible to have backtracking which is polynomial time , instead of exponential. This can also cause problems for long enough inputs, though less attention has been paid to this problem as malicious input must be much longer to have a significant effect. An example of such a pattern is "a*b?a*x", when the input is an arbitrarily long sequence of "a"s.

Vulnerable regexes in online repositories

So-called "evil" or vulnerable regexes have been found in online regular expression repositories. Note that it is enough to find a vulnerable subexpression in order to attack the full regex:

  1. RegExLib, id=1757 (email validation) - see red part
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
  2. OWASP Validation Regex Repository, Java Classname - see red part
    ^(([a-z])+.)+[A-Z]([a-z])+$

These two examples are also susceptible to the input aaaaaaaaaaaaaaaaaaaaaaaa!.

Attacks

If the regex itself is affected by user input, such as a web service permitting clients to provide a search pattern, then an attacker can inject a malicious regex to consume the server's resources. Therefore, in most cases, regular expression denial of service can be avoided by removing the possibility for the user to execute arbitrary patterns on the server. In this case, web applications and databases are the main vulnerable applications. Alternatively, a malicious page could hang the user's web browser or cause it to use arbitrary amounts of memory.

However, if a vulnerable regex exists on the server-side already, then an attacker may instead be able to provide an input that triggers its worst-case behavior. In this case, e-mail scanners and intrusion detection systems could also be vulnerable.

In the case of a web application, the programmer may use the same regular expression to validate input on both the client and the server side of the system. An attacker could inspect the client code, looking for evil regular expressions, and send crafted input directly to the web server in order to hang it. [9]

Mitigation

ReDoS can be mitigated without changes to the regular expression engine, simply by setting a time limit for the execution of regular expressions when untrusted input is involved. [10]

ReDoS can be avoided entirely by using a non-vulnerable regular expression implementation. After CloudFlare's web application firewall (WAF) was brought down by a PCRE ReDoS in 2019, the company rewrote its WAF to use the non-backtracking Rust regex library, using an algorithm similar to RE2. [11] [12]

Vulnerable regular expressions can be detected programmatically by a linter. [13] Methods range from pure static analysis [14] [15] to fuzzing. [16] In most cases, the problematic regular expressions can be rewritten as "non-evil" patterns. For example, (.*a)+ can be rewritten to ([^a]*a)+. Possessive matching and atomic grouping, which disable backtracking for parts of the expression, [17] can also be used to "pacify" vulnerable parts. [18] [19]

See also

Related Research Articles

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

Computability is the ability to solve a problem in an effective manner. It is a key topic of the field of computability theory within mathematical logic and the theory of computation within computer science. The computability of a problem is closely linked to the existence of an algorithm to solve the problem.

<span class="mw-page-title-main">Deterministic finite automaton</span> Finite-state machine

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

In the theory of computation, a generalized nondeterministic finite automaton (GNFA), also known as an expression automaton or a generalized nondeterministic finite state machine, is a variation of a nondeterministic finite automaton (NFA) where each transition is labeled with any regular expression. The GNFA reads blocks of symbols from the input which constitute a string as defined by the regular expression on the transition. There are several differences between a standard finite state machine and a generalized nondeterministic finite state machine. A GNFA must have only one start state and one accept state, and these cannot be the same state, whereas an NFA or DFA both may have several accept states, and the start state can be an accept state. A GNFA must have only one transition between any two states, whereas a NFA or DFA both allow for numerous transitions between states. In a GNFA, a state has a single transition to every state in the machine, although often it is a convention to ignore the transitions that are labelled with the empty set when drawing generalized nondeterministic finite state machines.

In automata theory, an alternating finite automaton (AFA) is a nondeterministic finite automaton whose transitions are divided into existential and universal transitions. For example, let A be an alternating automaton.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the same formal language. It is important in theory because it establishes that NFAs, despite their additional flexibility, are unable to recognize any language that cannot be recognized by some DFA. It is also important in practice for converting easier-to-construct NFAs into more efficiently executable DFAs. However, if the NFA has n states, the resulting DFA may have up to 2n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs.

In computer science, in particular in automata theory, a two-way finite automaton is a finite automaton that is allowed to re-read its input.

In mathematics and computer science, the probabilistic automaton (PA) is a generalization of the nondeterministic finite automaton; it includes the probability of a given transition into the transition function, turning it into a transition matrix. Thus, the probabilistic automaton also generalizes the concepts of a Markov chain and of a subshift of finite type. The languages recognized by probabilistic automata are called stochastic languages; these include the regular languages as a subset. The number of stochastic languages is uncountable.

<span class="mw-page-title-main">DFA minimization</span> Task of transforming a deterministic finite automaton

In automata theory, DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equivalent if they recognize the same regular language. Several different algorithms accomplishing this task are known and described in standard textbooks on automata theory.

RE2 is a software library which implements a regular expression engine via finite-state machines using automata theory, in contrast to almost all other regular expression libraries, which use backtracking implementations. It provides a C++ interface.

In computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson.

In computational learning theory, induction of regular languages refers to the task of learning a formal description of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way, approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction.

In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equivalence of several description formats for regular languages. Alternative presentations of the same method include the "elimination method" attributed to Brzozowski and McCluskey, the algorithm of McNaughton and Yamada, and the use of Arden's lemma.

<span class="mw-page-title-main">Weighted automaton</span> Finite-state machine where edges carry weights

In theoretical computer science and formal language theory, a weighted automaton or weighted finite-state machine is a generalization of a finite-state machine in which the edges have weights, for example real numbers or integers. Finite-state machines are only capable of answering decision problems; they take as input a string and produce a Boolean output, i.e. either "accept" or "reject". In contrast, weighted automata produce a quantitative output, for example a count of how many answers are possible on a given input string, or a probability of how likely the input string is according to a probability distribution. They are one of the simplest studied models of quantitative automata.

In automata theory, an unambiguous finite automaton (UFA) is a nondeterministic finite automaton (NFA) such that each word has at most one accepting path. Each deterministic finite automaton (DFA) is an UFA, but not vice versa. DFA, UFA, and NFA recognize exactly the same class of formal languages. On the one hand, an NFA can be exponentially smaller than an equivalent DFA. On the other hand, some problems are easily solved on DFAs and not on UFAs. For example, given an automaton A, an automaton A which accepts the complement of A can be computed in linear time when A is a DFA, whereas it is known that this cannot be done in polynomial time for UFAs. Hence UFAs are a mix of the worlds of DFA and of NFA; in some cases, they lead to smaller automata than DFA and quicker algorithms than NFA.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

In the automata theory, a tagged deterministic finite automaton (TDFA) is an extension of deterministic finite automaton (DFA). In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression.

References

  1. Lazy computation of the DFA can usually reach the speed of deterministic automatons while keeping worst case behavior similar to simulation of an NFA. However, it is considerably more complex to implement and can use more memory.
  1. OWASP (2010-02-10). "Regex Denial of Service" . Retrieved 2010-04-16.
  2. Davis, James; Louis, Michael; Coghlan, Christy; Servant, Francisco; Lee, Dongyoon (2019). "Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions" (PDF). The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering: 443–454.
  3. RiverStar Software (2010-01-18). "Security Bulletin: Caution Using Regular Expressions". Archived from the original on 2011-07-15. Retrieved 2010-04-16.
  4. Ristic, Ivan (2010-03-15). ModSecurity Handbook. London, UK: Feisty Duck Ltd. p. 173. ISBN   978-1-907117-02-2. Archived from the original on 2016-08-08. Retrieved 2010-04-16.
  5. Crosby and Wallach, Usenix Security (2003). "Regular Expression Denial Of Service". Archived from the original on 2005-03-01. Retrieved 2010-01-13.
  6. Bryan Sullivan, MSDN Magazine (2010-05-03). "Regular Expression Denial of Service Attacks and Defenses" . Retrieved 2010-05-06.{{cite web}}: External link in |author= (help)CS1 maint: numeric names: authors list (link)
  7. Kirrage, J.; Rathnayake, A.; Thielecke, H. (2013). "Static Analysis for Regular Expression Denial-of-Service Attacks". Network and System Security. Madrid, Spain: Springer. pp. 135–148. arXiv: 1301.0849 . doi:10.1007/978-3-642-38631-2_11.
  8. Jim Manico and Adar Weidman (2009-12-07). "OWASP Podcast 56 (ReDoS)" . Retrieved 2010-04-02.
  9. Barlas, Efe; Du, Xin; Davis, James (2022). "Exploiting Input Sanitization for Regex Denial of Service" (PDF). ACM/IEEE International Conference on Software Engineering: 1–14. arXiv: 2303.01996 .
  10. "Backtracking in .NET regular expressions - .NET". learn.microsoft.com. 11 August 2023. When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions, causing a Denial-of-Service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.
  11. "Making the WAF 40% faster". The Cloudflare Blog. 1 July 2020.
  12. Cox, Russ (2007). "Regular Expression Matching Can Be Simple And Fast" . Retrieved 2011-04-20. describes the RE2 algorithm
  13. See e.g. Schmidt, Michael (30 March 2023). "RunDevelopment/scslre". GitHub ., TSUYUSATO, Kitsune. "recheck Introduction"., and Davis, James. "vuln-regex-detector/src/detect/README.md". GitHub.
  14. H. Thielecke, A. Rathnayake (2013). "Regular expression denial of service (ReDoS) static analysis Archived 2014-08-03 at the Wayback Machine ". Retrieved 2013-05-30.
  15. B. van der Merwe, N Weideman (2017). "Regex Static Analysis". Retrieved 2017-08-12.
  16. "Fuzzing with Static Analysis | recheck". makenowjust-labs.github.io.
  17. "Essential classes: Regular Expressions: Quantifiers: Differences Among Greedy, Reluctant, and Possessive Quantifiers". The Java Tutorials. Oracle. Archived from the original on 7 October 2020. Retrieved 23 December 2016.
  18. "compose-regexp.js, "Atomic matching"". GitHub. 2 January 2024.
    "tc39/proposal-regexp-atomic-operators". Ecma TC39. 31 December 2023.
  19. "Preventing Regular Expression Denial of Service (ReDoS)". www.regular-expressions.info.