Regular Expressions/Implementation
Implementations and running times
[edit | edit source]There are at least 3 different algorithms that decide if (and how) a given string matches a regular expression. They are based on different representations of the regular expression as a Finite Automaton and on the amount of functionality present in the matcher.
- An NFA based matcher without back-references and look ahead/behind. An input of size O(n) can be tested against a regular expression of size O(m) in time O(nm), and additional O(m) extra space by simulating an NFA using Thompson's algorithm. If c sub-match capture groups are to be recorded, then the running time increases to O(nm log c), but the space requirement remains O(m).
- An NFA based matcher with back-references and look ahead/behind. Such a matcher needs to be implemented using backtracking. An input of size O(n) can be tested against a regular expression of size O(m) in time O(2mn) using backtracking. Some effort is needed to ensure that the backtracking based matcher doesn't enter an infinite loop, testing the same path over and over again.
- A DFA based matcher. DFA based matchers can't support back-references, sub-match captures, or look ahead/behind. This is the oldest and fastest kind of matcher and relies on a result in formal language theory that allows every nondeterministic Finite State Machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process (DFA matching) takes time that is proportional to the length of the input string. More precisely, a regular expression of size m on an input alphabet of size S can be converted into a DFA in time O(2mS), and subsequently an input string of size n can be tested against a DFA of any size in time O(n).
The DFA based algorithm is fast to match input against a regular expression, but can be used only for matching and not for recalling grouped subexpressions. There is a variant that can recall grouped subexpressions, but its running time slows down to O(n2m) [citation needed].
The running time of the backtracking based algorithm can be exponential, which simple implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. More complex implementations identify and speed up various common cases where they would otherwise run slowly.
Even though backtracking implementations only give an exponential guarantee in the worst case, they allow much greater flexibility and provide more expressive power. For instance any implementation that allows the use of backreferences, or implements the various improvements that Perl introduced, must use a backtracking implementation.
Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.