I'm looking for an algorithm that is as efficient as possible for a regex engine that supports sub-match tracking (a.k.a capturing parentheses) and backreferencing. What I mean by as efficient as possible is this the simple parts of the regex (as
. *. *) must always be matched in linear time, that the back references immediately match or be exponential.
Thompson's NFA Algorithm seems to be a very promising basic algorithm because it can correspond to a large regex $ m $ to a size chain $ n $ with a complexity in the worst case $ O (mn) $ and a complexity of space worst case $ O (m) $.
It could be easily modified to support capture parentheses by incorporating in the active states all captured substrings leading to the current state. When a given state is accessible from two paths, some rules are used to decide which set of substrings is preserved (for example, the leftmost). Correct me if I'm wrong, I do not think this method changes the time complexity in the worst case. However, the use of memory in the worst case goes to $ O (m ^ 2) $ instead of $ O (m) $ since each active state must know at least one set of substrings for all previous capture parentheses.
Back references are where I'm stuck. I know that an exponential time in the worst case is inevitable (unless P = NP). But the only solution I can think of is taking an exponential amount of memory. The idea would be to take the algorithm above and keep all the matching substrings combinations so that future back references can choose which ones match.
I think it would be possible to create an algorithm to save the state of the regular expression engine (which states of the NFA were active with their associated data) when an active state left a referenced capture parenthesis backwards -plan. In order for us to engage in the first match and later, the backward reference fails, it will return to the saved state and attempt the next match for that pair of parentheses.
The problem with this idea is that it does not allow for a preference when more than one substring can match. For example, we might want to choose the longest match when a greedy operator is used.
An idea of how this could be done?
Perhaps there is already a known solution that has not been addressed in my research.
I found an article that proposes an algorithm that is wrong or that I did not understand.
There is the Henry Spencer Hybrid Regex Engine who manage to mix Thompson's NFA (sometimes called a DFA engine) with some backtracking (sometimes called NFA engine). The only documentation I could find about this is the README file of posgres for a top-level view, and Russ Cox on comp.compilers which explains that sub-mapping extraction has a $ O (n ^ 4) $ worst case. And I still have no information on how back references are supported or how they interact with the simple parts of the regex.