During my search for a way to effectively integrate back references (using backtracking) into a text-driven regex engine, I found an article entitled Extending Automata Finis for effectively match Perl compatible regular expressions that seems to solve this problem. (The PDF can be found on your favorite science hub.)
Here is what I understand about the proposed algorithm.
- It is based on Thompson's NFA algorithm.
- It associates with each active state a list of sub-string sets. Each set contains the strings that correspond to a particular pair of parentheses. There is a set by capture parentheses.
- When you encounter a backward reference, a letter of each word in the set of substrings is deleted.
- Items in the set whose first letter does not match are completely removed.
- When an element of the set is empty, the transition titled
1is really crossed.
If I'm not mistaken, this algorithm would correspond
(a | ab) (bc | c) 1 2 with the string
abc ab bc, which is obviously wrong. I think this algorithm does not take into account the fact that sub-matches interact with each other.
Moreover, it seems to be a polynomial algorithm. Although it is known that matching rational expressions with back references is NP-hard.
Have I forgotten something and the algorithm is correct? Or is the proposed algorithm really wrong?
Is there a way to solve this problem without storing an exponential combination of sub-matches?