On Tue, Mar 7, 2023 at 12:34 PM Ralph Corderoy <ralph(a)inputplus.co.uk> wrote:
> > I'm sure that with enough effort, it _is_ possible to make a 50K RE
> > that is understandable to mere mortals. But it begs the question:
> > why bother?
> It could be the quickest way to express the intent.
Ok, I challenge you to find me anything for which the quickest way to
express the intent is a 50 *thousand* symbol regular expression. :-)
> > The answer to that question, in my mind, shows the difference
> > between a clever programmer and a pragmatic engineer.
> I think those two can overlap. :-)
Indeed they can. But I have grave doubts that this is a good example
of such overlap.
> > I submit that it's time to reach for another tool well before you
> > get to an RE that big.
> Why, if the grammar is type three in Chomsky's hierarchy, i.e. a
> regular grammar? I think sticking with code aimed at regular grammars,
> or more likely regexps, will do better than, say, a parser generator
> for a type-two context-free grammar.
Is there an extant, non-theoretical, example of such a grammar?
> As well as the lex(1) family, there's Ragel as another example.
>
> http://www.colm.net/open-source/ragel/
This is moving the goal posts more than a bit. I'm suggesting that a
50k-symbol RE is unlikely to be the best solution to any reasonable
problem. A state-machine generator, even one with 50k statements, is
not a 50k RE.
> > 3. Optimizing regular expressions. You're still bound by the known
> > theoretical properties of finite automata here.
> Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is
> matched by some engines by first checking the last five bytes are
> ‘.jpeg’.
...in general, in order to find the end, won't _something_ have to
traverse the entire input?
(Note that I said, "in general". Allusions to mmap'ed files or seeking
to the end of a file are not general, since they don't apply well to
important categories of input sources, such as pipes or network
connections.)
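To make the point concrete, here is a minimal sketch of why an end-anchored match over a non-seekable source must consume the whole stream. The helper name and the buffer-everything strategy are illustrative only, not any engine's actual internals; a real engine would match incrementally rather than buffer.

```python
import io
import re

# Sketch: deciding an end-anchored match against a non-seekable source
# (a pipe or a socket) requires reading to EOF; a "check the last five
# bytes first" shortcut only exists once the bytes are in hand.
# The function name and approach are made up for illustration.
def matches_streamed(stream, pattern, chunk=4096):
    data = bytearray()
    while True:                 # must read to EOF; no seeking to the end
        block = stream.read(chunk)
        if not block:
            break
        data += block
    return re.fullmatch(pattern, bytes(data).decode()) is not None
```

For example, `matches_streamed(io.BytesIO(b"123.jpeg"), r"[0-9]+\.jpeg")` is true, but only after every byte of the "stream" has been read.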
> $ debugperl -Dr -e \
>     '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
> ...
> Matching REx "^[0-9]+\.jpeg$" against
> "123546789012354678901235467890123546789012.jpg"
> Intuit: trying to determine minimum start position...
> doing 'check' fbm scan, [1..46] gave -1
> Did not find floating substr ".jpeg"$...
> Match rejected by optimizer
> Freeing REx: "^[0-9]+\.jpeg$"
> $
> Boyer-Moore string searching can be used. Common-subregexp-elimination
> can spot repetitive fragments of a regexp and factor them into a
> single set of states, along with pairing the route into them with the
> appropriate route out.
Well, that's what big-O notation accounts for.
I'm afraid none of this really changes the time bounds, however, when
applied in general.
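For concreteness, the literal-substring scan Ralph mentions can be done with Boyer-Moore; the sketch below uses Horspool's simplified variant (the function name is invented for this example). Note that on a mismatch it still examines a byte near the end of the window, so it skips comparisons, not the traversal itself.

```python
# Boyer-Moore-Horspool substring search: the kind of literal-substring
# scan a regexp optimizer can run before invoking the full automaton.
# Hypothetical helper, not any particular engine's API.
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Bad-character table: how far to slide the window on a mismatch,
    # keyed by the byte aligned with the needle's last position.
    shift = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        i += shift.get(haystack[i + m - 1], m)
    return -1
```

For instance, `horspool_find(b"123.jpg", b".jpeg")` returns -1 after inspecting only a few bytes, which is essentially what the `fbm scan` line in the perl trace above reports.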
> The more regexp engines are optimised, the more benefit to the
> programmer from sticking to a regexp rather than, say, ad hoc parsing.
This is comparing apples and oranges. There may be all sorts of
heuristics that we can apply to specific regular expressions to prune
the search space, and that's great. But by their very nature,
heuristics are not always generally applicable.
As an analogy, we know that we cannot solve _the_ halting problem, but
we also know that we can solve _many_ halting problem_s_. For example,
a compiler can recognize that any of `for(;;);`, `while(1);`, or
`loop {}` does not halt, and so on, ad nauseam; but even if some oracle
can recognize arbitrarily many such halting problems, we still haven't
solved the general problem.
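The analogy can be made concrete with a toy checker that recognizes exactly those three non-halting programs by pattern. The patterns and the function name are made up for this sketch; it decides a fixed, finite family of halting problems and nothing more.

```python
import re

# A toy "oracle" for a few specific halting problems: it flags three
# known infinite loops by syntactic pattern, and says nothing about any
# other program. The pattern list is illustrative, not exhaustive.
TRIVIAL_LOOPS = [
    re.compile(r"^\s*for\s*\(\s*;\s*;\s*\)\s*;\s*$"),   # for(;;);
    re.compile(r"^\s*while\s*\(\s*1\s*\)\s*;\s*$"),     # while(1);
    re.compile(r"^\s*loop\s*\{\s*\}\s*$"),              # loop {}
]

def obviously_loops_forever(src: str) -> bool:
    """True only if src matches one of a fixed list of known infinite loops."""
    return any(p.match(src) for p in TRIVIAL_LOOPS)
```

A `False` here means "I don't know", not "it halts", which is exactly the gap between solving many halting problems and solving the general one.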
> The theory of REs is interesting and important, but regexps deviate
> from it ever more.
Yup. My post should not be construed as suggesting that regexps are
not useful, or that they should not be a part of a programmer's
toolkit. My post _should_ be construed as a suggestion that they are
not always the best solution, and a part of being an engineer is
finding that dividing line.
- Dan C.