On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff(a)tuhs.org> wrote:
I'd like some thoughts ~> input on extended
regular expressions used
with grep, specifically GNU grep -e / egrep.
What are the pros / cons to creating extended regular expressions like
the following:
^\w{3}
vs:
^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
Well, obviously the former matches any sequence 3 of
alpha-numerics/underscores at the beginning of a string, while the
latter only matches abbreviations of months in the western calendar;
that is, the two REs are matching very different things (the latter is
a strict subset of the former). But I suspect you mean in a more
general sense.
Or:
[ :[:digit:]]{11}
...do you really want to match a space, a colon and a single digit 11
times in a single string?
vs:
( 1| 2| 3| 4| 5| 6| 7| 8|
9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
(0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
Using character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to (untested) snippet:
( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
For this, I'd probably eschew `[:digit:]`. Named character classes are
for handy locale support, or in lieu of typing every character in the
alphabet (though we can use ranges to abbreviate that), but it kind of
seems like that's not coming into play here and, IMHO, `[0-9]` is
clearer in context.
I'm currently eliding the 61st (60) second, the
32nd day, and dealing
with February having fewer days for simplicity.
It's not clear to me that dates, in their generality, can be matched
with regular expressions. Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.
For matching patterns like the following in log
files?
Mar 2 03:23:38
I'm working on organically training logcheck to match known good log
entries. So I'm *DEEP* in the bowels of extended regular expressions
(GNU egrep) that runs over all logs hourly. As such, I'm interested in
making sure that my REs are both efficient and accurate or at least not
WILDLY badly structured. The pedantic part of me wants to avoid
wildcard type matches (\w), even if they are bounded (\w{3}), unless it
truly is for unpredictable text.
`\w` is a GNU extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).
The thing about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata are linear in the length of the input.
However, construction of a DFA for any given regular expression can be
superlinear (in fact, it can be exponential) so practically speaking,
we usually construct non-deterministic finite automata (NDFAs) and
"simulate" their execution for matching. NDFAs generalize DFAs (DFAs
are a subset of NDFAs, incidentally) in that, in any non-terminal
state, there can be multiple subsequent states that the machine can
transition to given an input symbol. When executed, for any state, the
simulator will transition to every permissible subsequent state
simultaneously, discarding impossible states as they become evident.
This implies that NDFA execution is superlinear, but it is bounded,
and is O(n*m*e), where n is the length of the input, m is the number
of nodes in the state transition graph corresponding to the NDFA, and
e is the maximum number of edges leaving any node in that graph (for a
fully connected graph, that would m, so this can be up to O(n*m^2)).
Construction of an NDFA is O(m), so while it's slower to execute, it's
actually possible to construct in a reasonable amount of time. Russ's
excellent series of articles that Clem linked to gives details and
algorithms.
I'd appreciate any feedback and recommendations
from people who have
been using and / or optimizing (extended) regular expressions for longer
than I have been using them.
Thank you for your time and input.
In practical terms? Basically, don't worry about it too much. Egrep
will generate an NDFA simulation that's going to be acceptably fast
for all but the weirdest cases.
- Dan C.