On Thu, Mar 2, 2023 at 8:06 PM Grant Taylor via COFF <coff(a)tuhs.org> wrote:
On 3/2/23 2:53 PM, Dan Cross wrote:
Well, obviously the former matches any sequence 3
of
alpha-numerics/underscores at the beginning of a string, while the
latter only matches abbreviations of months in the western calendar;
that is, the two REs are matching very different things (the latter
is a strict subset of the former).
I completely agree with you. That's also why I'm wanting to start
utilizing the latter, more specific RE. But I don't know where the line
of over complicating things is to avoid crossing it.
I guess what I'm saying is, match what you want to match and don't sweat
the small stuff.
But I suspect
you mean in a more general sense.
Yes and no. Does the comment above clarify at all?
Not exactly. :-)
What I understand you to mean, based on this and the rest of your note, is
that you want to find a good division point between overly specific,
complex REs and simpler, easy to understand REs that are less specific. The
danger with the latter is that they may match things you don't intend,
while the former are harder to maintain and (arguably) more brittle. I can
sympathize.
...do you
really want to match a space, a colon and a single digit
11 times ...
Yes.
... in a single string?
What constitutes a single string? ;-) I sort of rhetorically ask.
For the purposes of grep/egrep, that'll be a logical "line" of text,
terminated by a newline, though the newline itself isn't considered part of
the text for matching. I believe the `-z` option can be used to set a NUL
byte as the "line" terminator; presumably this lets one match strings with
embedded newlines, though I haven't tried.
The log lines start with
MMM dd hh:mm:ss
Where:
- MMM is the month abbreviation
- dd is the day of the month
- hh is the hour of the day
- mm is the minute of the hour
- ss is the second of the minute
So, yes, there are eleven characters that fall into the class consisting
of a space or a colon or a number.
Is that a single string? It depends what you're looking at, the
sequences of non white space in the log? No. The patter that I'm
matching ya.
"string" in this context is the input you're attempting to match against.
`egrep` will attempt to match your pattern against each "line" of text it
reads from the files its searching. That is, each line in your log file(s).
But consider what `[ :[:digit:]]{11}` means: you've got a character class
consisting of space, colon and a digit; {11} means "match any of the
characters in that class exactly 11 times" (as opposed to other variations
on the '{}' syntax that say "at least m times", "at most n
times", or
"between n and m times"). But that'll match all sorts of things that
don't
look like 'dd hh:mm:ss':
term% egrep '[ :[:digit:]]{11}'
11111111111
11111111111
111111111
1111111111111
1111111111111
::::::::::::::::
::::::::::::::::
aaaa bbbbb
aaaa bbbbb
term%
(The first line is my typing; the second is output from egrep except for
the short line of 9 '1's, for which egrep had no output. That last two
lines are matching space characters and egrep echoing the match, but I'm
guessing gmail will eat those.)
Note that there are inputs with more than 11 characters that match; this is
because there is some 11-character substring that matches the RE in those
lines. In any event, I suspect this would generally not be what you want.
But if nothing else in your input can match the RE (which you might know a
priori because of domain knowledge about whatever is generating those logs)
then it's no big deal, even if the RE was capable of matching more things
generally.
Using
character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to (untested) snippet:
Agreed.
I'm starting with the examples that came with; "^\w{3} [
:[:digit:]]{11}", the logcheck package that I'm working with and
evaluating what I want to do.
Ah. I suspect this relies on domain knowledge about the format of log lines
to match reliably. Otherwise it could match, `___ 123 456:789` which is
probably not what you are expecting.
I actually like the idea of dividing out the
following:
- months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
- months that have 30 days: Apr, Jun, Sep, Nov
- month that have 28/29 days: Feb
Sure. One nice thing about `egrep` et al is that you can put the REs into
a file and include them with `-f`, as opposed to having them all directly
on the command line.
(
[1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
Aside: Why do you have the double square brackets in "[12][[0-9]]"?
Typo. :-)
For this,
I'd probably eschew `[:digit:]`. Named character classes
are for handy locale support, or in lieu of typing every character
in the alphabet (though we can use ranges to abbreviate that), but
it kind of seems like that's not coming into play here and, IMHO,
`[0-9]` is clearer in context.
ACK
"[[:digit:]]+" was a construct that I'm parroting. It and
[.:[:xdigit:]]+ are good for some things. But they definitely aren't
the best for all things.
Hence trying to find the line of being more accurate without going too
far.
It's not clear to me that dates, in their
generality, can be
matched with regular expressions. Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.
Given the context that these extended regular expressions are going to
be used in, logcheck -- filtering out known okay log entries to email
what doesn't get filtered -- I'm okay with having a few things slip
through like leap day / leap seconds / leap frogs.
That seems reasonable.
`\w` is a GNU
extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).
I hear, understand, and acknowledge your concern. At present, these
filters are being used in a package; logcheck,
Aside: I found the note on it's website amusing: Brought to you by the UK's
best gambling sites! "Only gamble with what you can afford to lose." Yikes!
which I believe is
specific to Debian and ilk. As such, GNU grep is very much a thing.
I'd proceed with caution here; it also seems to be in the FreeBSD and
DragonFly ports collections and Homebrew on the Mac (but so is GNU grep for
all of those).
I'm also not a fan of the use of `\w` and would
prefer to (...|...)
things.
Yeah. IMHO `\w` is too general for what you're trying to do.
The thing
about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata are linear in the length of the input.
However, construction of a DFA for any given regular expression can be
superlinear (in fact, it can be exponential) so practically speaking,
we usually construct non-deterministic finite automata (NDFAs) and
"simulate" their execution for matching. NDFAs generalize DFAs (DFAs
are a subset of NDFAs, incidentally) in that, in any non-terminal
state, there can be multiple subsequent states that the machine can
transition to given an input symbol. When executed, for any state,
the simulator will transition to every permissible subsequent state
simultaneously, discarding impossible states as they become evident.
This implies that NDFA execution is superlinear, but it is bounded,
and is O(n*m*e), where n is the length of the input, m is the number
of nodes in the state transition graph corresponding to the NDFA, and
e is the maximum number of edges leaving any node in that graph (for
a fully connected graph, that would m, so this can be up to O(n*m^2)).
Construction of an NDFA is O(m), so while it's slower to execute, it's
actually possible to construct in a reasonable amount of time. Russ's
excellent series of articles that Clem linked to gives details and
algorithms.
I only vaguely understand those three paragraphs as they are deeper
computer science than I've gone before.
I think I get the gist of them but could not explain them if my life
depended upon it.
Basically, a regular expression is a regular expression if you can build a
machine with no additional memory that can tell you whether or not a given
string matches the RE examining its input one character at a time.
In practical
terms? Basically, don't worry about it too much. Egrep
will generate an NDFA simulation that's going to be acceptably fast
for all but the weirdest cases.
ACK
It sounds like I can make any reasonable extended regular expression a
human can read and I'll probably be good.
I think that's about right.
Thank you for the detailed response Dan. :-)
Sure thing!
- Dan C.