On 3/2/23 8:04 PM, Dan Cross wrote:
I guess what I'm saying is, match what you want to match and don't sweat
the small stuff.
ACK
Not exactly. :-)
What I understand you to mean, based on this and the rest of your note,
is that you want to find a good division point between overly specific,
complex REs and simpler, easy to understand REs that are less specific.
The danger with the latter is that they may match things you don't
intend, while the former are harder to maintain and (arguably) more
brittle. I can sympathize.
You got it.
For the purposes of grep/egrep, that'll be a logical "line" of text,
terminated by a newline, though the newline itself isn't considered part
of the text for matching. I believe the `-z` option can be used to set a
NUL byte as the "line" terminator; presumably this lets one match
strings with embedded newlines, though I haven't tried.
Fair enough. That's also sort of what I thought might be the case.
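To make the "line" idea concrete, here is a rough Python analogy (not how grep is implemented). The helper name `matching_records` and the sample data are made up for illustration. It splits the input on a terminator, drops the terminator, and matches each record separately; switching the terminator to NUL lets one record contain a newline.

```python
import re

def matching_records(data, pattern, terminator="\n"):
    """Return the records in `data` that contain a match for `pattern`.

    Records are split on `terminator`, and the terminator itself is not
    part of the text that gets matched.
    """
    regex = re.compile(pattern)
    records = data.split(terminator)
    # A trailing terminator leaves one empty string at the end; drop it.
    if records and records[-1] == "":
        records.pop()
    return [r for r in records if regex.search(r)]

# Hypothetical inputs: newline-terminated and NUL-terminated records.
newline_data = "first line\nsecond line\n"
nul_data = "first\nrecord\0second record\0"

print(matching_records(newline_data, r"line$"))
print(matching_records(nul_data, r"first\nrecord", terminator="\0"))
```

With NUL as the terminator, the first record keeps its embedded newline, so a pattern spanning that newline can match it.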
"string" in this context is the input
you're attempting to match against. `egrep` will attempt to match your
pattern against each "line" of text it reads from the files it's
searching. That is, each line in your log file(s).
*nod*
But consider what `[ :[:digit:]]{11}` means: you've got a character
class consisting of space, colon and a digit; {11} means "match any of
the characters in that class exactly 11 times" (as opposed to other
variations on the '{}' syntax that say "at least m times", "at most n
times", or "between n and m times").
Yep, I'm well aware of that.
But that'll match all sorts of things that don't look like 'dd hh:mm:ss':
That's one of the reasons that I'm interested in coming up with a more
precise regular expression ... without being overly complex.
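As a sketch of where a middle ground might sit, the Python below compares the loose class with a somewhat stricter shape. Both patterns are translations I'm assuming here: Python's `re` has no POSIX bracket classes, so `[[:digit:]]` becomes `0-9`, and the `STRICT` pattern is only one possible tightening, not something from logcheck. The timestamp sample is made up.

```python
import re

# Python's `re` has no POSIX bracket classes, so [[:digit:]] is spelled 0-9.
LOOSE = re.compile(r"[ :0-9]{11}")

# Assumed stricter shape: a day padded with a space or digit, then hh:mm:ss.
# It still accepts some impossible values (day 39, hour 29), which is the
# price of keeping the expression short.
STRICT = re.compile(r"[ 0-3][0-9] [0-2][0-9]:[0-5][0-9]:[0-5][0-9]")

samples = [
    " 2 20:04:00",  # made-up 'dd hh:mm:ss' value
    "1" * 11,       # eleven digits
    ":" * 11,       # eleven colons
    " " * 11,       # eleven spaces
]

for s in samples:
    print(repr(s), bool(LOOSE.fullmatch(s)), bool(STRICT.fullmatch(s)))
```

The loose class accepts all four samples; the stricter pattern accepts only the first.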
(The first line is my typing; the second is output from egrep except for
the short line of 9 '1's, for which egrep had no output. The last two
lines are matching space characters and egrep echoing the match, but I'm
guessing gmail will eat those.)
Note that there are inputs with more than 11 characters that match; this
is because there is some 11-character substring that matches the RE in
those lines. In any event, I suspect this would generally not be what
you want. But if nothing else in your input can match the RE (which you
might know a priori because of domain knowledge about whatever is
generating those logs) then it's no big deal, even if the RE was capable
of matching more things generally.
Yep.
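The substring behaviour described above can be reproduced in a few lines of Python. This is a sketch under the assumption that `re.search` stands in for grep's per-line matching; the character class is again translated to `0-9`, and the test strings are invented.

```python
import re

# [[:digit:]] is written 0-9 because Python's `re` lacks POSIX bracket classes.
LOOSE = re.compile(r"[ :0-9]{11}")

nine_ones = "1" * 9
eleven_ones = "1" * 11
embedded = "abc" + "1" * 14 + "xyz"

# search() looks for the pattern anywhere in the string, much as grep does
# for each line: nine characters are too few, eleven are enough, and a
# longer run matches because some 11-character substring fits.
print(bool(LOOSE.search(nine_ones)))
print(bool(LOOSE.search(eleven_ones)))
print(bool(LOOSE.search(embedded)))

# fullmatch() demands that the whole string be exactly 11 class characters.
print(bool(LOOSE.fullmatch(embedded)))
```

Anchoring or delimiting the sub-expression is what stops the longer input from matching.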
Here's an example of the full RE:
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from
[._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
larger RE and there is bounding & delimiting around the subpart.
This is to match a standard message from postfix via standard SYSLOG.
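For anyone wanting to experiment outside of egrep, here is a hedged Python translation of the RE above. Several things are assumptions on my part: that the wrapped lines join with single spaces, that `[:alnum:]`, `[:digit:]` and `[:xdigit:]` mean their ASCII ranges (as in the C locale), and that `re.ASCII` is a fair stand-in for how `\w` behaves there. The sample line is constructed purely to fit the pattern (host names, PID and address are hypothetical); it is not copied from a real log.

```python
import re

# Translation of the RE into Python syntax; POSIX classes become ASCII ranges.
FULL = re.compile(
    r"^\w{3} [ :0-9]{11} [._A-Za-z0-9-]+ "
    r"postfix/msa/smtpd\[[0-9]+\]: timeout after STARTTLS from "
    r"[._A-Za-z0-9-]+\[[.:0-9A-Fa-f]+\]$",
    re.ASCII,
)

# Hypothetical line built to fit the pattern, not real Postfix output.
sample = (
    "Mar " + " 2 20:04:00" + " mail.example.net "
    "postfix/msa/smtpd[12345]: timeout after STARTTLS from "
    "client.example.org[192.0.2.10]"
)

# Same line with different message text, which the RE should reject.
other = sample.replace("timeout after", "lost connection after")

print(bool(FULL.search(sample)))
print(bool(FULL.search(other)))
```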
Ah. I suspect this relies on domain knowledge about the format of log
lines to match reliably. Otherwise it could match `___ 123 456:789`,
which is probably not what you are expecting.
Yep.
Though said domain knowledge isn't anything special in and of itself.
Sure. One nice thing about `egrep` et al is that you can put the REs
into a file and include them with `-f`, as opposed to having them all
directly on the command line.
Yep. logcheck makes extensive use of many files like this to do its work.
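The pattern-file idea can be sketched in Python as well. This is only an analogy: the helper names `load_patterns` and `matches_any` are made up, the patterns and lines are invented, and skipping blank lines is a choice made in this sketch rather than a statement about how `-f` treats them.

```python
import re

def load_patterns(text):
    """Compile one regex per non-blank line of a pattern-file's contents."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]

def matches_any(patterns, line):
    """True if at least one pattern matches somewhere in the line."""
    return any(p.search(line) for p in patterns)

# Stand-in for the contents of a pattern file, one RE per line.
pattern_file = "timeout after STARTTLS from \nconnection reset$\n"
patterns = load_patterns(pattern_file)

# Invented input lines.
lines = [
    "alpha: timeout after STARTTLS from somewhere",
    "beta: connection reset",
    "gamma: something unexpected",
]

matched = [l for l in lines if matches_any(patterns, l)]
unmatched = [l for l in lines if not matches_any(patterns, l)]
print(matched)
print(unmatched)
```

Keeping one RE per line makes each rule easy to review and change on its own.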
Typo. :-)
ACKK
That seems reasonable.
Thank you for the logic CRC.
Aside: I found the note on its website amusing:
Brought to you by the UK's best gambling sites! "Only gamble with what
you can afford to lose." Yikes!
Um ... that's concerning.
I'd proceed with caution here; it also seems to be in the FreeBSD and
DragonFly ports collections and Homebrew on the Mac (but so is GNU grep
for all of those).
Fair enough.
My use case is on Linux where GNU egrep is a thing.
Yeah. IMHO `\w` is too general for what you're trying to do.
I think that `\w` is a good primer, but not where I want things to end
up long term.
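One possible next step beyond `\w{3}`, sketched in Python: spell out the month abbreviations. This assumes the three leading characters are English month names, which is my guess at the intent rather than something stated in the thread. `(?:...)` is Python's non-capturing group; in egrep a plain `(...)` would play the same role. The counterexample string is the one from earlier in this message; the plausible timestamp is made up.

```python
import re

TIMESTAMP = r"[ :0-9]{11}"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# re.ASCII keeps \w to letters, digits and underscore.
WORD_PREFIX = re.compile(r"^\w{3} " + TIMESTAMP, re.ASCII)

# Assumed tightening: only the twelve English month abbreviations.
MONTH_PREFIX = re.compile(r"^(?:" + MONTHS + r") " + TIMESTAMP)

counterexample = "___ 123 456:789"
plausible = "Mar " + " 2 20:04:00"  # made-up timestamp

for text in (counterexample, plausible):
    print(repr(text),
          bool(WORD_PREFIX.search(text)),
          bool(MONTH_PREFIX.search(text)))
```

The `\w{3}` prefix accepts both strings, while the month list rejects the counterexample.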
Basically, a regular expression is a regular expression if you can build
a machine with no additional memory that can tell you whether or not a
given string matches the RE examining its input one character at a time.
I /think/ that I could build a complex nested tree of switch statements
to test each character to see if things match what they should or not.
Though I would need at least one variable / memory to hold absolutely
minimal state to know where I am in the switch tree. I think a number
to identify the switch statement in question would be sufficient. So
I'm guessing two bytes of variable and uncounted bytes of program code.
I think that's about right.
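The "one number to say where I am" idea can be sketched as a tiny hand-written state machine. The example below is illustrative only: it recognizes just the `hh:mm:ss` shape (two digits, colon, two digits, colon, two digits) as a whole string, and the names `EXPECTED`, `DEAD`, `char_class` and `matches_time` are invented. The only state carried between characters is a single integer.

```python
def char_class(ch):
    """Classify one character as 'digit', 'colon' or 'other'."""
    if ch in "0123456789":
        return "digit"
    if ch == ":":
        return "colon"
    return "other"

# State i means "i characters of hh:mm:ss consumed so far".
# EXPECTED[i] is the only character class that may come next.
EXPECTED = ["digit", "digit", "colon",
            "digit", "digit", "colon",
            "digit", "digit"]
ACCEPT = len(EXPECTED)
DEAD = -1  # reached after any character that cannot lead to a match

def matches_time(text):
    """Return True if the whole string has the shape dd:dd:dd."""
    state = 0
    for ch in text:
        if state == DEAD or state == ACCEPT:
            # Already failed, or extra input after a complete match.
            state = DEAD
        elif char_class(ch) == EXPECTED[state]:
            state += 1
        else:
            state = DEAD
    return state == ACCEPT

for candidate in ("20:04:00", "20:04:0", "20:04:000", "2a:04:00"):
    print(repr(candidate), matches_time(candidate))
```

Each input character is examined exactly once, and nothing other than `state` is remembered between characters.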
Thank you again Dan.
Sure thing!
:-)
--
Grant. . . .
unix || die