[COFF] Re: Requesting thoughts on extended regular expressions in grep.

3 Mar 2023

On 3/2/23 2:53 PM, Dan Cross wrote:
...
  Well, obviously the former matches any sequence 3 of
 alpha-numerics/underscores at the beginning of a string, while the
 latter only matches abbreviations of months in the western calendar;
 that is, the two REs are matching very different things (the latter
 is a strict subset of the former). 
I completely agree with you.  That's also why I'm wanting to start
utilizing the latter, more specific RE.  But I don't know where the line
of over complicating things is to avoid crossing it.
...
  But I suspect you mean in a more general sense. 
Yes and no.  Does the comment above clarify at all?
...
  ...do you really want to match a space, a colon and a
single digit
 11 times ... 
Yes.
...
  ... in a single string? 
What constitutes a single string?  ;-)  I sort of rhetorically ask.
The log lines start with
MMM dd hh:mm:ss
Where:
  - MMM is the month abbreviation
  - dd  is the day of the month
  - hh  is the hour of the day
  - mm  is the minute of the hour
  - ss  is the second of the minute
So, yes, there are eleven characters that fall into the class consisting
of a space or a colon or a number.
Is that a single string?  It depends what you're looking at, the
sequences of non white space in the log? No.  The patter that I'm
matching ya.
...
  Using character classes would greatly simplify what
you're trying to
 do. It seems like this could be simplified to (untested) snippet: 
Agreed.
I'm starting with the examples that came with; "^\w{3} [
:[:digit:]]{11}", the logcheck package that I'm working with and
evaluating what I want to do.
I actually like the idea of dividing out the following:
  - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
  - months that have 30 days: Apr, Jun, Sep, Nov
  - month that have 28/29 days: Feb
...
  ( [1-9]|[12][[0-9]]|3[01])
[0-2][0-9]:[0-5][0-9]:[0-5][0-9] 
Aside:  Why do you have the double square brackets in "[12][[0-9]]"?
...
  For this, I'd probably eschew `[:digit:]`. Named
character classes
 are for handy locale support, or in lieu of typing every character
 in the alphabet (though we can use ranges to abbreviate that), but
 it kind of seems like that's not coming into play here and, IMHO,
 `[0-9]` is clearer in context. 
ACK
"[[:digit:]]+" was a construct that I'm parroting.  It and
[.:[:xdigit:]]+ are good for some things.  But they definitely aren't
the best for all things.
Hence trying to find the line of being more accurate without going too far.
...
  It's not clear to me that dates, in their
generality, can be
 matched with regular expressions.  Consider leap years; you'd almost
 necessarily have to use backtracking for that, but I admit I haven't
 thought it through. 
Given the context that these extended regular expressions are going to
be used in, logcheck -- filtering out known okay log entries to email
what doesn't get filtered -- I'm okay with having a few things slip
through like leap day / leap seconds / leap frogs.
...
  `\w` is a GNU extension; I'd probably avoid it on
portability grounds
 (though `\b` is very handy). 
I hear, understand, and acknowledge your concern.  At present, these
filters are being used in a package; logcheck, which I believe is
specific to Debian and ilk.  As such, GNU grep is very much a thing.
I'm also not a fan of the use of `\w` and would prefer to (...|...) things.
...
  The thing about regular expressions is that they
describe regular
 languages, and regular languages are those for which there exists a
 finite automaton that can recognize the language. An important class
 of finite automata are deterministic finite automata; by definition,
 recognition by such automata are linear in the length of the input.
 However, construction of a DFA for any given regular expression can be
 superlinear (in fact, it can be exponential) so practically speaking,
 we usually construct non-deterministic finite automata (NDFAs) and
 "simulate" their execution for matching. NDFAs generalize DFAs (DFAs
 are a subset of NDFAs, incidentally) in that, in any non-terminal
 state, there can be multiple subsequent states that the machine can
 transition to given an input symbol. When executed, for any state,
 the simulator will transition to every permissible subsequent state
 simultaneously, discarding impossible states as they become evident.
 This implies that NDFA execution is superlinear, but it is bounded,
 and is O(n*m*e), where n is the length of the input, m is the number
 of nodes in the state transition graph corresponding to the NDFA, and
 e is the maximum number of edges leaving any node in that graph (for
 a fully connected graph, that would m, so this can be up to O(n*m^2)).
 Construction of an NDFA is O(m), so while it's slower to execute, it's
 actually possible to construct in a reasonable amount of time. Russ's
 excellent series of articles that Clem linked to gives details and
 algorithms. 
I only vaguely understand those three paragraphs as they are deeper
computer science than I've gone before.
I think I get the gist of them but could not explain them if my life
depended upon it.
...
  In practical terms? Basically, don't worry about
it too much. Egrep
 will generate an NDFA simulation that's going to be acceptably fast
 for all but the weirdest cases. 
ACK
It sounds like I can make any reasonable extended regular expression a
human can read and I'll probably be good.
Thank you for the detailed response Dan.  :-)
--
Grant. . . .
unix || die

2025

2024

2023

2022

2021

2020

2019

2018

[COFF] Re: Requesting thoughts on extended regular expressions in grep.