[COFF] Re: Requesting thoughts on extended regular expressions in grep.

7 Mar 2023

On 3/7/23 4:39 AM, Ralph Corderoy wrote:
...
  Readable to you, which is fine because you're the prime future
  reader.  But it's less readable than the regexp to those that know
  and read them because of the indirection introduced by the variables.
  You've created your own little language of CAPITALS rather than the
  lingua franca of regexps.  :-)
I want to agree, but then I run into things like this:
    ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix(/smtps)?/smtpd\[[[:digit:]]+\]: disconnect from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]( helo=[[:digit:]]+(/[[:digit:]]+)?)?( ehlo=[[:digit:]]+(/[[:digit:]]+)?)?( starttls=[[:digit:]]+(/[[:digit:]]+)?)?( auth=[[:digit:]]+(/[[:digit:]]+)?)?( mail=[[:digit:]]+(/[[:digit:]]+)?)?( rcpt=[[:digit:]]+(/[[:digit:]]+)?)?( data=[[:digit:]]+(/[[:digit:]]+)?)?( bdat=[[:digit:]]+(/[[:digit:]]+)?)?( rset=[[:digit:]]+(/[[:digit:]]+)?)?( noop=[[:digit:]]+(/[[:digit:]]+)?)?( quit=[[:digit:]]+(/[[:digit:]]+)?)?( unknown=[[:digit:]]+(/[[:digit:]]+)?)?( commands=[[:digit:]]+(/[[:digit:]]+)?)?$
Which is produced by this m4:
    define(`DAEMONPID', `$1\[DIGITS\]:')dnl
    define(`DATE', `\w{3} [ :[:digit:]]{11}')dnl
    define(`DIGIT', `[[:digit:]]')dnl
    define(`DIGITS', `DIGIT+')dnl
    define(`HOST', `[._[:alnum:]-]+')dnl
    define(`HOSTIP', `HOST\[IP\]')dnl
    define(`IP', `[.:[:xdigit:]]+')dnl
    define(`VERB', `( $1=DIGITS`'(/DIGITS)?)?')dnl
    ^DATE HOST DAEMONPID(`postfix(/smtps)?/smtpd') disconnect from HOSTIP`'VERB(`helo')VERB(`ehlo')VERB(`starttls')VERB(`auth')VERB(`mail')VERB(`rcpt')VERB(`data')VERB(`bdat')VERB(`rset')VERB(`noop')VERB(`quit')VERB(`unknown')VERB(`commands')$
I only consider myself to be an /adequate/ m4 user.  Though I've done
some things that are arguably creating new languages.
I personally find the generated regular expression to be onerous to read
and understand, much less modify.  I would be highly dependent on my
editor's (vim's) parenthesis / square bracket matching (%) capability
and / or would need to explode the RE into multiple components on
multiple lines to have a hope of accurately understanding or modifying it.
Conversely I think that the m4 is /largely/ find and replace with a
little syntactic sugar around the definitions.
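To make the find-and-replace point concrete, here is a minimal Python sketch of that expansion.  It is not how m4 works internally; it only substitutes whole-word macro names repeatedly until nothing changes.  The `expand` and `verb` helpers are hypothetical names introduced for illustration, and the macro bodies are copied from the m4 definitions above.

```python
import re

# Parameterless macros, copied from the m4 definitions in this message.
MACROS = {
    "DATE": r"\w{3} [ :[:digit:]]{11}",
    "DIGIT": "[[:digit:]]",
    "DIGITS": "DIGIT+",
    "HOST": "[._[:alnum:]-]+",
    "HOSTIP": r"HOST\[IP\]",
    "IP": "[.:[:xdigit:]]+",
}

# Match macro names only as whole words, so DIGIT does not fire inside DIGITS.
MACRO_NAME = re.compile(r"\b(" + "|".join(MACROS) + r")\b")


def expand(text):
    """Replace macro names with their bodies until the text stops changing."""
    while True:
        expanded = MACRO_NAME.sub(lambda m: MACROS[m.group(1)], text)
        if expanded == text:
            return expanded
        text = expanded


def verb(name):
    """Stand-in for the one-argument VERB macro (hypothetical helper)."""
    return expand(f"( {name}=DIGITS(/DIGITS)?)?")


if __name__ == "__main__":
    print(expand("DIGITS"))
    print(expand("HOSTIP"))
    print(verb("helo"))
```

`verb("helo")` yields the same `( helo=[[:digit:]]+(/[[:digit:]]+)?)?` fragment that appears in the generated expression.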
I also think that anyone that does understand regular expressions and
the concept of find & replace is likely to be able to recognize the
patterns: that "VERB(...)" corresponds to "( $1=DIGITS`'(/DIGITS)?)?",
that "DIGITS" corresponds to "DIGIT+", and that "DIGIT" corresponds to
"[[:digit:]]".
There seems to be a point between simple REs w/o any supporting
constructor and complex REs with supporting constructor where I think it
is better to have the constructors.  Especially when duplication comes
into play.
If nothing else, the constructors are likely to reduce one-off typo
errors.  A typo will either show up everywhere the constructor is used,
or be fixed everywhere at the same time.  Conversely, finding an
unmatched parenthesis or square bracket in the RE above will be annoying
at best, and more likely daunting.
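As a rough illustration of the single-point-of-change argument, the same composition can be sketched in Python.  This is an assumption-laden translation rather than the m4 above: Python's `re` module does not support POSIX bracket classes such as `[[:digit:]]`, so explicit ASCII ranges stand in for them, and the sample log line is made up for the example.

```python
import re

# Building blocks; ASCII ranges stand in for the POSIX classes (assumption).
DIGIT = "[0-9]"                  # [[:digit:]]
DIGITS = DIGIT + "+"
DATE = r"\w{3} [ :0-9]{11}"      # \w{3} [ :[:digit:]]{11}
HOST = "[._A-Za-z0-9-]+"         # [._[:alnum:]-]+
IP = "[.:0-9A-Fa-f]+"            # [.:[:xdigit:]]+
HOSTIP = HOST + r"\[" + IP + r"\]"

VERBS = ["helo", "ehlo", "starttls", "auth", "mail", "rcpt", "data",
         "bdat", "rset", "noop", "quit", "unknown", "commands"]


def daemonpid(daemon):
    """Daemon name followed by its bracketed PID and a colon."""
    return daemon + r"\[" + DIGITS + r"\]:"


def verb(name):
    """Optional ' name=N' or ' name=N/M' counter."""
    return f"( {name}={DIGITS}(/{DIGITS})?)?"


PATTERN = re.compile(
    "^" + DATE + " " + HOST + " " + daemonpid("postfix(/smtps)?/smtpd")
    + " disconnect from " + HOSTIP
    + "".join(verb(v) for v in VERBS) + "$"
)

# Hypothetical log line, invented for this sketch.
SAMPLE = ("Mar  7 04:39:12 mail postfix/smtpd[12345]: disconnect from "
          "example.net[192.0.2.1] ehlo=1 auth=0/1 commands=1/2")

if __name__ == "__main__":
    print(bool(PATTERN.match(SAMPLE)))
```

A mistake in `verb` would show up in all thirteen counters at once, and one fix would repair all of them.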
...
  Each time the original language was readable because practitioners
  had to read and write it.  When its replacement came along, the old
  skill was no longer learnt and the language became ‘unreadable’.
I feel like there is an analogy here to machine code vs assembly
language, as well as to assembly language vs higher level languages.
My understanding is that the computer industry has largely agreed that
the higher level language is easier to understand and maintain.
...
  ‘{1}’ is redundant. 
That may very well be.  But what will be more maintainable / easier to
correct in the future; adding `{2}` when necessary or changing the value
of `1` to `2`?
I think this is an example of a tradeoff: including something that is
not strictly required in order to make things more maintainable down the
road.  Sort of like fleet vehicles vs non-fleet vehicles.
...
  BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
I think this is another example of the same maintainability tradeoff.
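For what it's worth, both spellings match the same strings, so the choice is purely about readability and maintenance.  A small check, assuming Python's `re` module as a stand-in for `grep -E` (the `{1}`, `{0,1}`, and `?` quantifiers mean the same thing in both); the `same_matches` helper and the sample strings are hypothetical.

```python
import re


def same_matches(pattern_a, pattern_b, samples):
    """True if both patterns accept and reject exactly the same samples."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )


SAMPLES = ["", "7", "77", "7/7", "7/", "/7", "7/7/7"]

# '{1}' is redundant: an atom followed by {1} matches exactly one occurrence.
ONE_EXPLICIT, ONE_IMPLICIT = r"[0-9]{1}", r"[0-9]"

# '{0,1}' and '?' both mean "zero or one occurrence".
OPT_BRACES, OPT_QMARK = r"[0-9](/[0-9]){0,1}", r"[0-9](/[0-9])?"

if __name__ == "__main__":
    print(same_matches(ONE_EXPLICIT, ONE_IMPLICIT, SAMPLES))
    print(same_matches(OPT_BRACES, OPT_QMARK, SAMPLES))
```

With the explicit form, a later change is an edit to the number inside the braces; with the implicit form, the braces have to be added first.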
...
  I'm sending this to just the list. 
I'm also replying to only the COFF mailing list.
...
  Perhaps your account on the list is configured to not send you an
  email if it sees your address in the header's fields.
There is a reasonable chance that the COFF mailing list and / or your
account therein is configured to minimize duplicates, meaning the COFF
mailing list won't send you a copy if it sees your subscribed address as
receiving a copy directly.
I personally always prefer the mailing list copy and shun the direct
copies.  I think that the copy from the mailing list keeps the
discussion on the mailing list and avoids accidental replies bypassing
the mailing list.
--
Grant. . . .
unix || die
