Thanks, Grant and contributors in
this thread,
Great thread on RE's. I bought and read
the book (it's on the floor over there
in the corner and I'm not getting up).
My task was finding dates in binary
and text files. It turns out RE's work just
fine for that. Because I was looking at
both text files and binary files, I
wrote my stuff using 8-bit python
"bytes" rather than python "str" text, which
is Unicode in python 3, not 7-bit. (I use
python because it works on
Linux, Macs and Windows and reduces the
number of RE implementations I have
to deal with to 1.)
I finished my first round of the
program in the late fall of 2022. Then
I put it down and now I am
revisiting it. I was creating:
A Python program to search for
media files (pictures and movies)
and copy them to another
directory tree, copying only the
unique ones (deduplication), and
renaming each with
*YYYY-MM-DD-*
as a prefix.
Here is a list of observations from my
programming.
1. RE's are quite unreadable. I defined
a lot of python variables and simply
added them together in python to make
a larger byte string (see below).
The resulting
expressions were shorter on screen
and more readable. Furthermore,
I could construct them incrementally.
I insist on readable code
because I frequently put things down
for a month or more. A while back
it was a sad day when I restarted
something and simply had to throw it
away, moaning, "What was that
programmer thinking?".
Here is an example RE for
YYYY-MM-DD
# FR = front BA = back
# ymdt is text version
ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile( ymdt )
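A minimal sketch of how the pieces above might be defined. The post does not show the definitions of FRSEP, Y_, SEP, M_, D_ or BASEP, so the ones here are assumptions, as is the helper name find_dates. The front and back delimiters are written as lookarounds so that a date at the very start or very end of the data still matches.

```python
import re

# Hypothetical building blocks; the original post does not show these.
Y_ = rb"(?P<year>(?:19|20)[0-9]{2})"        # four-digit year, 1900-2099
M_ = rb"(?P<month>0[1-9]|1[0-2])"           # month 01-12
D_ = rb"(?P<day>0[1-9]|[12][0-9]|3[01])"    # day 01-31
SEP = rb"-"                                  # internal separator
FRSEP = rb"(?<![0-9])"                       # front: not preceded by a digit
BASEP = rb"(?![0-9])"                        # back: not followed by a digit

# ymdt is the text (bytes) version, ymdc the compiled version
ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile(ymdt)


def find_dates(data: bytes):
    """Return a list of (offset, date) pairs found in data."""
    return [(m.start(), m.group(0).decode("ascii")) for m in ymdc.finditer(data)]
```

Because the lookarounds consume no bytes, find_dates(b"2023-03-06") matches at offset 0 with nothing before or after the date. The pattern checks only the shape of the date, so something like 2023-02-30 still matches.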
1a. I also had a time defining
delimiters. There are delimiters
for the beginning, delimiters
for internal separation,
and delimiters for the end.
The significant thing is that the RE
has to match even when the date is the very
first string in the file or the
very last. That also complicates
buffered reading immensely, because a date
can straddle two buffers. Hence, I wrote
the whole program by reading the
file into a single python variable.
However, when files became much
larger than memory, python simply
ground to a halt, as did my Windows
machine. I then rewrote it using a
memory mapped file (for all files)
and the problem was fixed.
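A rough sketch of the memory-mapped approach, not the actual program. The function name dates_in_file and the pattern DATE_RE are assumptions made for illustration. Python's re module accepts an mmap object wherever it accepts bytes, so the file is searched in place without being read into a variable.

```python
import mmap
import os
import re

# Hypothetical YYYY-MM-DD pattern over bytes, not the post's actual RE.
DATE_RE = re.compile(
    rb"(?<![0-9])(?:19|20)[0-9]{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?![0-9])"
)


def dates_in_file(path):
    """Return (offset, date) pairs for every date found in the file."""
    # An empty file cannot be memory mapped, so handle it up front.
    if os.path.getsize(path) == 0:
        return []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The operating system pages the file in as the search proceeds.
            return [
                (m.start(), m.group(0).decode("ascii"))
                for m in DATE_RE.finditer(mm)
            ]
```

The whole file looks like one bytes-like object, so there are no buffer boundaries for a date to fall across.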
2. Dates are formatted in a number of
ways. I chose exactly one
format to learn about RE's
and how to construct them and use
them. Even the book didn't elaborate
everything. I could not find
detailed documentation on some of
the interfaces in the book.
On a whim, I asked ChatGPT
to write a python module that returns
a list of offsets and dates in a file.
Surprisingly, it wrote one that was
quite credible. It had bugs but it
knew more about how to use the various
functional interfaces in RE's than I
did.
3. Testing an RE is maybe even more
difficult than writing one. I have
not given any serious effort to
verification testing yet.
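One possible starting point for such testing is a table of inputs paired with the matches they should produce, including inputs that must not match. This is only a sketch; DATE_RE, CASES and run_cases are hypothetical names, and the pattern is an assumed YYYY-MM-DD RE rather than the one from my program.

```python
import re

# Hypothetical pattern under test.
DATE_RE = re.compile(
    rb"(?<![0-9])(?:19|20)[0-9]{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?![0-9])"
)

# Each case is (input bytes, list of expected matches).
CASES = [
    (b"2023-03-06", [b"2023-03-06"]),               # date is the whole input
    (b"2023-03-06 trailing", [b"2023-03-06"]),      # date is the very first string
    (b"leading 2023-03-06", [b"2023-03-06"]),       # date is the very last string
    (b"\x00\xff2023-03-06\xff", [b"2023-03-06"]),   # surrounded by binary bytes
    (b"2023-03-06,2022-12-31", [b"2023-03-06", b"2022-12-31"]),
    (b"12023-03-06", []),                           # extra leading digit
    (b"2023-03-061", []),                           # extra trailing digit
    (b"2023-13-06", []),                            # month out of range
    (b"2023-00-06", []),                            # month zero
    (b"2023-3-6", []),                              # missing zero padding
]


def run_cases(pattern, cases):
    """Return a list of (input, expected, got) for every failing case."""
    failures = []
    for data, expected in cases:
        got = [m.group(0) for m in pattern.finditer(data)]
        if got != expected:
            failures.append((data, expected, got))
    return failures
```

An empty list from run_cases(DATE_RE, CASES) means every case behaved as expected. The negative cases matter as much as the positive ones, since an RE that is too permissive passes every should-match test.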
I would like to extend my program to
any date format. That would require
a much bigger RE. I have been led to
believe that a 50Kbyte or 500Kbyte
RE works just as well (if not
as fast) as a 100 byte RE. I think
with parentheses and
pipe-symbols suitably used,
one could match
Monday, March 6, 2023
2023-03-06
Mar 6, 2023
or
...
I'm just guessing, though. This
thread has been very informative.
I have much to read.
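Here is a small sketch of that guess for the three formats listed above, built up from byte-string pieces and joined with parentheses and pipe symbols. All of the names (WEEKDAY, MON, DAY, YEAR, ISO, LONG, ANY_DATE, find_any_dates) are made up for illustration, and a pattern this small says nothing about how a 50Kbyte RE would perform.

```python
import re

# Hypothetical pieces; each is a bytes fragment that gets added together.
WEEKDAY = rb"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MON = (rb"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?"
       rb"|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?"
       rb"|Dec(?:ember)?)")
DAY = rb"(?:0?[1-9]|[12][0-9]|3[01])"
YEAR = rb"(?:19|20)[0-9]{2}"

# 2023-03-06
ISO = YEAR + rb"-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])"
# "Mar 6, 2023" or "Monday, March 6, 2023" (the weekday is optional)
LONG = rb"(?:" + WEEKDAY + rb", )?" + MON + rb" " + DAY + rb", " + YEAR

# One named group per format, joined with a pipe symbol.
ANY_DATE = re.compile(
    rb"(?<![0-9A-Za-z])(?:(?P<iso>" + ISO + rb")|(?P<named>" + LONG + rb"))(?![0-9A-Za-z])"
)


def find_any_dates(data: bytes):
    """Return (offset, format name, matched bytes) for each date found."""
    return [(m.start(), m.lastgroup, m.group(0)) for m in ANY_DATE.finditer(data)]
```

The named groups report which alternative matched, so each new format becomes one more piece added to the alternation.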
Thank all of you.
Ed Bradford
Pflugerville, TX
On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <coff(a)tuhs.org> wrote:
Hi,
I'd like some thoughts ~> input on extended regular expressions used
with grep, specifically GNU grep -E / egrep.
What are the pros / cons to creating extended regular expressions like
the following:
^\w{3}
vs:
^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
Or:
[ :[:digit:]]{11}
vs:
( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
(0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
I'm currently eliding the 61st (60) second, the 32nd day, and dealing
with February having fewer days for simplicity.
For matching patterns like the following in log files?
Mar 2 03:23:38
I'm working on organically training logcheck to match known good log
entries. So I'm *DEEP* in the bowels of extended regular expressions
(GNU egrep) that runs over all logs hourly. As such, I'm interested in
making sure that my REs are both efficient and accurate or at least not
WILDLY badly structured. The pedantic part of me wants to avoid
wildcard type matches (\w), even if they are bounded (\w{3}), unless it
truly is for unpredictable text.
I'd appreciate any feedback and recommendations from people who have
been using and / or optimizing (extended) regular expressions for longer
than I have been using them.
Thank you for your time and input.
--
Grant. . . .
unix || die
--
Advice is judged by results, not by intentions.
Cicero