Thank you for the very useful comments. However,

I disagree with you about the RE language. While

I agree that RE experts don't need that, when I was

hiring and gave some software to a new hire (whether

an experienced programmer or a recent college grad)

simply handing over huge RE's to my new hire was

a daunting task for that person. I wrote that stuff

that way to help remind me and anyone who might

use the python program.

I don't claim success. It does help me.

When you say '{1}' is redundant, I think I did that

to avoid any possibility of conflicts with the

next string that is

concatenated to the Y_ (e.g. '*' or '+' or '{4,7}').

I am embarrassed I did not communicate that

in the code. I had to think about it for a couple of hours

before I recalled the "why". I will fix that.

  (it would be difficult to discuss

   this RE if I had to write

"(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ")"

rather than just Y_).

My initial thoughts

on naming were that I wanted the definition to be defined

in exactly one place in the software.

Python and the BTL folks told me to never

use a constant in code. Always name it.

Hence, I gave it a name. Each name might

be used in multiple places. They might be imported.

You are correct, the expression is unbalanced. I tried

to remove the text2bytes(lastYearRE) call so the expression in

this email was all text. I failed to remove the trailing ) when

I removed the call to text2bytes(). My hasty transcriptions

might have produced similar errors in my email.

Recall, my focus was on any file of any size.

I'm on Windows 10 and an m1 MacBook.

Python works on both. I don't have

a Linux machine or enough desktop space to

host one. I'm also mildly fed-up with

virtual machines.

Friedl taught me one thing. Most

RE implementations are different. I'm trying

to write a program that I could give

to anyone and that could reliably find a date (an RE) in

any file. YYYY, MM, DD, HR, MI, SE, TH are words

my user could use in the command line or in

an options dialog. LAT and LON might also be

possibilities. CST, EST, MST, PST, ... also.

A 500 gigabyte archive or directory/folder

of pictures and movies would be

a great test target.

I very much appreciate your comments. If this

discussion is boring to others, I would be happy

to take it to emails.

I like your program. My experience

with RE, grep, python, and sed suggests that

anything but gnu grep and sed might not work due to the

different implementations.

I've been out of the Unix software business

for 30 years after starting work at BTL in the 1970s

and working on Version 6. I didn't know "printf" was now

built into bash! That was a surprise. It's an incremental

improvement, but doesn't compare with f-strings in python.

The interactive interpreter for python should have

a "bash" mode?!

Does grep use a memory mapped file for its search, thereby

avoiding all buffering boundaries? That too, would

be new information to me. The additional complexity

of dealing with buffering is more than annoying.
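
Setting aside what grep does internally, Python's mmap module is one way to avoid buffer boundaries: a bytes RE can scan a memory-mapped file as though it were a single bytes object. A minimal sketch, assuming a non-empty file; the name find_dates_mmap and the simplified fixed-separator pattern are made up for illustration and are not the program discussed here.

```python
import mmap
import re

# Simplified illustrative pattern: YYYY-MM-DD with a fixed '-' separator.
DATE_RE = re.compile(
    rb"(?:19\d\d|20\d\d)-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])"
)

def find_dates_mmap(path):
    """Return (offset, matched_bytes) for every date-like match in the file."""
    with open(path, "rb") as f:
        # mmap raises ValueError on an empty file, so this assumes size > 0.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The RE engine reads straight from the mapping; the operating
            # system pages the file in as needed, so no manual buffering.
            return [(m.start(), m.group()) for m in DATE_RE.finditer(mm)]
```

Because the whole file is one searchable object, a match can never be split across two reads.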

Do you have any thoughts on how to verify

a program that uses RE's? I've given it no thought

until now. My first thought for dates would be

to write a separate module that simply searched

through the file looking for 4 numbers in a row

without using RE's, recording the offsets and 16 characters

after and 1 character before in a python list of (offset, str)

tuples, ddddList, and using ddddList

as a proxy for the entire file. I could then

aim my RE's at ddddList. [A list of tuples in python

is wonderful!] It seems to me '*' and '+' and {x,y} are the performance

hogs in RE's. My RE's avoid them. One pass, I think, should

suffice. What do you think? I haven't "archived" my 350 GB

of pictures and movies, but one pass over all files therein

ought to suffice, right? Two different programs that use different

algorithms should be pretty good proof of correctness, wouldn't

you think?
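
The RE-free pass described above might look roughly like this. It is only a sketch: the function name build_dddd_list is hypothetical, "16 characters after" is assumed to mean 16 bytes past the four digits, and every offset where four ASCII digits start is recorded, so a run of five digits yields two entries.

```python
def build_dddd_list(data, before=1, after=16):
    """Collect (offset, context) for every run of four ASCII digits in data.

    offset is where the four digits start; context is the slice running
    from `before` bytes ahead of them to `after` bytes past them.
    """
    digits = b"0123456789"
    dddd_list = []
    for i in range(len(data) - 3):
        # Iterating over a bytes slice yields ints; `int in bytes` is valid.
        if all(b in digits for b in data[i:i + 4]):
            start = max(0, i - before)
            dddd_list.append((i, data[start:i + 4 + after]))
    return dddd_list
```

A byte-at-a-time loop in pure Python is likely to be slow over hundreds of gigabytes, so this seems better suited to cross-checking the RE version on sample files than to a full pass over an archive.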

My RE's have no stars or pluses. If there is a mismatch before

a match, give up and move on.

On my Windows 10 machine, I have cygwin.

Microsoft says my CPU doesn't have a TPM and

the specific Intel Core I7 on my system is not

supported so Windows 11 is not happening.

Microsoft is DOS personified.

(An unkind editorial remark about the low

quality of software coming from Microsoft.)

Anyway, I thank you again for your patience with me

and your observations. I value your views and the

other views I've seen here on coff@tuhs.org.

I welcome all input to my education and will share

all I have done so far with anyone who wants to

collaborate, test, or is just curious.

GOAL: run a python program from an at-cost thumb drive that:

- Reaps all media files from a user specified directory/folder tree
  and adds files to the thumb drive. "Adds files" means:
    - Original file system is untouched.
    - Adds only unique files (hash codes are unique).
    - Creates on the thumb drive a relative directory
      wherein the original file was found.
    - Prepends a "YYYY-MM-DD-" string to the filename
      if one can be found (EXIF is a great shortcut).
    - Copies
        srcroot/relative_path/oldfilename
      to
        thumbdrive/relative_path/YYYY-MM-DD-oldfilename
      or
        thumbdrive/relative_path/0000-oldfilename.
- Can also incrementally add new files by just scanning anywhere
  in any other computer file system or any other computer.
- Must work on Mac, Windows, and Linux.
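
The copy step in the goal above could be sketched along these lines. This is not the working prototype: the names file_digest and reap, the choice of SHA-256, and the date_prefix hook are all assumptions made for illustration, the "0000" fallback is assumed to mean "no date found", and the media-file filter is left out.

```python
import hashlib
import shutil
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def reap(src_root, dest_root, seen, date_prefix=lambda p: None):
    """Copy files under src_root into dest_root, skipping known content.

    seen is a set of digests already on the destination, updated in place.
    date_prefix is a hypothetical hook returning 'YYYY-MM-DD' or None.
    Returns the list of destination paths written.
    """
    src_root, dest_root = Path(src_root), Path(dest_root)
    written = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        digest = file_digest(src)
        if digest in seen:
            continue  # identical content was already copied
        seen.add(digest)
        prefix = date_prefix(src) or "0000"  # assumed fallback prefix
        rel = src.relative_to(src_root)
        dest = dest_root / rel.parent / f"{prefix}-{src.name}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # the source tree is only read
        written.append(dest)
    return written
```

Keeping the set of digests between runs is what would make incremental adds from another machine cheap. The sketch does not handle two different files that map to the same destination name.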

What I have is a working prototype. It works

on Mac and Windows. It doesn't do the

date thing very well, and there are other shortcomings.

I have delivered exactly one Christmas present to my favorite person

in the world - a 400 GB SSD drive with all the pictures and media

we have ever taken. The next things are to add more media

and re-unique-ify (check) what is already present on the SSD drive

and improve the proper choice of "YYYY-MM-DD-" prefix to

filenames.

I am retired and this is fun.

I'm too old to want to get rich.

Ed Bradford

Pflugerville, TX

egbegb2@gmail.com

On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy <ralph@inputplus.co.uk> wrote:

Hi Ed,

> I have made an attempt to make my RE stuff readable and supportable.

Readable to you, which is fine because you're the prime future reader.
But it's less readable than the regexp to those that know and read them
because of the indirection introduced by the variables. You've created
your own little language of CAPITALS rather than the lingua franca of
regexps. :-)

> Machine language was unreadable and then along came assembly language.
> Assembly language was unreadable, then came higher level languages.

Each time the original language was readable because practitioners had
to read and write it. When its replacement came along, the old skill
was no longer learnt and the language became ‘unreadable’.

> So far, I can do that for this RE program that works for small files,
> large files, binary files and text files for exactly one pattern:
> YYYY[-MM-DD]
> I constructed this RE with code like this:
> # ymdt is YYYY-MM-DD RE in text.
> # looking only for 1900s and 2000s years and no later than today.
> _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"

‘{1}’ is redundant.

> # months
> _MM = "(0[1-9]|1[012])"
> # days
> _DD = "(0[1-9]|[12]\d|3[01])"
> ymdt = _YYYY + '[' + _INTERNALSEP +
>    _MM +
>    _INTERNALSEP +
>    ']'{0,1)

I think we're missing something as the ‘'['’ is starting a character
class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
matching brackets and is outside the string.

BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.

> For the whole file, RE I used
> ymdthf = _FRSEP + ymdt + _BASEP
> where FRSEP is front separator which includes
> a bunch of possible separators, excluding numbers and letters, or-ed
> with the up arrow "beginning of line" RE mark.

It sounds like you're wanting a word boundary; something provided by
regexps. In Python, it's ‘\b’.

>>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
(<re.Match object; span=(16, 19), match='foo'>,)

Are you aware of the /x modifier to a regexp which ignores internal
whitespace, including linefeeds? This allows a large regexp to be split
over lines. There's a comment syntax too. See
https://docs.python.org/3/library/re.html#re.X
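
As a rough illustration of re.X, a YYYY[-MM-DD] pattern could be laid out over several lines with comments. This is a sketch rather than anyone's actual program: the names DATE_RE and find_dates, the hard-coded 2023 upper bound, and the choice of separators are assumptions, and it is a bytes pattern so it can be aimed at binary data.

```python
import re

DATE_RE = re.compile(rb"""
    \b
    (19\d\d | 20[01]\d | 202[0-3])   # year: 1900 through 2023
    (?:
        ([-:._])                     # separator, captured as group 2
        (0[1-9] | 1[0-2])            # month
        \2                           # the same separator again
        (0[1-9] | [12]\d | 3[01])    # day
    )?                               # the month and day are optional
    \b
""", re.VERBOSE)

def find_dates(data):
    """Return (offset, matched_bytes) for each date-like match in data."""
    return [(m.start(), m.group()) for m in DATE_RE.finditer(data)]
```

The backreference \2 insists that both separators agree, so a string such as 2000-01.01 only yields the bare year.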

GNU grep isn't too shabby at looking through binary files. I can't use
/x with grep so in a bash script, I'd do it manually. \< and \> match
the start and end of a word, a bit like Python's \b.

re='
.?\<
(19[0-9][0-9]|20[01][0-9]|202[0-3])
(
([-:._])
(0[1-9]|1[0-2])
\3
(0[1-9]|[12][0-9]|3[01])
)?
\>.?
'
re=${re//$'\n'/}
re=${re// /}

printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file
LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l

which gives

0:2001-04-01,$
11:1999_12_31$
22:1944.03.01,$
33:1914!$
39:2000-$

showing:

- the byte offset within the file of each match,
- along with any byte before and after, if it's not a \n and not
already matched, just to show the word-boundary at work,
- with any non-printables escaped into octal by sed.

> I thought I was on the COFF mailing list.

I'm sending this to just the list.

> I received this email by direct mail to from Larry.

Perhaps your account on the list is configured to not send you an email
if it sees your address in the header's fields.

--
Cheers, Ralph.