[TUHS] ML archive

Michael Welle mwe012008 at gmx.net
Sun Jul 3 15:46:14 AEST 2016


Random832 <random832 at fastmail.com> writes:

> On Sat, Jul 2, 2016, at 07:00, Michael Welle wrote:
>> Hello,
>> I want to complete my local ML archive (I deleted a few emails and I
>> wasn't subscribed before 2001 or so I think). After downloading the
>> archives and hitting them a few times to get somewhat importable mboxes,
>> I ended with 8699 emails in a maildir (in theory that should be a
>> superset of the 5027 emails in my regular TUHS maildir. I will merge
>> them next.). Two dozen mails are obviously defective (can be repaired
>> manually maybe) and some more might be defective (needs deeper
>> checking). So, has anybody more ;)?
> Gah, the archive files from before 2002 are a nightmare.
yepp, it gets better with later archive files. In the first run I have

1995-October.txt, 1995-November.txt, 1995-December.txt,
1996-March.txt, 1996-September.txt, 1996-November.txt,
1997-August.txt, 1997-September.txt, 1997-October.txt,
1997-November.txt, 1998-April.txt, 1998-February.txt,
1998-March.txt, 1998-May.txt, 1998-August.txt,
1998-November.txt, 1998-December.txt, 1999-January.txt,
1999-February.txt, 1999-March.txt, 1999-May.txt,
1999-June.txt, 1999-August.txt, 1999-September.txt, 
1999-November.txt, 1999-October.txt, 1999-December.txt,
2000-January.txt, 2000-February.txt, 2000-April.txt,
2000-May.txt, 2000-July.txt, 2000-June.txt,
2000-August.txt, 2000-October.txt, 2001-January.txt,
2001-February.txt, 2001-March.txt, 2001-April.txt,
2001-May.txt, 2002-October.txt

> My best guess on the proper message count is 8675. This is the number of
> blocks of non-blank lines which either start with "From " (the easy
> case) or, for the hard case, meet the following conditions:
> In a file dated 2001 or earlier
> First line contains a colon
> Contains at least one line starting with "Received:"
I started by matching an empty line followed by
'^(Date|From|To|Message-Id|Received):'. That alone is not enough: there
is at least one match of a 'Date:' in a message's body, and there are
cases where people appended emails, including their headers, to their
own emails. So a little more tweaking is needed.
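
For illustration, the counting heuristic quoted above could be sketched
roughly as follows. This is only a sketch, not the script either of us
used: the function names (is_legacy, split_blocks, count_messages) are
made up here, the file names are assumed to follow the YEAR-Month.txt
pattern of the list above, and the September 2001 cutoff is the one
recommended further down in the quoted text. It does not handle the
false positives mentioned above, such as emails appended with their
headers to other emails.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def is_legacy(filename):
    """True for archive files dated September 2001 or earlier.

    Assumes names like '1999-June.txt'.
    """
    year, month = filename.rsplit(".", 1)[0].split("-")
    return (int(year), MONTHS.index(month) + 1) <= (2001, 9)


def split_blocks(text):
    """Split text into blocks of consecutive non-blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def count_messages(text, legacy):
    """Count blocks that look like the start of a message."""
    count = 0
    for block in split_blocks(text):
        if block[0].startswith("From "):
            # Easy case: a regular mbox separator line.
            count += 1
        elif (legacy and ":" in block[0]
              and any(line.startswith("Received:") for line in block)):
            # Hard case: a header block without a separator line.
            count += 1
    return count
```

A call might look like count_messages(text, is_legacy("1999-June.txt")),
where text is the content of that archive file.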

> There might still be a handful of messages this fails to split out. I
> would recommend interpreting files dated October 2001 or later as a
> strict MBOXO archive, and only doing special processing to files dated
> September 2001 or earlier (I haven't factored out my own script to be
> able to do anything other than count messages)
One other thing I haven't made up my mind about is date headers. Some
emails don't have a date header. I feel a bit uneasy about manipulating
the emails and adding a date header. On the other hand, an approximate
date header would allow sorting the emails in the clients and give them
a bit more context. Any opinions on that?
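
One possible compromise, sketched here only as an assumption and not as
something that has been done to the archive: add a Date header derived
from the month of the archive file, and mark the message so the change
can be recognized or undone later. The function name and the
X-Date-Approximated header are hypothetical, and the first day of the
month at 00:00 UTC is an arbitrary choice.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import format_datetime


def add_approximate_date(raw, year, month):
    """Return the parsed message, with an approximate Date if none exists."""
    msg = message_from_string(raw)
    if msg["Date"] is None:
        # Arbitrary choice: first day of the archive file's month, UTC.
        approx = datetime(year, month, 1, tzinfo=timezone.utc)
        msg["Date"] = format_datetime(approx)
        # Hypothetical marker header so the synthetic date stays visible.
        msg["X-Date-Approximated"] = "yes; derived from archive file name"
    return msg
```

Messages that already carry a Date header are returned unchanged, so
only the defective ones are touched.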

