[not CCing Matt because his address didn't come through to the list]
Hi Matt,
At 2022-07-09T08:58:10+1000, Warren Toomey via TUHS wrote:
----- Forwarded message from Matt Gilmore -----
Subject: Documents for UNIX Collections
Good afternoon everyone, my name is Matt Gilmore, and I recently
worked with some folks here to help facilitate the scanning and
release of the "Documents for UNIX" package as well as a few odds and
ends pertinent to UNIX/TS 4.0. I've been researching pretty heavily
the history of published memoranda and how they ultimately became the
formal documents that Western Electric first published with UNIX/TS
5.0 and System V. Think the User's Guide, Graphics Guide, etc.
That's excellent work--thank you for doing it!
One of the projects I'm working on (slowly) is comparing these
documents with the 4.0 docs I scanned for Arnold and making edits to
the *ROFF sources with the hopes I could then use them to produce 1:1
clean copies of the 4.0 docs, while providing an easy means for
diff'ing the documents as well (to flush out changes between 3.0 and
4.0).
Are you using groff to do your rendering? If so, please consider me a
resource; I've been the most active groff developer for the past 4
years. (I am, however, not the release manager--we're feeling heavily
pregnant with groff 1.23, 3.5 years in the making.)
Some of the following issues may be familiar to you; I apologize if I
wear a rut in well-trodden ground here.
I am wondering what you mean by "1:1 clean copies". I embarked on a
similar exercise only about a week ago with the Kernighan & Cherry
document "Typesetting Mathematics -- User's Guide (Second Edition)",
which was part of Volume 2 of the V7 Unix Programmer's Manual.
In the course of that effort I learned several things. I identified
(and fixed) bugs in groff's ms(7) implementation, and to my surprise
also discovered one in, apparently, V7 troff that caused an equation at
the bottom of a column to go missing. Because groff was independently
developed, the equation sprang back to life in its rendering. You can
find a narrative of my experiences at the following thread, along with
commentary from others.
https://lists.gnu.org/archive/html/groff/2022-07/msg00000.html
Pixel-perfect matching of C/A/T (or APS-5, etc.) output will be
impossible because the fonts are different. More than that, the font
_metrics_ are different, which means lines will not always fill the same
when comparing historical typesetter output and a modern
implementation's. This will be true even if you use Heirloom Doctools
Troff, which is descended from V7 Unix but has seen many changes over
the years: first Kernighan's revision for device-independence
ca. 1980, then many changes for the commercial Documenter's Workbench
product, and then many more by Gunnar Ritter and his successors in the
Heirloom project.
Beyond that, Unix troff and groff use different hyphenation systems. I
don't know how stable Unix troff's was over time.
All of that said, with the Kernighan and Cherry document, by spending
just a few minutes eyeballing old scans and groff PostScript output,
flicking between two fullscreen viewers like an ersatz blink comparator,
and using binary search to tweak the ms(7) LL, PO, and MINGW registers,
I was able to _almost_ perfectly match column and page breaks between
the two renderings, which was a higher fidelity of reproduction than I
expected. The risen equation noted above was the most dramatic change.
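The binary search over a register value can be sketched as follows. This is only an illustration, not the procedure actually used (which was done by eye). The function name bisect_register and the callback breaks_early are hypothetical; the callback stands in for "render with this register value and report whether the column or page break lands earlier than in the scan", and the sketch assumes that answer flips exactly once as the value grows.

```python
from typing import Callable


def bisect_register(lo: int, hi: int,
                    breaks_early: Callable[[int], bool]) -> int:
    """Return the smallest value in (lo, hi] for which breaks_early is False.

    Assumptions (hypothetical, for illustration only):
      - values are integers in some fixed unit, e.g. points;
      - breaks_early(lo) is True and breaks_early(hi) is False;
      - the predicate changes from True to False exactly once.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if breaks_early(mid):
            lo = mid   # still breaking too early: the answer is above mid
        else:
            hi = mid   # break is late enough: the answer is mid or below
    return hi


if __name__ == "__main__":
    # Stand-in predicate: pretend any value below 37 breaks too early.
    print(bisect_register(0, 100, lambda value: value < 37))
```

In practice the predicate would be a human comparing two renderings, so each probe is costly; halving the interval keeps the number of comparisons small.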
Encouraged by that experience, I also reset the V7 Unix version of the
article "A System for Typesetting Mathematics". This apparently was
_not_ published in the Programmer's Manual, possibly because much of its
content was duplicated in the user's guide. But the amount of effort
required of me was shockingly low. On the other hand, for this I didn't
have an authentically typeset copy to compare to, so all I did was look
for what I would consider rendering errors as opposed to cosmetic
changes. (Maybe this is the standard you want to apply in your own work?)
I'm attaching a diff.
Another apparent difference arises between V7 Unix eqn and groff eqn; in
eqn input such as "lim from {x-> pi /2} ( tan~x) sup{sin~2x}~=~1", V7
eqn will recognize "->" as beginning a new token and convert it to a
right arrow glyph in the output, despite the manual (as I understand it)
implying that it won't. groff eqn _does_ require token separation in
this case.
I say that differences are "apparent", rather than making the stronger
claim of outright bugs in V7 Unix tools, mainly because I don't have a
cat2dit(1) tool I can run in my V7 Unix environment in SIMH. In my
opinion such a tool (in K&R C, of course) would be well worth having.
Right now, to satisfy myself of V7 Unix troff behavior I have to produce
an octal dump of the typesetter output, pull it out of the emulation
environment with copy-and-paste, undump it with a custom program (xxd is
not helpful), and then give the reconstructed C/A/T stream to an
interpreter written by John Gardner in JavaScript. John's tool (and his
personal assistance) has proven invaluable, but it's a component of a
larger project of his that renders device-independent troff output in a
Web browser window. For this to be practical he has to introduce
additional device-independent troff commands into the output. I'd
prefer something more rabidly puritan (and, if I'm honest, something
written in a more traditional Unix system programming language).
https://github.com/Alhadis/Roff.js/
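The "undump" step on the host side might look like the sketch below. It is not the custom program mentioned above. It assumes the dump was made byte-wise in octal (as with "od -b"), so that each line is an octal offset followed by octal byte values, and that a lone "*" marks rows identical to the previous one. The function name undump is hypothetical.

```python
def undump(od_text: str) -> bytes:
    """Rebuild a byte stream from byte-wise octal dump text.

    Assumed input format (an assumption, not taken from the original
    workflow): each line is an octal offset followed by octal byte
    values; a line containing only "*" means the previous row repeats
    up to the next offset shown.
    """
    out = bytearray()
    last_row = b""
    pending_repeat = False
    for line in od_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "*":
            pending_repeat = True
            continue
        offset = int(fields[0], 8)
        if pending_repeat:
            if not last_row:
                raise ValueError("'*' with no preceding row to repeat")
            # Re-expand the collapsed rows up to the next stated offset.
            while len(out) < offset:
                out += last_row
            del out[offset:]
            pending_repeat = False
        row = bytes(int(field, 8) for field in fields[1:])
        out += row
        if row:
            last_row = row
    return bytes(out)


if __name__ == "__main__":
    sample = "0000000 100 101 102\n0000003\n"
    print(undump(sample))
```

The result can then be written to a file in binary mode and handed to whatever C/A/T interpreter is in use.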
The big advantage of a V7 Unix/PDP-11 cat2dit(1) would be that
device-independent troff output is plain text and much easier to spirit
out of the emulated environment to the host system. Also, some people,
who may be pitied, have taught themselves to read it, making more
observations possible and hypotheses testable within the PDP-11
environment. (In principle, this is also true of C/A/T command streams,
whether raw or octal-encoded, but I'll just let the pity roll downhill.)
Thanks largely to Henry Spencer, the information to write a new
cat2dit(1) from scratch is available. Eventually, if no one else does
so, I will undertake it myself; but my queue is deep (mostly with groff
defect reports and feature requests).
https://github.com/Alhadis/otroff/blob/92683053f9aad5b926fc447843bf2092ad59…
Dan Plassche pointed me toward Adobe Transcript, but my understanding is
that it falls short of my needs in 3 ways: it produces PostScript, which
I can't easily read, not device-independent troff output (which I can);
it's not available in a version ready to run in a modern Unix
environment; and it has a licensing encumbrance. I'd like a cat2dit(1)
we can all trade around libre and gratis.
Alternatively, if someone leaked the troff sources from UNIX/TS 4.0,
that would bring a grin of Jack Nicholsonian proportions to my face.
That should be buildable in vivo on a PDP-11 and would facilitate much
other historical research besides. (With it, someone could annotate a
diff of the troff/nroff source trees between V7 and UNIX/TS 4.0, which I
wager constitutes a highly positive and teachable moment in software
design and engineering.)
Okay, brain dump terminated. Please let me know if I can help.
Regards,
Branden