> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre-1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?
The editor ed was in Unix from day 1. For the necessarily tiny
implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.
Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.
"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.
This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to
search calendar files for dates in rather free formats for today
and all days through the next working day. Here's an example
(egrep interprets newline as "|"):
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)
Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.
Doug
This topic is still primarily UNIX but is getting near the edge of COFF, so
I'll CC there if people want to follow up.
As I mentioned to Will, during the time Research was doing the work and
putting out their 'editions', the 'releases' were a bit more ephemeral -
really a set of bits (binary and, hopefully, matching source, but maybe not
always) that captured a point in time. With the 4th (and I think 5th)
Editions it was the state of a disk pack when the bits were copied, but by
the 6th edition, as Noel points out, there was a 'master tape' that the
first site at an institution received upon execution of a signed license,
and the people at each institution (MIT, Purdue, CMU, Harvard) then passed
those bits around internally.
What is more, as Noel pointed out, we all passed source code and binaries
between each other, so the DNA was fairly mixed up [sorry Larry - it really
was 'Open Source' between the licensees]. Sadly, it means that some things
which actually originated at one location, on one system, are sometimes
credited to some other place whose >>wide<< release was in USG or BSD
[think Jim Kulp's job control, which ended up in the kernel and csh(1) as
part of 4BSD; our recent discussions on the list about more/pg/less; the
different networking changes from MIT/UofI/Rand; Goble's FS fixes to make
the thing more crash resilient; the early Harvard ar changes - *a.k.a.*
newar(1), which became ar(1); CMU's fsck; *etc.*].
Eventually, the AT&T Unix Support Group (USG) was stood up in Summit, as I
understand it, originally for the Operating Companies as they wanted to use
UNIX (but not for the licenses, originally). Steve Johnson moved from
Research over there and can tell you many more of the specifics.
Eventually (*i.e.* post-Judge Green), distribution to the world moved from
MH's Research and the Patent Licensing teams to USG and AT&T North Carolina
business folks.
That said, when the distribution of UNIX moved to USG in Summit, things
started to get a bit more formal. But there were still differences inside, as we have
tried to unravel. PWB/TS and eventually System x. FWIW, BSD went
through the same thing. The first BSD's are really the binary state of the
world on the Cory 11/70, later 'Ernie.' By the time CSRG gets stood
up because their official job (like USG) is to support Unix for DARPA, Sam
and company are acting a bit more like traditional SW firms with alpha/beta
releases and a more formal build process. Note that 2.X never really
went through that, so we are all witnessing the wonderful efforts to try to
rebuild early 2.X BSD, and see that the ephemeral nature of the bits has
become more obvious.
As a side story ... the fact is that even for professional SW houses, it
was not as pure as it should have been. To be honest, knowing the players and
processes involved, I highly doubt DEC could rebuild early editions of VMS,
particularly since the 'source control' system was a physical flag in
Cutler's office.
The fact is that the problem of which bits were used to make what other
bits was widespread enough throughout the industry that in the mid-to-late
80s, when Masscomp won the bid to build the system that NASA used to control
the space shuttle post-Challenger, a clause of the contract was that we put
an archive of the bits running on the build machine ('Yeti'), a copy of
the prints, and even microcode/PAL versions, so that Ford Aerospace (the
prime contractor) could rebuild the exact system we used to build the
binaries for them if we went bankrupt. I actually had a duplicate of
Yeti as my home system ('Xorn') in my basement when I made some money for a
couple of years as a contract/on-call person for them every time the
shuttle flew.
Anyway - the point is that documentation and actual bits not being 100% in
sync is nothing new. Companies work hard to try to keep it together, but
different projects work at different speeds. In fact, the 'train release'
model is what people usually fall into. You schedule a release of
some piece of SW, and anything that goes with it has to be on the train or
it must wait for the next one. So developers and marketing people in firms
argue over what gets to be the 'engine' [hint: often it's HW releases, which
are a terrible idea, but that's a topic for COFF].
> From: Warner Losh
> 8 I think was the limit.
IIRC, you could use longer names than that (in C), but external references
only used the first 7 characters (in C - C symbols had a leading '_' tacked
on; I used to know why), or 8 (in assembler).
> Could that cause this error?
Seems unlikely - see below.
> The error comes from lookloc. This is called for external type
> relocations. It searches the local symbol table for something that
> matches the relocation entry. This error happens when it can't find
> it...
Someone who actually looked at the source:
https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/ld.c
instead of just guessing. Give that man a star!
I spent a while looking at the code, trying to figure out i) how it works, and
ii) what's going wrong with that message, but I don't have a definitive
answer. The code is not super well commented, so one has to actually
understand what it's doing! :-)
It seems to my initial perusal that it maintains two symbol tables, one for
globals (which accumulates as each file is processed), and one for locals
(which is discarded/reset for each file). As Warner mentioned, the message
appears when a local symbol referenced in the relocation information in the
current file can't be found (in the local symbol table).
It's not, I think, simply due to too many local symbols in an input file -
there seems to be a check for that as it's reading the input file symbol
table:
    if (lp >= &local[NSYMPR])
        error(1, "Local symbol overflow");
    *lp++ = symno;
    *lp++ = sp;
although of course there could be a bug which breaks this check. It seems to
me that this is an 'impossible' error, one which can only happen due to i) a
bug in the loader (a fencepost error, or something), or ii) an error in the
input a.out file.
I don't want to spend more time on it, since I'm not sure if you've managed to
bypass the problem. If not, let me know, and we'll track it down. (This may
involve you adding some printf's so we have more info about the details.)
Noel
I finally munged lbforth.c (https://gist.github.com/lbruder/10007431) into
compiling cleanly on mostly-stock v7 with the system compiler (lbforth
itself does fine on 211BSD, but it needs a little help to build in a real
K&R environment).
Which would be nice, except that when it gets to the linker....
$ cc -o 4th forth.c
ld:forth.o: Local symbol botch
WTF?
How do I begin to debug this?
Adam
> From: Will Senn
> it finally clicked that it is just one (of many) bit buckets out there
> with the moniker v6. ... I am coming from a world where OS version
> floppy/cd/dvd images are copies of a single master ... These tape things
> could be snapshots of the systems they originate from at very different
> times and with different software/sources etc.
Well, sort of. Do remember that everyone with V6 had to have a license, which
at that point you could _only_ get from Western Electric. So every
_institution_ (which is not the same as every _machine_) had to have had
dealings with them. However, once your institution was 'in the club', stuff
just got passed around.
E.g. the BBN V6 system with TCP/IP:
https://www.tuhs.org/cgi-bin/utree.pl?file=BBN-V6
I got that by driving over to BBN, and talking to Jack Haverty, and he gave us
a tape (or maybe a pack, I don't recall exactly). But we had a V6 license, so
we could do that.
But my particular machine, it worked just the way you described: we got our V6
from the other V6 machine in the Tech Sq building (RTS/DSSR), including not
only Bell post-V6 'leakage' like adb, but their local hacks (e.g. their TTY
driver, and the ttymod() system call to get to its extended features; the
ability to suspend selected applications; yadda, yadda). We never saw a V6
tape.
Noel
> From: Will Senn
> I don't think adb was in v6, where the fcreat function and buf struct
> are used... Were Maranzano and Bourne using some kind of hybrid 6+ system?
In addition to the point about skew between the released and internal development systems,
it's worth remembering just how long it was between the V6 and V7 releases, and
how much ground was covered technically during that period.
A lot of that stuff leaked out: we've talked about the upgraded 'Typesetter C'
(and compilers), which a lot of people had, and the V6+ system at MIT
(partially sort of PWB1) had both 'adb' and the stdio library. The latter also
made its way to lots of places; in my 'improved V6 page':
http://www.chiappa.net/~jnc/tech/ImprovingV6.html
it talks about finding the standard I/O stuff in several later V6 repositories,
including a UNSW tape. But it and typesetter C, also on the Shoppa pack, were
clearly quite widespread.
Noel
I've done research on this, but I'm confused and would appreciate some
help to understand what's going on. In the 7th edition manual, vol 2,
there's an ADB tutorial (pp. 323-336). In the tutorial, the authors,
Maranzano and Bourne, walk the reader through a debugging session. The
first example is predicated on a buffer overflow bug and the code includes:
struct buf {
    int fildes;
    int nleft;
    char *nextp;
    char buff[512];
} bb;

struct buf *obuf;
...
if ((fcreat(argv[1], obuf)) < 0) {
...
Well, this isn't v7 code. As discussed in the v7 manual vol 1 (p. VII):
Standard I/O. The old fopen, getc, putc complex and the old -lp package
are both dead, and even getchar has changed. All have been replaced by
the clean, highly efficient, stdio(3) package. The first things to know
are that getchar(3) returns the integer EOF (-1), which is not a
possible byte value, on end of file, that 518-byte buffers are out, and
that there is a defined FILE data type.
The buffers are out, fcreat is gone, etc. So, what's up with this? I
don't think adb was in v6, where the fcreat function and buf struct are
used... Were Maranzano and Bourne using some kind of hybrid 6+ system?
Thanks,
Will
--
GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
OK. I'm plowing through a lot of issues with the putative 2.11BSD
reconstructions I've done to date. I keep finding things dated too new to
be right.
And it turns out that a few patches "snuck in" when the patch 80 catch-up
was done. I've outlined the ones I've found so far at
https://bsdimp.blogspot.com/2020/08/missing-211bsd-patches.html but I'm
sure there's at least one more. There was much ambiguity over /usr/new and
/usr/local that led to some of these, but others look like they were in
the master tree but never formally published, and they have all the
hallmarks of legit bug fixes...
I've also detailed the issues in going backwards. 2.11BSDpl195 had a
different .o format than 2.11BSDpl0. And to make matters worse, its
assembler couldn't assemble the assembler from the initial release, so I
had to get creative (using apout, thanks to all who contributed to that!).
I've also blogged about how to walk back a binary change when the old
programs no longer build on the new system. I think I got lucky that it was
possible at all :).
https://bsdimp.blogspot.com/2020/08/bootstrapping-211bsd-no-patches-from.ht…
has the blow by blow. There are a lot of steps to building even a normal
system... Let alone walking through the minefield of errors that you need
to do when stepping back...
And neither of these even begins to get into the issues with the build
system itself requiring workarounds for that...
But anyway, I keep making "ur2.11BSD" tapes, installing them and fixing the
issues I find... While much information was destroyed in the process,
there's a surprising amount of redundancy that one can use to 'test'
putative tapes.
Warner
P.S. ur2.11BSD is from urFOO in linguistics, meaning the original FOO
that's been lost and about which some amount of reconstruction / speculation
is offered. Still looking for a good name for the reconstructed
2.11BSD release....
Probably the same as others do: when I'm implementing some 'trace'
messages in a new program, or one just 'under investigation', I try to
make sure the message has a nice format, so as to be able to process a few
megabytes of logfile easily.
Cheers,
uncle rubl
John Gilmore:
Yes -- but [Bell Labs'] administration was anything but egalitarian or
meritocratic. I know someone who had immense trouble getting inside the
door at the Labs because "all" he had was a bachelor's degree. Let
their character be judged by how they treated a stranger.
Sign me proud to have succeeded in life with no degrees at all,
====
That was where local management came in.
I have no degrees at all. I haven't been nearly as
successful in many ways as John, but I was recruited
and hired by 1127. That I had no degree meant I was
initially hired as a `special technical assistant'
rather than a `member of technical staff,' but my
department head and director and executive director
(the last was the legendary Vic Vyssotsky) worked
tirelessly on my behalf, without my pushing them at
all, to get me upgraded, and succeeded after I'd been
there about a year. It was only later that I realized
just how much work they'd done on my behalf.
The upgrade gave me a big raise in pay, but I was
young enough and nerdy enough not to notice.
Within the 1127 culture there was no perceptible
difference; it was very much an egalitarian culture.
I felt respected as an equal from the start (really
from the day and a half I spent interviewing there).
Not every part of the Labs, let alone AT&T, was like
that, especially outside of the Research area. I
didn't realize it initially but that was one of the
ways I benefited from the success of UNIX (that 1127's
and 112's management could push past such bureaucratic
barriers).
After all, Ken never had more than an MS.
Norman Wilson
Toronto ON