> rules used ... to create British spelling from an American
> English database often leave a lot to be desired.
Among the BUGS listed for spell(1) in v7 was "British spelling was
done by an American".
Nevertheless, at least one British expat thanked me for spell -b.
He had been using the original "spell", and ignoring its reports
of British "misspellings". But, he said, long exposure to American
writing had infected his writing. Spell -b was a blessing, for
it revealed where his usage wobbled between traditions.
> I am curious if anyone on the list remembers much
> about the development of the first spell checkers in Unix?
Yes, intimately. They had no relationship to the PDP 10.
The first one was a fantastic tour de force by Bob Morris,
called "typo". Aside from the file "eign" of the very most common
English words, it had no vocabulary. Instead it evaluated the
likelihood that any particular word came from a source with the
same letter-trigram frequencies as the document as a whole. The
words were then printed in increasing order of likelihood. Typos
tended to come early in the list.
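
The ranking idea above can be sketched in a few lines of Python. This is
not Morris's code or his actual statistic: it assumes a simple stand-in
score (the mean log-probability of a word's trigrams under the
document-wide trigram counts). The names trigrams, rank_by_peculiarity
and common_words (standing in for the "eign" file) are hypothetical.

```python
import math
import re
from collections import Counter


def trigrams(word):
    """Yield the letter trigrams of a word, padded with a space at each end."""
    padded = f" {word} "
    for i in range(len(padded) - 2):
        yield padded[i:i + 3]


def rank_by_peculiarity(text, common_words=frozenset()):
    """Return the document's distinct words, least likely first."""
    words = re.findall(r"[a-z]+", text.lower())
    # Trigram statistics come from the document itself, not from a dictionary.
    counts = Counter(t for w in words for t in trigrams(w))
    total = sum(counts.values())

    def score(word):
        # Assumed stand-in statistic: mean log-probability of the word's
        # trigrams. Lower means the word looks less like the rest of the text.
        logs = [math.log(counts[t] / total) for t in trigrams(word)]
        return sum(logs) / len(logs)

    candidates = set(words) - set(common_words)
    return sorted(candidates, key=lambda w: (score(w), w))
```

On a text that repeats "the cat sat on the mat" many times and contains
"cqt" once, "cqt" sorts first, because none of its trigrams occur
anywhere else in the document.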
Typo, introduced in v3, was very popular until Steve Johnson wrote
"spell", a remarkably short shell script that (efficiently) looks
up a document's words in the wordlist of Webster's Collegiate
Dictionary, which we had on line. The only "real" coding he did
was to write a simple affix-stripping program to make it possible
to look up plurals, past tenses, etc. If memory serves, Steve's
program is described in Kernighan and Pike. It appeared in v5.
Steve's program was good, but the dictionary isn't an ideal source
for real text, which abounds in proper names and terms of art.
It also has a lot of rare words that don't pull their weight in
a spell checker, and some attractive nuisances, especially obscure
short words from Scots, botany, etc, which are more likely to
arise in everyday text as typos than by intent. Given the basic
success of Steve's program, I undertook to make a more useful
spelling list, along with more vigorous affix stripping (and a
stop list to avert associated traps, e.g. "presenation" =
"pre+senate+ion"). That has been described in Bentley's "Programming
Pearls" and in http://www.cs.dartmouth.edu/~doug/spell.pdf.
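
To make the trap concrete, here is a minimal sketch of affix stripping
with a stop list. It is an illustration only, not the rules or data of
the real spell: the word list, the prefix and suffix tables, and the
function name derive are all hypothetical.

```python
WORDS = {"senate", "present", "walk"}   # toy word list (hypothetical)
STOP = {"presenation"}                  # derivations known to be bogus
PREFIXES = ["pre", "re", "un"]
# (suffix, replacement): strip the suffix, then append the replacement.
SUFFIXES = [("ation", ""), ("ion", "e"), ("ion", ""),
            ("ed", ""), ("s", ""), ("ing", "")]


def derive(word, use_stop=True):
    """Return a derivation such as 'walk+ed' if the word is accepted, else None."""
    if use_stop and word in STOP:
        return None                     # the stop list overrides any derivation
    if word in WORDS:
        return word
    for suffix, repl in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            found = derive(word[:-len(suffix)] + repl, use_stop)
            if found:
                return f"{found}+{suffix}"
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            found = derive(word[len(prefix):], use_stop)
            if found:
                return f"{prefix}+{found}"
    return None
```

With the stop list disabled, derive("presenation", use_stop=False)
returns "pre+senate+ion", so the misspelling would be accepted. With the
stop list enabled it returns None, and the word is reported.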
Morris's program and mine labored under space constraints, so
have some pretty ingenious coding tricks. In fact Morris has
a patent on the way he counted frequencies of the 26^3 trigrams
in 26^3 bytes, even though the counts could exceed 256. I did
some heroic (and probabilistic) encoding to squeeze a 30,000
word dictionary into a 64K data space.
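
One well-known way to keep a large count in a single byte is approximate
(probabilistic) counting, where the register holds roughly the logarithm
of the count and is bumped with decreasing probability. The sketch below
illustrates that general technique. Whether it matches the patented
scheme used in typo is an assumption, and the names increment and
estimate are hypothetical.

```python
import random


def increment(register, rng):
    """Record one event in a one-byte register holding about log2(count)."""
    # Bump the register with probability 2**-register, so each step up
    # stands for roughly twice as many events as the step before it.
    if register < 255 and rng.random() < 2.0 ** -register:
        register += 1
    return register


def estimate(register):
    """Estimate how many events produced this register value."""
    return 2 ** register - 1


rng = random.Random(1)
reg = 0
for _ in range(100_000):
    reg = increment(reg, rng)
# reg stays far below 255 even though 100,000 events were recorded;
# estimate(reg) gives the order of magnitude, not the exact count.
```

The trade is precision for space: the estimate is only good to within a
factor of two or so, which is enough to tell common trigrams from rare
ones.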
Doug
Hi,
I found this paper by bwk referenced in the Unix manpages,
in v4 as: TROFF Made Trivial (unpublished),
in v5 as: TROFF Made Trivial (internal memorandum),
also in the v6 "Unix Reading List",
but not anymore in v7.
Anyone have a copy or a scan?
--
Leah Neukirchen <leah(a)vuxu.org> http://leah.zone
> From: Larry McVoy
> So tape I can see being more weird, but isn't raw disk just "don't put
> it in buffer cache"?
On machines/controllers which are capable of it, with raw devices DMA happens
directly into the buffers in the process (which obviously has to be resident
while the I/O is happening).
Noel
> From: Will Senn
> I don't quite know how to investigate this other than to pore through the
> pdp11/40 instruction manual.
One of these:
https://www.ebay.com/itm/Digital-pdp-Programming-Card-8-Pages/142565890514
is useful; it has a list of all the opcodes in numerical order; something none
of the CPU manuals have, to my recollection. Usually there are a flock of
these "pdp11 Programming Cards" on eBait, but I only see this one at the
moment.
If you do any amount of work with PDP-11 binary, you'll soon find yourself
recognizing the common instructions. E.g. MOV is 01msmr (octal), where 'm' is
a mode specifier, and s and r are source and destination register
numbers. (That's why PDP-11 people are big on octal; the instructions are easy
to read in octal.) More here:
http://gunkies.org/wiki/PDP-11_architecture#Operands
So 0127xx is a move of an immediate operand.
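
The octal field layout described above is easy to check mechanically. A
small sketch, assuming only the word form of MOV (opcode 01) and using
hypothetical function names:

```python
def decode_mov(word):
    """Split a 16-bit PDP-11 MOV instruction into its four octal fields."""
    if (word >> 12) != 0o01:
        raise ValueError(f"{word:06o} is not a word MOV (opcode 01)")
    return {
        "src_mode": (word >> 9) & 0o7,
        "src_reg": (word >> 6) & 0o7,
        "dst_mode": (word >> 3) & 0o7,
        "dst_reg": word & 0o7,
    }


def is_immediate_source(word):
    """Mode 2 (autoincrement) on register 7 (the PC) fetches the next word."""
    fields = decode_mov(word)
    return fields["src_mode"] == 2 and fields["src_reg"] == 7
```

For example, decode_mov(0o012700) gives source mode 2 on register 7 and
destination mode 0 on register 0, i.e. a move of an immediate operand
into R0.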
>> You don't need to mount it on DECTape drive - it's just blocks. Mount
>> it as an RK05 image, or a magtape, or whatever.
> I thought disk (RK05) and tape (magtape) blocks were different...
Well, you need to differentiate between DECtape and magtape - very different
beasts.
DECtape on a PDP-11 _only_ supports 256 word (i.e. 512 byte) blocks, the same
as most disks. (Floppies are an exception when it comes to disks - sort
of. The hardware supports 128/256 byte sectors, but the usual driver - not in
V6 or V7 - invisibly makes them look like 512-byte blocks.)
Magtapes are complicated, and I don't remember all the details of how Unix
handles them, but the _hardware_ is prepared to write very long 'blocks', and
there are also separate 'file marks' which the hardware can write, and notice.
But a magtape written in 512-byte blocks, with no file marks, can be treated
like a disk; that's what the V6 distribution tapes look like:
http://gunkies.org/wiki/Installing_UNIX_Sixth_Edition#Installation_tape_con…
and IIRC 'tp' format magtapes are written the same way, hardware-wise (so
they look just like DECtapes).
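
In practice, "treated like a disk" means block n simply lives at byte
offset n * 512. A minimal sketch, assuming a raw image file with no
per-record framing (some emulator tape formats add record headers, which
this does not handle); read_block is a hypothetical helper:

```python
import io

BLOCK_SIZE = 512  # 256 sixteen-bit words


def read_block(image, n):
    """Return block n of a raw image opened in binary mode."""
    image.seek(n * BLOCK_SIZE)
    data = image.read(BLOCK_SIZE)
    if len(data) != BLOCK_SIZE:
        raise EOFError(f"block {n} is past the end of the image")
    return data


# Usage with an in-memory two-block image standing in for a real file.
image = io.BytesIO(b"\x00" * BLOCK_SIZE + b"\x01" * BLOCK_SIZE)
second = read_block(image, 1)
```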
Noel
> From: Will Senn
> (e) UNIX assembler uses the characters $ and "*" where the DEC
> assemblers use "#" and "@" respectively.
Amusing: the "UNIX Assembler Reference Manual" says:
The syntax of the address forms is identical to that in DEC assemblers,
except that "*" has been substituted for "@" and "$" for "#"; the
UNIX typing conventions make "@" and "#" rather inconvenient.
What's amusing is that in almost 40 years, it had never dawned on me that
_that_ was why they'd made the @->*, etc change! "Duhhhh" indeed!
Interesting side note: the UNIX erase/kill characters are described as being
the same as Multics', but since Bell pulled out of the Multics project fairly
early, I wonder if they'd used it long enough to get '@' and '#' hardwired
into their fingers. So I recently had the thought 'Multics was a follow-on to
CTSS, maybe CTSS used the same characters, and that's how they got burned in'.
So I looked in the "CTSS Programmer's Guide" (2nd edition), and no, according
to it (pg. AC.2.02), the erase and kill characters on CTSS were '"' and
'?'. So, so much for that theory!
> (l) The names "_edata" and "_end" are loader pseudo variables which
> define the size of the data segment, and the data segment plus the bss
> segment respectively.
That one threw me, too, when I first started looking at the kernel!
I don't recall if I found documentation about it, or just worked it out: it is
in the UPM, although not in ld(1) like one might expect (at least, not in the
V6 UPM; although in V7:
http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/man/man1/ld.1
it is there), but in end(3):
http://minnie.tuhs.org/cgi-bin/utree.pl?file=V6/usr/man/man3/end.3
Noel
Why does the first of these incantations not present text, but the
second does (word is a file)? Neither errors out.
$ <word | sed 20q
$ <word sed 20q
Thanks,
Will
--
GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
> From: Clem Cole <clemc(a)ccc.com>
> IIRC Tom Lyons started a 370 port at Princeton and finished it at
> Amdahl. But I think that was using VM
Maybe this is my lack of knowledge of VM showing, but how did having VM help
you over running on the bare hardware?
Noel