Was the compressed dictionary used? - TUHS - www.tuhs.org



arnold＠skeeve.com

2 Jan 2025

12:40 p.m.

Hi.

The paper on compressing the dictionary was interesting. In the day of 20 meg disks, compressing a ~2.5 meg file down to ~0.5 meg is a big savings.

Was the compressed dictionary put into use? I could imagine that spell(1) at least would have needed some library routines to return a stream of words from it.

Just wondering. Thanks,

Arnold

Douglas McIlroy

2 Jan 2025

2:51 p.m.

I am not aware that the compressed dictionary was used for anything. Steve Johnson's first shell-script spelling-checker did make a pass over a dictionary, but not Webster's second, which would have caused lots of false negatives because it contains so many exotic small words that could result from typos.

My production spell aggressively stripped affixes and used hashing and other coding tricks to keep its "dictionary" in the limited memory of a PDP-11. (The whole story is told in https://www.cs.dartmouth.edu/~doug/spell.pdf and insightfully described by Jon Bentley in https://dl.acm.org/doi/pdf/10.1145/3532.315102.) When larger memory became available, these heroics were replaced by basic common-prefix coding patterned after Morris and Thompson, just as Arnold surmised.
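For readers who have not met common-prefix (front) coding, here is a minimal sketch in Python. It is a generic illustration only: it is not the Morris and Thompson code, and it is not the file format any version of spell used. The names `front_encode` and `front_decode` are invented for this example. Each word in a sorted list is stored as the number of leading characters it shares with the previous word, plus the remaining suffix; the decoder is a generator, which is roughly the "stream of words" routine Arnold asks about.

```python
def shared_prefix_len(a, b):
    """Count the leading characters that a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def front_encode(sorted_words):
    """Yield (shared_prefix_length, suffix) pairs for a sorted word list."""
    prev = ""
    for word in sorted_words:
        n = shared_prefix_len(prev, word)
        yield n, word[n:]
        prev = word


def front_decode(pairs):
    """Rebuild the words one at a time from (length, suffix) pairs."""
    prev = ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        yield prev


words = ["spell", "spelled", "speller", "spelling", "spells"]
encoded = list(front_encode(words))
print(encoded)
print(list(front_decode(encoded)))
```

The saving comes from the sorted order: neighbouring entries in a dictionary tend to share long prefixes, so most entries shrink to a small count and a short tail.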

Warner Losh

3:12 p.m.

On Thu, Jan 2, 2025, 7:51 AM Douglas McIlroy <douglas.mcilroy(a)dartmouth.edu> wrote:

> I am not aware that the compressed dictionary was used for anything. Steve Johnson's first shell-script spelling-checker did make a pass over a dictionary, but not Webster's second, which would have caused lots of false negatives because it contains so many exotic small words that could result from typos.

Where did the Webster's Second file come from? Did the Labs give the public domain paper dictionary to the equivalent of a typing pool and have them enter it? Or did it come from elsewhere? Or something else? How was it checked for accuracy?

Warner

Douglas McIlroy

5:20 p.m.

The word list of Webster's 2nd came from an Air Force project along with several other files, including a medical dictionary and an alphabetical list of tetragrams found in Web2--something one would expect to create for oneself nowadays. The files were freely distributed with no strings attached. We have not noticed any mistakes. The list includes 76205 entries that contain blanks or hyphens; these were omitted from the pinhead exercise.

Doug
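As a rough illustration of "something one would expect to create for oneself nowadays", the sketch below derives an alphabetical tetragram list from a word list in a few lines of Python. It assumes that "tetragram" means any run of four consecutive letters inside a word, which the message does not spell out. The sample entries and the helper names `tetragrams` and `has_blank_or_hyphen` are hypothetical; a real run would read the Web2 file instead.

```python
def has_blank_or_hyphen(entry):
    """True for multi-word or hyphenated entries (the kind left out above)."""
    return " " in entry or "-" in entry


def tetragrams(words):
    """Return the sorted, de-duplicated four-letter runs found in the words."""
    found = set()
    for word in words:
        w = word.lower()
        for i in range(len(w) - 3):
            gram = w[i:i + 4]
            if gram.isalpha():  # skip runs containing blanks, hyphens, etc.
                found.add(gram)
    return sorted(found)


# Stand-in for the real word list, which is not reproduced here.
sample = ["spell", "spelling", "well-being", "ad hoc"]
single_words = [w for w in sample if not has_blank_or_hyphen(w)]
print(len(sample) - len(single_words), "entries contain blanks or hyphens")
print(tetragrams(single_words))
```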

Warner Losh

9:19 p.m.

The BSDs since 4.4lite have added a lot of missing words, but few corrections. From FreeBSD:

Capitalized Transvaal, fixed 'stock certificate' to have a 't' and preconsoidate -> preconsolidate
Ahtena, freen, unknowen and structurelessness were removed
corelate (etc) and freend were removed as typos and only thinly supported variants.

Not bad for 50 years of nit-pickers poring over the file.

Warner

Douglas McIlroy

11:32 p.m.

Warner,

Thanks for those bugs. Here's a similar list for lucky owners of Webster's 7th Collegiate:

dissymmettric
brecia
belicoseness
assaugement

A space is missing in the pronunciation field for Ouija. There must be more bugs in other fields, which constitute the bulk of the Web7 files.

Doug

arnold＠skeeve.com

4:20 p.m.

Douglas McIlroy <douglas.mcilroy(a)dartmouth.edu> wrote:

> My production spell aggressively stripped affixes and used hashing and other coding tricks to keep its "dictionary" in the limited memory of a PDP-11. (The whole story is told in https://www.cs.dartmouth.edu/~doug/spell.pdf and insightfully described by Jon Bentley in https://dl.acm.org/doi/pdf/10.1145/3532.315102.) When larger memory became available, these heroics were replaced by basic common-prefix coding patterned after Morris and Thompson, just as Arnold surmised.

But all this would have been in the C code for spell, and not in the dictionary used, right?

Thanks,

Arnold

P.S. A few years ago I made the v10 spell available for today's systems, see https://github.com/arnoldrobbins/v10spell.
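The split Arnold asks about, logic in the program versus data in the word list, can be shown with a toy. In the Python sketch below the suffix rules live in code, while the stored "dictionary" is nothing more than a set of hash values for a few stems. This is an assumption-laden illustration, not the method of the production spell described in the linked papers: the suffix list, the `HASH_BITS` width, and the use of SHA-256 are all choices made here for convenience.

```python
import hashlib

HASH_BITS = 32                             # hypothetical hash width
SUFFIXES = ["ing", "ed", "es", "ly", "s"]  # hypothetical, far from complete


def word_hash(word):
    """Map a word to a HASH_BITS-bit integer."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << HASH_BITS) - 1)


def build_table(stems):
    """The stored 'dictionary': hash values only, no spellings."""
    return {word_hash(stem) for stem in stems}


def candidates(word):
    """Yield the word itself, then the word with each known suffix removed."""
    yield word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[:-len(suffix)]


def accepted(word, table):
    """Accept a word if it, or any stripped form of it, hashes into the table."""
    return any(word_hash(c) in table for c in candidates(word.lower()))


table = build_table(["spell", "check", "word"])
for w in ["spelling", "checked", "words", "wordz"]:
    print(w, accepted(w, table))
```

Two trade-offs are visible even at this scale: crude stripping accepts some non-words built from a valid stem plus a suffix, and a hash table can, with small probability, accept a word whose hash collides with a stored one.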

Grant Taylor

3:13 p.m.

On 1/2/25 6:40 AM, arnold(a)skeeve.com wrote:

> The paper on compressing the dictionary was interesting. In the day of 20 meg disks, compressing a ~ 2.5 meg file down to ~ .5 meg is a big savings.

It's even more important when sending data across the wire.

> Was the compressed dictionary put into use? I could imagine that spell(1) at least would have needed some library routines to return a stream of words from it.

I couldn't help but think about the DNS on-wire compression format, which will re-use part of the existing query name to de-duplicate later parts of the same query name. I know it's not the same, but it felt un-ignorably close in both purpose and method.

--
Grant. . . . unix || die
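A small sketch of the DNS name compression Grant mentions, following the scheme in RFC 1035 section 4.1.4: a name is a sequence of length-prefixed labels ending in a zero byte, and a repeated tail can be replaced by a two-byte pointer whose top two bits are set and whose low 14 bits give the offset of the earlier occurrence. The simplifications are assumptions of this example: offsets are counted from the start of the buffer rather than from the start of a real message with its header, names are lowercased, and label-length limits are not checked. The function name `encode_names` is made up here.

```python
def encode_names(names):
    """Encode names back to back, pointing at any tail already written."""
    buf = bytearray()
    seen = {}  # tuple of labels (a name tail) -> offset of its first copy
    for name in names:
        labels = name.rstrip(".").lower().split(".")
        for i in range(len(labels)):
            tail = tuple(labels[i:])
            if tail in seen:
                # Pointer: 0b11 followed by a 14-bit offset, big-endian.
                buf += (0xC000 | seen[tail]).to_bytes(2, "big")
                break
            if len(buf) < 0x4000:  # only offsets that fit in 14 bits
                seen[tail] = len(buf)
            label = labels[i].encode("ascii")
            buf.append(len(label))
            buf += label
        else:
            buf.append(0)  # no pointer used: end with the root label
    return bytes(buf)


packed = encode_names(["www.example.com", "mail.example.com"])
print(len(packed), packed)
```

The resemblance to common-prefix coding is that both replace a repeated part with a short reference; the difference is that DNS shares tails (suffixes) of names through pointers, while a front-coded word list shares prefixes with the immediately preceding entry.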

John Levine

3 Jan 2025

3:14 a.m.

It appears that Grant Taylor via TUHS <gtaylor(a)tnetconsulting.net> said:

> I couldn't help but think about the DNS on wire compression format which will re-use part of the existing query name to de-duplicate later parts of the same query name. I know it's not the same, but it felt un-ignorably close in both purpose and method.

Lempel and Ziv published the LZ77 paper in 1977 (hence the name) which uses back pointers into a sliding window of text. Later tweaks brought us LZ78 and compress and gzip.

There are really only two ways to compress data: use a variable length coding scheme with the shortest codes for the most common tokens, or a dictionary that uses pointers to repeated strings. Huffman invented the former in 1951, Lempel and Ziv the latter in 1977, although as we've seen people did special purpose versions of the dictionary approach like this one. Modern schemes use combinations of both.

The DNS data formats were invented in about 1982 but I have no idea whether Mockapetris was familiar with LZ. I suppose I could ask him.

R's,
John
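The two families John describes can each be shown in a few lines of Python. What follows is a toy under stated assumptions, not the format used by compress or gzip. `huffman_code_lengths` computes only the code length per symbol (shorter for more frequent symbols). `lz77_encode` does a greedy search for the longest earlier match and emits (distance, length, next character) triples. All function names and the `window` parameter are invented for this example.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code over text."""
    counts = Counter(text)
    # Heap entries: (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def lz77_encode(data, window=255):
    """Greedy LZ77: list of (distance back, match length, next char)."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Stop one short of the end so a literal next char always exists.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        out.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decode(triples):
    """Replay the back pointers; copying char by char handles overlaps."""
    out = []
    for dist, length, ch in triples:
        for _ in range(length):
            out.append(out[-dist])
        out.append(ch)
    return "".join(out)


print(huffman_code_lengths("aaaabbc"))
text = "abracadabra abracadabra"
print(lz77_encode(text))
print(lz77_decode(lz77_encode(text)) == text)
```

A format that combines both, as John notes modern schemes do, would first find back pointers and then assign variable-length codes to the resulting tokens.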
