On Nov 22, 2017, at 5:05 PM, Doug McIlroy <doug(a)cs.dartmouth.edu> wrote:
Steve's program was good, but the dictionary isn't an ideal source
for real text, which abounds in proper names and terms of art.
It also has a lot of rare words that don't pull their weight in
a spell checker, and some attractive nuisances, especially obscure
short words from Scots, botany, etc., which are more likely to
arise in everyday text as typos than by intent. Given the basic
success of Steve's program, I undertook to make a more useful
spelling list, along with more vigorous affix stripping (and a
stop list to avert associated traps, e.g. "presenation" =
"pre+senate+ion"). That has been described in Bentley's "Programming
Pearls" and in
http://www.cs.dartmouth.edu/~doug/spell.pdf.
This is quite interesting to me. A while ago I looked into building a spell
checker for Gujarati (a Sanskrit-based language) and found it to be a
complicated affair: words can carry multiple suffixes, since the Gujarati
equivalents of prepositions such as from/to/in are tacked on at the end of
a word. But the same endings can also appear in ordinary words. And
there are other complications. Even though the language is phonetic,
mistakes in choosing the long or short form of a vowel sign are common.
After reading your paper I am tempted to revive the effort.
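
To make the affix-stripping idea from the quoted message concrete, here is
a minimal sketch in Python. It is not the algorithm from the paper: the
word list, the affix tables, the stop list, and the function name accepts
are all hypothetical toy values chosen for illustration. It strips
prefixes and suffixes recursively (so stacked suffixes are handled) and
consults a stop list first to reject the "presenation" trap.

```python
# Toy data, assumed for illustration only; not the real spell(1) tables.
WORDS = {"senate", "present", "walk"}
PREFIXES = ["pre", "un", "re"]
SUFFIXES = ["ation", "ion", "ing", "ed", "s"]
STOP = frozenset({"presenation"})  # derivable from the tables, but a typo


def accepts(word, stop=STOP, depth=3):
    """Return True if word is in WORDS or reducible to it by affix stripping.

    Words on the stop list are rejected before any stripping is tried.
    depth bounds how many affixes may be peeled off in total.
    """
    word = word.lower()
    if word in stop:
        return False
    if word in WORDS:
        return True
    if depth == 0:
        return False
    # Try removing one prefix, keeping a remainder of at least 3 letters.
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            if accepts(word[len(p):], stop, depth - 1):
                return True
    # Try removing one suffix; also try restoring a dropped final "e"
    # (senate + ion -> senation).
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            stem = word[:-len(s)]
            if accepts(stem, stop, depth - 1):
                return True
            if accepts(stem + "e", stop, depth - 1):
                return True
    return False


if __name__ == "__main__":
    for w in ["presentation", "walked", "presenation", "xyzzy"]:
        print(w, accepts(w))
```

With an empty stop list, accepts("presenation", stop=frozenset()) succeeds
via pre + senate + ion, which is exactly the kind of false acceptance the
stop list is there to prevent. A language with freely stacked suffixes
would presumably need a larger depth and a correspondingly longer stop
list.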