Bell Foreign-Language UNIX Efforts

List overview All Threads
Download

newer

older

3B2 System V 3.2 make

NetBSD turns 30

segaloco

19 Mar 2023 19 Mar '23

5 a.m.

Good evening or whichever time of day you find yourself in. I was reading up on Japanese computer history when I got to thinking specifically on where UNIX plays in with it all, which then lead to some further curiosity with non-English UNIX in general. In the midst of documentation searches/study, I've spotted French and what I believe to be Japanese documentation bearing Bell/AT&T logos. I've also seen a few things pop up in German although they looked to be university resources, not something from the Bell System. In any case, is there any clear historical record on efforts within the USG/USL line, or research for that matter, towards the end of foreign language support or perhaps even single polyglot installations? Would BSD have been more poised for this sort of thing being more widely utilized in the academic scene? - Matt G.

Show replies by date

Diomidis Spinellis

19 Mar 19 Mar

1:32 p.m.

On 19-Mar-23 7:00, segaloco via TUHS wrote:

...

I think the most significant development that came out of Unix regarding internationalization was the proposal and adoption of Unicode and UTF-8. This was published in 1993 in the USENIX Technical Conference proceedings: Pike, Rob, and Ken Thompson. "Hello World or Καλημέρα κόσμε or こんにちは世界." Proceedings of the Winter 1993 USENIX Conference. 1993. At the time of the decision to adopt Unicode and UTF-8 in Unix (Plan 9 actually) there was no consensus on international character representations and encodings. Many systems extended ASCII with 8 bit characters to represent those required in a particular country These "code pages" were standardized in numerous mutually incompatible ISO-8859-X variants. My understanding is the for many Asian (Chinese, Japanese, and Korean) languages the situation was even worse, with ISO 2022 being used to shift mid-string from one character set encoding to another. In addition, Unicode was a draft standard for unified 16-bit character codes promoted by a group US companies. It was battling against the ISO 10646 draft, which had taken the approach of allocating character set blocks to national bodies, thus creating a sparse 32-bit representation with considerable redundancy between similar languages. Furthermore, the ISO 10646 standard proposed a (non-required) UTF multibyte encoding (now known as UTF-1), which was not self-synchronized, because bytes used for representing ASCII characters were also employed as parts of multibyte sequences. The Bell Labs team took the bold approach of adopting the draft Unicode standard and an X-Open proposal for encoding multibyte characters only using bytes with the top bit set. At the time the encoding was known as UTF-2; it is what we now call UTF-8. UTF-8 makes it easier to achieve backward compatibility in existing code; for example code scanning for the "/" file path separation character in a string, will never encounter it in the UTF-8 representation of non-ASCII characters. The Plan 9 choices proved wise and prescient. I do not know how much the Plan 9 implementation and the USENIX paper influenced further developments (its authors may enlighten us), but in the end Unicode converged with ISO 10646 becoming a single standard, and UTF-8 was widely adopted. The Plan 9 team's decision to adopt UTF-8 was by no means a given. Consider the case of Microsoft, which released Windows NT with Unicode support in the same year. Microsoft's Windows NT 1993 offering supported a wide character encoding, not UTF-8: initially UCS-2 and later UTF-16. To achieve backward compatibility the Windows API offers two functions for each call involving strings: a so-called "ANSI" version (actually using the currently active code page) and a "Wide" (Unicode) version. Furthermore, text files use a byte order mark to inform programs regarding their character representation, and in C/C++ code strings are often enclosed in a special macro to facilitate porting to wide characters. In the end, in 2019 Microsoft yielded, supporting UTF-8 in its Windows API through code page 65001 (CP_UTF8), and recommending its use. The double APIs and BOM files are still with us as a reminder that deficient technical decisions come at a cost. Diomidis - https://www.spinellis.gr

Ralph Corderoy

1:47 p.m.

Hi Diomidis,

...

The Bell Labs team took the bold approach of adopting the draft Unicode standard and an X-Open proposal for encoding multibyte characters only using bytes with the top bit set. At the time the encoding was known as UTF-2; it is what we now call UTF-8.

Rob and Ken altered Dave Prosser's FSS-UTF rather than just adopting an existing proposal, according to https://en.wikipedia.org/wiki/UTF-8#History More history, including Bell Labs emails from ’92: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt -- Cheers, Ralph.

Rob Pike

8:27 p.m.

As my mail quoted in https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt says, Ken worked out a new packing that avoided all the problems with the existing ones. He didn't alter Prosser's encoding. UTF-8, as it was later called, was not based on anything but it was deeply informed by a couple of years of work coming to grips with the problem of programming with multibyte characters. What Prosser did do, and what we - all of us - are very grateful for, is start the conversation about replacing UTF with something practical. (Speaking of design by committee, the multibyte stuff in C89 was atrocious, and I heard was done in committee to get someone, perhaps the Japanese, to sign off.) Regarding windows, Nathan Myrhvold visited Bell Labs around this time, and we tried to talk to him about this, but he wasn't interested, claiming they had it all worked out. We later learned what he meant, and lamented. Not the only time someone wasn't open to hear an idea that might be worth hearing, but an educational one. It's important historically to understand how all the forces came together that day. The world was ready for a solution to international text, the proposed character set was acceptable to most but the ASCII compatibility issues were unbearable, the proposed solution to that was noxious, various committees were starting to solve the problem in committee, leading to technical briefs of varying quality, none right, and somehow a phone call was made one afternoon to a couple of people who had been thinking and working these issues for ages, one of whom was a genius. And it all worked out, which is truly unusual. -rob

arnold＠skeeve.com

20 Mar 20 Mar

7:55 a.m.

Hi Rob. Rob Pike <robpike(a)gmail.com> wrote:

...

(Speaking of design by committee, the multibyte stuff in C89 was atrocious, and I heard was done in committee to get someone, perhaps the Japanese, to sign off.)

It's not lovely, but I wouldn't call it atrocious. It gets the job done; code using it can handle multibyte encodings while being totally character-set agnostic. I speak from experience, gawk does this. (I use the "restartable" routins - mbrlen() and so on.) I understand that Unicode + UTF-8 solve the issue completely. But I'd like to ask, in all seriousness and so that I can learn, given the world as it was in 1989, how would you solve the problem? If you had designed the C level routines, what would they have looked like? Thanks, Arnold

Rob Pike

9:22 a.m.

...

Hi Rob. Rob Pike <robpike(a)gmail.com> wrote:

(Speaking of design by committee, the multibyte stuff in C89 was

atrocious,

and I heard was done in committee to get someone, perhaps the Japanese,

sign off.)

arnold＠skeeve.com

11:02 a.m.

Looking at the current man pages, the interfaces are very simple, which is fine. Do you think they would have worked for character sets with shift states and such? Thanks, Arnold Rob Pike <robpike(a)gmail.com> wrote:

...

Exactly the way we did it in Plan 9, and published in the paper cited earlier. In fact, it's possible the library work was done as early as 1989, but I'm not sure. Certainly by 1990. -rob On Mon, Mar 20, 2023 at 6:55 PM <arnold(a)skeeve.com> wrote: > Hi Rob. > > Rob Pike <robpike(a)gmail.com> wrote: > > > (Speaking of design by committee, the multibyte stuff in C89 was > atrocious, > > and I heard was done in committee to get someone, perhaps the Japanese, > to > > sign off.) > > It's not lovely, but I wouldn't call it atrocious. It gets the job > done; code using it can handle multibyte encodings while being totally > character-set agnostic. I speak from experience, gawk does this. > (I use the "restartable" routins - mbrlen() and so on.) > > I understand that Unicode + UTF-8 solve the issue completely. But I'd > like to ask, in all seriousness and so that I can learn, given the world > as it was in 1989, how would you solve the problem? If you had designed > the C level routines, what would they have looked like? > > Thanks, > > Arnold >

Steffen Nurpmeso

3:44 p.m.

arnold(a)skeeve.com wrote in <202303200755.32K7tIeW023352(a)freefriends.org>: |Rob Pike <robpike(a)gmail.com> wrote: |> (Speaking of design by committee, the multibyte stuff in C89 was \ |> atrocious, |> and I heard was done in committee to get someone, perhaps the Japanese, \ |> to |> sign off.) | |It's not lovely, but I wouldn't call it atrocious. It gets the job |done; code using it can handle multibyte encodings while being totally No it does not. |character-set agnostic. I speak from experience, gawk does this. However note that even something like "uppercase this string" cannot be done the right way, because a truly Unicode aware operation needs to look at the entire string (sentence), because there may be interdependencies that modify the result. Therefore the entire isw*() and tow*() series is simply wrong. And therefore gawk does this wrong, too. (But the GNU environment does have a solution, i think.) |(I use the "restartable" routins - mbrlen() and so on.) Yes. |I understand that Unicode + UTF-8 solve the issue completely. But I'd In fact to do it right you need something like ICU. There are special number systems, they do not fit ISO C. There are special grammatical rules to obey, which especially hurts regarding everything truly collation aware. (And then my brain simply runs away from the thinking that invented strcoll(3) for anything beyond all-american ten inch.) |like to ask, in all seriousness and so that I can learn, given the world |as it was in 1989, how would you solve the problem? If you had designed |the C level routines, what would they have looked like? P.S.: no, no, and one more no. If you want to have a nice Monday, please have a look at NetBSD current source code, lib/libc/gen/vis.c. There you see how good this interface "gets the job done". And i saw it evolve as the commits of Christos Zoulas flew by, ten years or so ago. No. |Thanks, Then again it all does not matter since IETF and more simply throw one more thing upon the other, so that you need a JSON library for a key=value list, and a HTTP, HTTP/2 and HTTP/3 library to download it over TLS (i think the entire world now proxies all protocols over :443, which makes it safer, and administration easier! .. i have heard). Why did you invent 16-bit ports by then? What were you thinking? One is enough, and much safer! That makes me wonder how OpenBSD could introduce two remotes holes for only one port, .. but that likely is a different story. Hysterical on a Monday, and that on Equinox. --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)

John Cowan

10:01 p.m.

On Mon, Mar 20, 2023 at 4:48 PM Steffen Nurpmeso <steffen(a)sdaoden.eu> wrote: However note that even something like "uppercase this string"

...

cannot be done the right way, because a truly Unicode aware operation needs to look at the entire string (sentence), because there may be interdependencies that modify the result.

If you are talking about downcasing Greek Σ, then it's true that always downcasing Σ to σ is inadequate. Unicode specifies that if the Σ appears before a space or punctuation mark, it downcases to ς instead. But this is not always correct. For example, if the string "ΦΙΛΟΣ." is the word "φιλοσ" (meaning 'beloved' or 'friend') at the end of a sentence, "φιλοσ." is the correct downcasing. But if it is the abbreviation for "φιλοσοφία", meaning "philosophy", then the correct downcasing is "φιλοσ." So getting this right is an AI-complete problem which neither Unicode nor ICU can solve.

Steffen Nurpmeso

10:28 p.m.

John Cowan wrote in <CAD2gp_TgTFL5agm8Z=immnGiMkpELL-wM_ZXos8OcKngw=2DLw(a)mail.gmail.com>: |On Mon, Mar 20, 2023 at 4:48 PM Steffen Nurpmeso <steffen(a)sdaoden.eu> \ |wrote: | |However note that even something like "uppercase this string" |> cannot be done the right way, because a truly Unicode aware |> operation needs to look at the entire string (sentence), because |> there may be interdependencies that modify the result. | |If you are talking about downcasing Greek Σ, then it's true that always |downcasing Σ to σ is inadequate. Unicode specifies that if the Σ appears |before a space or punctuation mark, it downcases to ς instead. But this is |not always correct. | |For example, if the string "ΦΙΛΟΣ." is the word "φιλοσ" (meaning 'beloved' |or 'friend') at the end of a sentence, "φιλοσ." is the correct downcasing. |But if it is the abbreviation for "φιλοσοφία", meaning "philosophy", then |the correct downcasing is "φιλοσ." So getting this right is an AI-complete |problem which neither Unicode nor ICU can solve. Oh, i'd wish i only would be able to speek/read/write (old) Greek. Unfortunately, after English, i either had to go to another school or choose in between French and Latin, (i would have given everything for Chinese, Japanese, and/or Russian), so i had chosen Latin. And whereas i started out as one of the three best, i then watched an Interview with a CDU ("republican") state secretary, with the wonderful Lea Rosh, and he talked Latin; and whereas she repeatedly said "i understand you, but what is with the audience?", you know, i as a young teenager, i was _so_ pissed that "i quit", as like in the book "The Tin Drum" of Günter Grass. So this made my grade point average a bit weaker. But yes, i think quite a lot of languages have this problem. Even my own native language German for the conversion of the lowercase sharp-s, even though for over hundred years some try to establish an uppercase variant, which the Swiss tongue has. (Mind you, even after WWII when that uppercase ss was forbidden, at least in some dosage forms, like that one used by the US rock band Kiss, ..not.) If you would ask on the Unicode mailing-list, you will be told to only convert entire sentences. But it seems Greek sigma is very special, says Unicode FAQ. --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)

Larry McVoy

22 Mar 22 Mar

2:25 a.m.

The brilliance of UTF-8 was to encode ASCII as is. That seems obvious in retrospect but as Rob says, the multibyte crud in C89 was just awful, and that was the answer at the time. Fitting ASCII in as is meant that all of the Unix utilities, sed, grep, awk, etc, had close to no performance hit if you were processing ascii. That's pretty cool when you get that and you can process Japanese et al as well. I kind of cringe when I say it is brilliant to not break what exists already, to me, that's just part of what you do as an engineer. But history has shown that not breaking stuff, fitting the new into the old, is brilliant. So kudos to Rob and Ken for doing that (but truth be told, I'd be stunned if they didn't, they are great engineers). On Mon, Mar 20, 2023 at 07:27:34AM +1100, Rob Pike wrote:

...

-- --- Larry McVoy Retired to fishing http://www.mcvoy.com/lm/boat

Rob Pike

2:52 a.m.

Thanks for your support but C89 didn't specify an encoding. In classic committee fashion, it refused to take a stand about anything that might limit adoption. The problem was that the API it offered was clumsy and made encoding errors hard to ignore. (Grepping a file for a string, do you really care if there is an irrelevant binary blob in the middle that isn't kosher UTF-8?) Also, it provided no support for printing "wide" characters. This is all covered in the paper cited above.* The original UTF was compatible with ASCII but not robust if there was an alignment problem, and also used printable ASCII characters in multibyte sequences. You could find a '/' inside a Cyrillic character encoding, which broke Unix badly. That's why FSS-UTF, File-safe UTF, was the name given to Prosser's variant. It's wrong to give us credit for properties we didn't introduce. But UTF-8 is more regular, simpler to encode and decode, and more robust than its predecessors. Most important, it did introduce the self-synchronization property, which was the key that opened the door for us at X-Open. -rob * In a classic Usenix whoops, the paper had an appendix that described UTF-8's encoding rigorously, but that was dropped when it was published in the conference proceedings. Perhaps that's why the RFC got in the mix and started some of the confusion about its origin. On Wed, Mar 22, 2023 at 1:25 PM Larry McVoy <lm(a)mcvoy.com> wrote:

...

years of work coming to grips with the problem of programming with multibyte characters. What Prosser did do, and what we - all of us - are very grateful for, is start the conversation about replacing UTF with something practical. (Speaking of design by committee, the multibyte stuff in C89 was

atrocious,

and I heard was done in committee to get someone, perhaps the Japanese,

sign off.) Regarding windows, Nathan Myrhvold visited Bell Labs around this time,

and

we tried to talk to him about this, but he wasn't interested, claiming

they

had it all worked out. We later learned what he meant, and lamented. Not the only time someone wasn't open to hear an idea that might be worth hearing, but an educational one. It's important historically to understand how all the forces came

together

that day. The world was ready for a solution to international text, the proposed character set was acceptable to most but the ASCII compatibility issues were unbearable, the proposed solution to that was noxious,

various

committees were starting to solve the problem in committee, leading to technical briefs of varying quality, none right, and somehow a phone call was made one afternoon to a couple of people who had been thinking and working these issues for ages, one of whom was a genius. And it all

worked

out, which is truly unusual. -rob

-- --- Larry McVoy Retired to fishing http://www.mcvoy.com/lm/boat

Mehdi Sadeghi

7:12 a.m.

Rob Pike

7:33 a.m.

...

It's a long shot but is that appendix around by any chance? Mehdi On 3/22/23 03:52, Rob Pike wrote: the paper had an appendix that described UTF-8's encoding rigorously, but that was dropped

arnold＠skeeve.com

7:40 a.m.

Thanks. Is there a link to postscript or pdf of the paper? I undoubtedly read it decades ago, but I doubt that I have it handy. Thanks, Arnold Rob Pike <robpike(a)gmail.com> wrote:

...

Pretty much, as it was the Plan 9 UTF man page at the time. This link will be essentially the same. http://man.cat-v.org/plan_9/6/utf -rob On Wed, Mar 22, 2023 at 6:12 PM Mehdi Sadeghi <mehdi(a)mehdix.org> wrote: > It's a long shot but is that appendix around by any chance? > > > Mehdi > > > On 3/22/23 03:52, Rob Pike wrote: > > the paper had an appendix that described UTF-8's encoding rigorously, but > that was dropped > >

Skip Tavakkolian

10:02 a.m.

http://p9f.org/sys/doc/utf.ps On Wed, Mar 22, 2023, 12:41 AM <arnold(a)skeeve.com> wrote:

...

Thanks. Is there a link to postscript or pdf of the paper? I undoubtedly read it decades ago, but I doubt that I have it handy. Thanks, Arnold Rob Pike <robpike(a)gmail.com> wrote:

Pretty much, as it was the Plan 9 UTF man page at the time. This link

will

be essentially the same. http://man.cat-v.org/plan_9/6/utf -rob On Wed, Mar 22, 2023 at 6:12 PM Mehdi Sadeghi <mehdi(a)mehdix.org> wrote: > It's a long shot but is that appendix around by any chance? > > > Mehdi > > > On 3/22/23 03:52, Rob Pike wrote: > > the paper had an appendix that described UTF-8's encoding rigorously,

but

> that was dropped > >

Skip Tavakkolian

10:09 a.m.

Also here: https://github.com/0intro/plan9/tree/main/sys/doc On Wed, Mar 22, 2023, 3:02 AM Skip Tavakkolian <fariborz.t(a)gmail.com> wrote:

...

http://p9f.org/sys/doc/utf.ps On Wed, Mar 22, 2023, 12:41 AM <arnold(a)skeeve.com> wrote:

Thanks. Is there a link to postscript or pdf of the paper? I undoubtedly read it decades ago, but I doubt that I have it handy. Thanks, Arnold Rob Pike <robpike(a)gmail.com> wrote:

Pretty much, as it was the Plan 9 UTF man page at the time. This link

will

but

> that was dropped > >

Rob Pike

12:02 p.m.

The appendix version named it plain UTF, repurposing the extant name to the new encoding. The -8 came later, as it is in these linked documents, because some people wanted a UTF-7 and a UTF-16. Those people should be punished. -rob On Wed, Mar 22, 2023 at 9:09 PM Skip Tavakkolian <fariborz.t(a)gmail.com> wrote:

...

Also here: https://github.com/0intro/plan9/tree/main/sys/doc On Wed, Mar 22, 2023, 3:02 AM Skip Tavakkolian <fariborz.t(a)gmail.com> wrote: > http://p9f.org/sys/doc/utf.ps > > On Wed, Mar 22, 2023, 12:41 AM <arnold(a)skeeve.com> wrote: > >> Thanks. Is there a link to postscript or pdf of the paper? I undoubtedly >> read it decades ago, but I doubt that I have it handy. >> >> Thanks, >> >> Arnold >> >> Rob Pike <robpike(a)gmail.com> wrote: >> >> > Pretty much, as it was the Plan 9 UTF man page at the time. This link >> will >> > be essentially the same. >> > >> > http://man.cat-v.org/plan_9/6/utf >> > >> > -rob >> > >> > >> > On Wed, Mar 22, 2023 at 6:12 PM Mehdi Sadeghi <mehdi(a)mehdix.org> >> wrote: >> > >> > > It's a long shot but is that appendix around by any chance? >> > > >> > > >> > > Mehdi >> > > >> > > >> > > On 3/22/23 03:52, Rob Pike wrote: >> > > >> > > the paper had an appendix that described UTF-8's encoding >> rigorously, but >> > > that was dropped >> > > >> > > >> >

Steffen Nurpmeso

10:33 p.m.

Rob Pike wrote in <CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ(a)mail.gmail.com>: |The appendix version named it plain UTF, repurposing the extant name to the |new encoding. The -8 came later, as it is in these linked documents, |because some people wanted a UTF-7 and a UTF-16. Those people should be |punished. I agree, but please with a but. For one especially so since UTF-7 (that i like) then didn't make it all through, but only here and there. Ie, if it would have been used for anything mail and DNS related to keep 7-bit compat. Instead they introduced monstrosities like IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc. That i hated: IDNA. If they would have said we give up on backward compatibility around Y2K, and the old stuff grows out; and 255 bytes UTF-8 is surely enough for domain names for some time (even percent encoded) even for those encodings which need four byte for one codepoint, and it simply does not work before. Like so they introduced those backward incompatibilities that they wanted to avoid. I did oppose strongly in the past, but UTF-16 has merits for some languages as well as for coding, even though you have to be able to deal with surrogates, .. and with grapheme boundaries, if you are doing it right, so 1:many is there anyhow. I mean, wchar_t is often 32-bit, and then not even UTF-32, at least possibly. But still you have the 1:many, so it buys you nothing. All-UTF-8 is of course great imho. (Asian people may disagree.) --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)

segaloco

11:33 p.m.

I've often pondered on the storage differential that non-ASCII languages rack up. Let's say one primarily stores documents in Japanese. This puts you up in 2-bytes-per-character range. If you go simply by character count, the same amount of characters take up twice the amount of actual bytes. Of course, Japanese isn't the greatest case for this being a problem as like many other non-phonetic scripts (and even with kana syllables) it takes less actual characters to convey a thought, cutting the character count for a complete sentence, even katakana/Hepburn stuff, in at least half. All in all, they may break even or even better given multi-syllable kanji. A better example of scripts that would likely suffer data bloat would be Hebrew or Arabic, although being abjads with diacritics to represent vowel sounds, you likewise land somewhere like Japanese kana where a single glyph represents what in the Latin alphabet would be at least two letters. I would imagine Cyrillic users for instance do actually have to take the storage hit involved since their entire script is outside ASCII *and* the language is a full alphabet and not an abjad nor logographic. Can't say I've worked with much Cyrillic text though. That's not even to mention scripts where diacritics may be represented by a separate individual code-plane entry requiring combination with another. This is of course, way off list, so I don't want to start a whole side-chain on it, but linguistic storage in computers has interested me for a long time, especially in my reverse engineering research of old games looking at how different studios implemented various code-pages for non-ASCII scripts. For example, I've seen plenty of older (8/16-bit) Japanese games that obviously don't use UTF-8 due to overhead in constrained console environments (or even being older than UTF-8) but also don't use ShiftJIS or other known encodings, instead opting towards their own custom code-plane to map bytes, usually to kana, although I haven't really peeked into any engines that use kanji. This was uncommon as video games were typically marketed towards children who weren't expected to know enough kanji to read complicated text. You see the same today with text associated with children's media in Japan in that hiragana syllabilary for a given kanji is displayed adjacent to it (furigana). I think one resounding conclusion of this thread though is we all owe Rob and Ken (and colleagues) a great deal for nailing this matter down in such a well-engineered way. Long live UTF-8! - Matt G. ------- Original Message ------- On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso <steffen(a)sdaoden.eu> wrote:

...

Rob Pike wrote in CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ(a)mail.gmail.com: |The appendix version named it plain UTF, repurposing the extant name to the |new encoding. The -8 came later, as it is in these linked documents, |because some people wanted a UTF-7 and a UTF-16. Those people should be |punished. I agree, but please with a but. For one especially so since UTF-7 (that i like) then didn't make it all through, but only here and there. Ie, if it would have been used for anything mail and DNS related to keep 7-bit compat. Instead they introduced monstrosities like IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc. That i hated: IDNA. If they would have said we give up on backward compatibility around Y2K, and the old stuff grows out; and 255 bytes UTF-8 is surely enough for domain names for some time (even percent encoded) even for those encodings which need four byte for one codepoint, and it simply does not work before. Like so they introduced those backward incompatibilities that they wanted to avoid. I did oppose strongly in the past, but UTF-16 has merits for some languages as well as for coding, even though you have to be able to deal with surrogates, .. and with grapheme boundaries, if you are doing it right, so 1:many is there anyhow. I mean, wchar_t is often 32-bit, and then not even UTF-32, at least possibly. But still you have the 1:many, so it buys you nothing. All-UTF-8 is of course great imho. (Asian people may disagree.) --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)

Warren Toomey

23 Mar 23 Mar

12:01 a.m.

All, the character encoding thread has drifted well away from Unix. Can we move it to COFF now? Thanks, Warren

Edouard Klein

19 Mar 19 Mar

1:38 p.m.

Hi, I just got off the phone with my father, who was working at the CII, then Bull, in the eighties through to the early nineties. Here is what I learned: Bull ported UNIX to their own line of Mitra 125: - https://fr.wikipedia.org/wiki/Mitra_15 and later to the SPS-9 which is from what I understand a license-built ridge 32, somewhat comparable to a VAX : - https://www.lemonde.fr/archives/article/1984/11/21/bull-va-fabriquer-en-fra… - http://www.histoireinform.com/Histoire/+infos2/chr5infa.htm - https://retrocomputingforum.com/t/ridge-32-a-bitsliced-early-risc-graphical… These efforts included first a translation of all the manuals from english to french. The translated manuals sometimes still wore the AT&T logo. The strings associated with error codes, and such were also tranlated in the source. This proved insufficient and awkward, and then a real internationalization effort was spearheaded by Bull. Anecdote: the abbrevation i18n for "Internationalization" was coined by Pascal Beyls, his boss at the time. My father was the representative for Bull at X/Open. Internationalization was part of the normalization process. I have on my desk volume 3 of the X/Open portability guide, whose section 3 is entirely dedicated to internationalization. The document is dated december 1988 and therefore predates UTF8 (92 if I'm correct) and even unicode (but I was blowing my first birthday candle at the time so my memory of the events may be a bit fuzzy). The document state that 8 bits are enough for western european languages, and 16 bits should do for asian languages. It promotes the use of ISO-8859-1. It notes that UNIX is limited to 7-bit ASCII and is therefore as is unsuitable for internationalization. It promotes the LANG and LC_* env variables as an annoucement mechanism, and gives examples where the locale is set to french. It also explains the "Message catalogues" with stores "messages [...] separate from the logic of a program, to be translated into several languages, and to be retrieved at run-time according to the language requirements of each user". If you want to dig into a fun part of foreign UNIX history, my father mentions that along with the corresponding Bull hardware, UNIX was sold to the USSR, along with the translated (in russian) manuals. So somewhere in a gas field in siberia may exist a russian UNIX manual with the BULL and AT&T logo. They only had the binaries, not the source of the system. There were a few "homegrown unix-like" efforts in europe, and bull was part of a few of them. BULL then moved to Linux, that's another phone call to make. If you want to dig more into that, the CNAM (a school/museum in Paris, if you are there, come see the museum it is absolutely awesome) is documenting early UNIX history and did a conference a few years back about it. Clem Cole was there and spoke. Here are some links about their efforts: - https://technique-societe.cnam.fr/colloque-international-unix-en-europe-ent… - https://www.youtube.com/watch?v=-mCSFSF-i1A <- begins with Clem's talk, but the rest is worth watching as well. Hope that helps, Cheers, Edouard. segaloco via TUHS <tuhs(a)tuhs.org> writes:

...

874

days inactive

878

days old

tuhs@tuhs.org

Manage subscription

21 comments

12 participants

tags (0)

participants (12)

arnold＠skeeve.com
Diomidis Spinellis
Edouard Klein
John Cowan
Larry McVoy
Mehdi Sadeghi
Ralph Corderoy
Rob Pike
segaloco
Skip Tavakkolian
Steffen Nurpmeso
Warren Toomey