[TUHS] Re: Bell Foreign-Language UNIX Efforts

19 Mar 2023

On 19-Mar-23 7:00, segaloco via TUHS wrote:
...
  Good evening or whichever time of day you find
yourself in.  I was reading up on Japanese computer history when I got to thinking
specifically on where UNIX plays in with it all, which then led to some further curiosity
with non-English UNIX in general.
 In the midst of documentation searches/study, I've spotted French and what I believe
to be Japanese documentation bearing Bell/AT&T logos.  I've also seen a few
things pop up in German although they looked to be university resources, not something
from the Bell System.  In any case, is there any clear historical record on efforts within
the USG/USL line, or research for that matter, towards the end of foreign language support
or perhaps even single polyglot installations?  Would BSD have been more poised for this
sort of thing being more widely utilized in the academic scene?

I think the most significant development that came out of Unix regarding
internationalization was the proposal and adoption of Unicode and UTF-8.
The work was described in a 1993 USENIX Technical Conference paper:
Pike, Rob, and Ken Thompson. "Hello World or Καλημέρα κόσμε or こんにちは世界."
Proceedings of the Winter 1993 USENIX Conference. 1993.
At the time of the decision to adopt Unicode and UTF-8 in Unix (Plan 9
actually) there was no consensus on international character
representations and encodings.  Many systems extended ASCII with 8-bit
characters to represent those required in a particular country.  These
"code pages" were standardized in numerous mutually incompatible
ISO-8859-X variants.  My understanding is that for many Asian (Chinese,
Japanese, and Korean) languages the situation was even worse, with ISO
2022 being used to shift mid-string from one character set encoding to
another.
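To make the incompatibility concrete, here is a minimal Python sketch (my
own illustration, not taken from the paper) that decodes the same single
byte under three ISO-8859 parts.  The byte value 0xE1 is an arbitrary
choice, and the codec names are those of Python's standard library.

```python
# The same 8-bit value maps to a different character in each ISO-8859 part.
raw = bytes([0xE1])

decoded = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}

for name, char in decoded.items():
    # Show the Unicode code point that each code page assigns to byte 0xE1.
    print(f"{name}: U+{ord(char):04X} {char}")
```

A file containing this byte is therefore unreadable unless the reader
knows, by some out-of-band means, which variant the writer used.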
In addition, Unicode was a draft standard for unified 16-bit character
codes promoted by a group of US companies.  It was battling against the ISO
10646 draft, which had taken the approach of allocating character set
blocks to national bodies, thus creating a sparse 32-bit representation
with considerable redundancy between similar languages.  Furthermore,
the ISO 10646 standard proposed a (non-required) UTF multibyte encoding
(now known as UTF-1), which was not self-synchronized, because bytes
used for representing ASCII characters were also employed as parts of
multibyte sequences.
The Bell Labs team took the bold approach of adopting the draft Unicode
standard and an X/Open proposal for encoding multibyte characters using
only bytes with the top bit set.  At the time the encoding was known as
UTF-2; it is what we now call UTF-8.  UTF-8 makes it easier to achieve
backward compatibility in existing code; for example, code scanning a
string for the "/" file path separator will never encounter it inside
the UTF-8 representation of a non-ASCII character.
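The path-separator property can be checked with a short Python sketch.
This is only an illustration under my own assumptions: the sample path
and the helper name split_path_bytes are made up, and the 16-bit
contrast uses UTF-16 (not UTF-1, for which Python has no codec).

```python
def split_path_bytes(encoded: bytes) -> list:
    """Split a path on the ASCII "/" byte without decoding it first."""
    return encoded.split(b"/")


path = "Καλημέρα/こんにちは/hello.txt"
parts = split_path_bytes(path.encode("utf-8"))
components = [p.decode("utf-8") for p in parts]

# In UTF-8, every byte of a non-ASCII character has its top bit set,
# so none of those bytes can be mistaken for "/" (0x2F).
non_ascii_bytes_ok = all(
    b >= 0x80
    for ch in path if ord(ch) > 0x7F
    for b in ch.encode("utf-8")
)

# Contrast: in a 16-bit encoding, 0x2F can be one half of another character.
ogonek = "\u012f".encode("utf-16-be")  # LATIN SMALL LETTER I WITH OGONEK
slash_inside_utf16 = b"/" in ogonek
```

Byte-oriented code that knows nothing about Unicode thus keeps working on
UTF-8 data, whereas the same naive scan over 16-bit text would split in
the middle of a character.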
The Plan 9 choices proved wise and prescient.  I do not know how much
the Plan 9 implementation and the USENIX paper influenced further
developments (its authors may enlighten us), but in the end Unicode
converged with ISO 10646, becoming a single standard, and UTF-8 was
widely adopted.
The Plan 9 team's decision to adopt UTF-8 was by no means a given.
Consider the case of Microsoft, which released Windows NT with Unicode
support in the same year.  Windows NT supported a wide character
encoding rather than UTF-8: initially UCS-2 and later UTF-16.  To
achieve backward compatibility the Windows API offers
two functions for each call involving strings: a so-called "ANSI"
version (actually using the currently active code page) and a "Wide"
(Unicode) version.  Furthermore, text files use a byte order mark to
inform programs regarding their character representation, and in C/C++
code strings are often enclosed in a special macro (such as TEXT or _T)
to facilitate porting to wide characters.  In the end, in 2019 Microsoft
yielded, supporting
UTF-8 in its Windows API through code page 65001 (CP_UTF8), and
recommending its use.  The double APIs and BOM files are still with us
as a reminder that deficient technical decisions come at a cost.
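As a rough sketch of the extra work that BOM-marked files impose on every
reader, the following Python fragment sniffs the mark before decoding.
The function name decode_with_bom and the UTF-8 fallback are my own
assumptions; only UTF-8 and UTF-16 marks are handled, and the BOM
constants come from Python's codecs module.

```python
import codecs

# BOM byte sequences from the standard library, paired with the codec
# to use for the bytes that follow the mark.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data according to its leading BOM, if it has one."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the mark itself so that it does not leak into the text.
            return data[len(bom):].decode(encoding)
    # No mark present: the encoding has to be assumed.
    return data.decode(fallback)
```

For instance, decode_with_bom(codecs.BOM_UTF16_LE + "κόσμε".encode("utf-16-le"))
yields the original string, while unmarked input falls back to UTF-8.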
Diomidis - https://www.spinellis.gr
