[TUHS] Re: Bell Foreign-Language UNIX Efforts

22 Mar 2023

I've often pondered the storage differential that non-ASCII scripts rack up.
Let's say one primarily stores documents in Japanese.  This puts you in the
2-bytes-per-character range with the legacy double-byte encodings, and at 3 bytes per
character for kana and kanji in UTF-8.  If you go simply by character count, the same
number of characters takes up two to three times as many bytes.  Of course, Japanese isn't
the best example of this being a problem: like many other non-phonetic scripts (and even
with kana syllables) it takes fewer characters to convey a thought, cutting the
character count for a complete sentence, even for katakana versus its Hepburn
romanization, by at least half.  All in all, it may break even or even come out ahead
given multi-syllable kanji.

A better example of scripts that would likely suffer data bloat would be Hebrew or Arabic,
although being abjads, with diacritics to represent vowel sounds, they likewise land
somewhere near Japanese kana, where a single glyph represents what in the Latin alphabet
would be at least two letters.  I would imagine Cyrillic users, for instance, do actually
have to take the storage hit, since their entire script is outside ASCII *and* it is a
full alphabet, neither an abjad nor logographic.  Can't say I've worked with much
Cyrillic text though.  That's not even to mention scripts where a diacritic may be
represented by a separate code point that has to be combined with a base character.

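
To put rough numbers on the storage comparison above, here is a minimal Python sketch.
The sample words are arbitrary illustrations, not anything taken from this thread, and
storage_cost is a made-up helper name.  It counts code points, UTF-8 bytes and UTF-16
bytes for the same place name in several scripts, then shows a diacritic stored as a
separate combining code point.

```python
# Sample strings are illustrative picks (assumption), not data from the thread.
samples = {
    "English": "Tokyo",
    "Japanese (kanji)": "東京",
    "Japanese (katakana)": "トウキョウ",
    "Russian": "Токио",
    "Hebrew": "\u05d8\u05d5\u05e7\u05d9\u05d5",  # escaped to avoid bidi display issues
}


def storage_cost(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without BOM) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


for label, text in samples.items():
    chars, utf8, utf16 = storage_cost(text)
    print(f"{label:20} {chars:2} chars  {utf8:2} UTF-8 bytes  {utf16:2} UTF-16 bytes")

# The same visible letter, stored two ways:
precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT
print("precomposed:", storage_cost(precomposed))
print("decomposed: ", storage_cost(decomposed))
```

With these samples the kanji form is the shortest in characters but costs 3 bytes per
character in UTF-8, while the Cyrillic and Hebrew words keep the same character count as
the Latin one and simply double in size.
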
This is, of course, way off list, so I don't want to start a whole side-chain on it,
but linguistic storage in computers has interested me for a long time, especially in my
reverse engineering research of old games, looking at how different studios implemented
various code pages for non-ASCII scripts.  For example, I've seen plenty of older
(8/16-bit) Japanese games that obviously don't use UTF-8, due to overhead in
constrained console environments (or simply because they predate UTF-8), but also don't
use Shift JIS or other known encodings, instead opting for their own custom code page to
map bytes, usually to kana, although I haven't really peeked into any engines that
use kanji.  Kanji use was uncommon, as video games were typically marketed towards
children who weren't expected to know enough kanji to read complicated text.  You see the
same today in text associated with children's media in Japan, where the hiragana
reading of a given kanji is displayed adjacent to it (furigana).

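
As a sketch of what such a custom one-byte kana mapping could look like, here is a small
Python example.  The table layout is entirely hypothetical and is not taken from any real
game: bytes 0x01 to 0x56 are assumed to map onto the Unicode hiragana block in order,
0x00 is a space and 0xFF terminates a string.  KANA_TABLE, encode and decode are names
invented for the illustration.

```python
# HYPOTHETICAL code page (assumption): one byte per kana, not from any real game.
SPACE = 0x00
END = 0xFF

# Bytes 0x01..0x56 map to U+3041..U+3096 (the hiragana block) in order.
KANA_TABLE = {b: chr(0x3040 + b) for b in range(0x01, 0x57)}
KANA_TABLE[SPACE] = " "
REVERSE_TABLE = {ch: b for b, ch in KANA_TABLE.items()}


def encode(text):
    """Encode text with the custom table and append the END terminator."""
    return bytes([REVERSE_TABLE[ch] for ch in text] + [END])


def decode(data):
    """Decode bytes up to the END terminator; unknown bytes become '?'."""
    out = []
    for b in data:
        if b == END:
            break
        out.append(KANA_TABLE.get(b, "?"))
    return "".join(out)


greeting = "こんにちは"
packed = encode(greeting)
print(packed.hex(" "))
print(decode(packed))
print("custom:   ", len(packed), "bytes (including terminator)")
print("Shift JIS:", len(greeting.encode("shift_jis")), "bytes")
print("UTF-8:    ", len(greeting.encode("utf-8")), "bytes")
```

A table like this fits every kana in a single byte, which is the kind of saving that
would matter on a cartridge, at the cost of being unreadable by anything that does not
carry the same table.
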
I think one resounding conclusion of this thread though is we all owe Rob and Ken (and
colleagues) a great deal for nailing this matter down in such a well-engineered way.  Long
live UTF-8!
- Matt G.
------- Original Message -------
On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso <steffen(a)sdaoden.eu>
wrote:
...
  Rob Pike wrote in
 CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ(a)mail.gmail.com:
 |The appendix version named it plain UTF, repurposing the extant name to the
 |new encoding. The -8 came later, as it is in these linked documents,
 |because some people wanted a UTF-7 and a UTF-16. Those people should be
 |punished.
 I agree, but please with a but.
 For one especially so since UTF-7 (that i like) then didn't make
 it all through, but only here and there.
 Ie, if it would have been used for anything mail and DNS related
 to keep 7-bit compat. Instead they introduced monstrosities like
 IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc.
 That i hated: IDNA. If they would have said we give up on
 backward compatibility around Y2K, and the old stuff grows out;
 and 255 bytes UTF-8 is surely enough for domain names for some
 time (even percent encoded) even for those encodings which need
 four byte for one codepoint, and it simply does not work before.
 Like so they introduced those backward incompatibilities that they
 wanted to avoid.
 I did oppose strongly in the past, but UTF-16 has merits for some
 languages as well as for coding, even though you have to be able
 to deal with surrogates, .. and with grapheme boundaries, if you
 are doing it right, so 1:many is there anyhow. I mean, wchar_t is
 often 32-bit, and then not even UTF-32, at least possibly. But
 still you have the 1:many, so it buys you nothing.
 All-UTF-8 is of course great imho. (Asian people may disagree.)
 --steffen
 |
 |Der Kragenbaer,                The moon bear,
 |der holt sich munter           he cheerfully and one by one
 |einen nach dem anderen runter  wa.ks himself off
 |(By Robert Gernhardt)
