I've often pondered the storage differential that non-ASCII languages rack up.
Let's say one primarily stores documents in Japanese. In UTF-8 this puts you up in the
3-bytes-per-character range (2 in Shift JIS or UTF-16). If you go simply by character count,
the same number of characters takes up three times the bytes of ASCII text. Of course, Japanese
isn't the greatest case for this being a problem since, like many other non-alphabetic scripts
(and even with kana syllables), it takes fewer characters to convey a thought, cutting the
character count for a complete sentence, even katakana/Hepburn stuff, by at least half.
All in all, it may break even, or even come out ahead given multi-syllable kanji. A better
example of scripts that would likely suffer data bloat would be Hebrew or Arabic, although
being abjads with diacritics to represent vowel sounds, you likewise land somewhere like
Japanese kana, where a single glyph represents what in the Latin alphabet would be at least
two letters. I would imagine Cyrillic users, for instance, do actually have to take the
storage hit involved, since their entire script is outside ASCII *and* it is a
full alphabet, neither an abjad nor logographic. Can't say I've worked with much
Cyrillic text though. That's not even to mention scripts where a diacritic may be
represented by a separate combining code point that has to be paired with a base character.
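
For a rough feel of the numbers, here is a minimal Python sketch that counts encoded bytes per character for a few scripts. The sample words are my own picks for illustration, not taken from any corpus, and averages over real text (spaces, punctuation, digits) would come out lower than these pure-script words suggest.

```python
import unicodedata

# Sample words picked only for illustration.
SAMPLES = {
    "English": "hello",
    "Russian": "привет",
    "Hebrew": "\u05e9\u05dc\u05d5\u05dd",  # "shalom", escaped to avoid bidi display issues
    "Japanese": "こんにちは",
}

def bytes_per_char(text, encoding="utf-8"):
    """Average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)

for name, word in SAMPLES.items():
    utf8_len = len(word.encode("utf-8"))
    utf16_len = len(word.encode("utf-16-le"))  # "-le" avoids counting a BOM
    print(f"{name:9} chars={len(word)} utf8={utf8_len} utf16={utf16_len}")

# A precomposed e-acute is one code point; its NFD form is "e" followed by
# the combining acute accent U+0301, which is stored as a separate code point.
composed = "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 and 3
```

The Cyrillic and Hebrew words come out at 2 bytes per character in UTF-8 and the kana word at 3, while UTF-16 stores all three at 2.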
This is, of course, way off topic for the list, so I don't want to start a whole side thread
on it, but linguistic storage in computers has interested me for a long time, especially in my
reverse engineering research of old games, looking at how different studios implemented
various code pages for non-ASCII scripts. For example, I've seen plenty of older
(8/16-bit) Japanese games that obviously don't use UTF-8 due to overhead in
constrained console environments (or simply because they predate UTF-8) but also don't use
Shift JIS or other known encodings, instead opting for their own custom character table to
map bytes, usually to kana, although I haven't really peeked into any engines that
use kanji. Kanji was uncommon because video games were typically marketed towards children who
weren't expected to know enough kanji to read complicated text. You see the same
today in text associated with children's media in Japan, where the hiragana
reading for a given kanji is displayed adjacent to it (furigana).
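
To illustrate the kind of single-byte kana table described above, here is a small Python sketch. The table layout is entirely hypothetical (byte 0x01 for the first kana, counting upward in gojuon order); it is not taken from any actual game or studio, and real tables would also need entries for dakuten, small kana, punctuation and control codes.

```python
# Hypothetical table: the 46 basic hiragana in gojuon order.
KANA = (
    "あいうえお" "かきくけこ" "さしすせそ" "たちつてと" "なにぬねの"
    "はひふへほ" "まみむめも" "やゆよ" "らりるれろ" "わをん"
)

# Assumed layout: byte 0x01 maps to the first kana, 0x02 to the second, etc.
DECODE = {index + 1: kana for index, kana in enumerate(KANA)}
ENCODE = {kana: byte for byte, kana in DECODE.items()}

def encode(text):
    """Encode a hiragana string with the one-byte-per-kana table."""
    return bytes(ENCODE[ch] for ch in text)

def decode(data):
    """Decode bytes produced by encode() back into hiragana."""
    return "".join(DECODE[b] for b in data)

word = "さくら"
raw = encode(word)
print(len(raw), len(word.encode("shift_jis")), len(word.encode("utf-8")))
```

For this three-kana word the custom table needs 3 bytes, against 6 in Shift JIS and 9 in UTF-8, which is the sort of saving that would matter on a small cartridge.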
I think one resounding conclusion of this thread though is we all owe Rob and Ken (and
colleagues) a great deal for nailing this matter down in such a well-engineered way. Long
live UTF-8!
- Matt G.
------- Original Message -------
On Wednesday, March 22nd, 2023 at 3:33 PM, Steffen Nurpmeso <steffen(a)sdaoden.eu>
wrote:
Rob Pike wrote in
CAKzdPgwYPxK9oYemG5-vPgRR7mSfj_qkjD5-iJnLffP-23PUaQ(a)mail.gmail.com:
|The appendix version named it plain UTF, repurposing the extant name to the
|new encoding. The -8 came later, as it is in these linked documents,
|because some people wanted a UTF-7 and a UTF-16. Those people should be
|punished.
I agree, but please with a but.
For one especially so since UTF-7 (that i like) then didn't make
it all through, but only here and there.
Ie, if it would have been used for anything mail and DNS related
to keep 7-bit compat. Instead they introduced monstrosities like
IDNA for DNS, mUTF-7 (locale charset -> UTF-16BE -> mUTF-7) etc.
That i hated: IDNA. If they would have said we give up on
backward compatibility around Y2K, and the old stuff grows out;
and 255 bytes UTF-8 is surely enough for domain names for some
time (even percent encoded) even for those encodings which need
four byte for one codepoint, and it simply does not work before.
Like so they introduced those backward incompatibilities that they
wanted to avoid.
I did oppose strongly in the past, but UTF-16 has merits for some
languages as well as for coding, even though you have to be able
to deal with surrogates, .. and with grapheme boundaries, if you
are doing it right, so 1:many is there anyhow. I mean, wchar_t is
often 32-bit, and then not even UTF-32, at least possibly. But
still you have the 1:many, so it buys you nothing.
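
A small Python sketch of the 1:many point, with sample characters chosen only for illustration: a code point outside the BMP takes a surrogate pair in UTF-16, and a base letter plus combining mark is still two units even in UTF-32.

```python
def utf16_units(text):
    """Number of 16-bit code units needed in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def utf32_units(text):
    """Number of 32-bit code units needed in UTF-32 (one per code point)."""
    return len(text.encode("utf-32-le")) // 4

bmp = "\u8a9e"          # CJK ideograph inside the BMP: one UTF-16 unit
astral = "\U0001F600"   # outside the BMP: needs a surrogate pair
cluster = "e\u0301"     # "e" + combining acute: one grapheme, two code points

for label, text in (("bmp", bmp), ("astral", astral), ("cluster", cluster)):
    print(label, utf16_units(text), utf32_units(text))

# The surrogate pair for U+1F600 is D83D DE00.
print(astral.encode("utf-16-be").hex())
```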
All-UTF-8 is of course great imho. (Asian people may disagree.)
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)