[TUHS] Character sets
bqt at update.uu.se
Mon Mar 28 09:56:31 AEST 2016
On 2016-03-28 01:30, John Cowan wrote:
> Johnny Billquist scripsit:
>>>> Haha. Yes... Except that you now have multiple representations of each
>>>> character within one character set. So what has improved???
> Mojibake, though not unknown, is now much less common, and the number
> of documents on the web that are in UTF-8 (including its ASCII subset)
> is at 85% and rising.
>>> In the Good Old Days, characters were all the same size, and you could
>>> do nice, simple things like
>>> while (*c && *c++ != ' ');
> That particular piece of code still works if the encoding is UTF-8.
> Fundamentally, Unicode is complicated because human writing systems
> are complicated.
While it is true that writing systems are complicated, I do not agree
that this is why Unicode is complicated. Unicode has surpassed the
writing systems...
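
(For what it's worth, the claim about the quoted loop can be checked directly. Below is a minimal sketch in C, assuming well-formed UTF-8 input. The function name skip_past_space is made up for illustration, and the comparison uses the character constant ' ' rather than a string literal. The loop works on UTF-8 because every byte of a multi-byte sequence has its high bit set, so the byte 0x20 can only ever be a real space.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advance past the first ASCII space, or stop at
 * the terminating NUL if there is no space. Works byte by byte, which is
 * safe for UTF-8 because 0x20 never occurs inside a multi-byte sequence. */
const char *skip_past_space(const char *c)
{
    assert(c != NULL);
    while (*c && *c++ != ' ')
        ;
    return c;
}
```

Called on the UTF-8 bytes of "smörgås bord", this returns a pointer to "bord", even though two of the letters before the space occupy two bytes each.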
>> Another one I noted a while ago was that functions and commands in
>> Unix, such as lpq, which try to print things in nice columns now
>> fail, because the code doesn't actually know how many characters have
>> been output.
> Well, if the font isn't fixed-width, you're screwed anyway. But if
> it is, there is information in the Unicode tables that tells you which
> characters have widths of 0, 1, or 2. Print programs can be modified
> to use that information.
Yeah, you just need to suck in a few gigabytes of Unicode libraries in
your 4K program. I'm not sure I agree that this is an acceptable solution.
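
(To make the trade-off concrete, here is a rough sketch in C of what a small program could do without pulling in a full Unicode library: decode UTF-8 and look widths up in a tiny range table. Everything here is an assumption for illustration. The names cp_width, utf8_next and display_columns are invented, the table covers only a handful of well-known ranges, and the decoder assumes well-formed input. A real implementation would need the full East Asian Width and combining-character data, which is exactly the size cost being argued about.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Deliberately incomplete width table: a few well-known ranges only. */
static int cp_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  /* combining diacritical marks */
    if (cp >= 0x4E00 && cp <= 0x9FFF) return 2;  /* CJK unified ideographs */
    if (cp >= 0xAC00 && cp <= 0xD7A3) return 2;  /* Hangul syllables */
    if (cp >= 0xFF01 && cp <= 0xFF60) return 2;  /* fullwidth forms */
    return 1;                                    /* everything else: assume 1 */
}

/* Decode one code point and advance *s. Assumes well-formed UTF-8;
 * always consumes at least one byte and never reads past the NUL. */
static uint32_t utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    p++;
    while (extra-- > 0 && (*p & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
    *s = p;
    return cp;
}

/* Number of terminal columns the string is expected to occupy. */
int display_columns(const char *text)
{
    const unsigned char *s = (const unsigned char *)text;
    int cols = 0;

    assert(text != NULL);
    while (*s)
        cols += cp_width(utf8_next(&s));
    return cols;
}
```

A column-printing program would then pad by display_columns() instead of strlen(). On POSIX systems the library function wcwidth() does the same job with complete data, at the cost of depending on the C library's locale tables.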
>> And let's not even talk about such wonderful concepts as colors in
>> the character set definition... Unicode seems to have it all...
> Colors are optional.
Really. So how should Green Book (U+1F4D7) be rendered differently from
Blue Book (U+1F4D8) or Orange Book (U+1F4D9)?
Curious minds want to know...
>> I wonder how many code points exist for 'A'. It's definitely more than
> Other than Greek and Cyrillic A letters, there are the math letters, which
> are used *in plain text* to designate semantic differences: plain A,
> italic A, and bold A mean different things mathematically. Using the
> math italics for emphasis or book titles is a Bad Thing.
And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)?
What is the semantic difference in having more whitespace around the
letter? (It has a compatibility decomposition to LATIN CAPITAL LETTER A
(U+0041), so under compatibility normalization (NFKC/NFKD) it compares
equal to A, but it's still a different code point.)
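
(As an illustration of the extra work this pushes onto every comparison, here is a small sketch in C. It folds only the fullwidth ASCII variants U+FF01..U+FF5E, which sit at a fixed offset of 0xFEE0 from U+0021..U+007E, and it is not a substitute for real normalization. The names fold_fullwidth and equal_folded are invented for this example.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map fullwidth ASCII variants (U+FF01..U+FF5E) to their ordinary
 * counterparts (U+0021..U+007E); leave every other code point alone. */
uint32_t fold_fullwidth(uint32_t cp)
{
    if (cp >= 0xFF01 && cp <= 0xFF5E)
        return cp - 0xFEE0;
    return cp;
}

/* Compare two arrays of n code points after folding fullwidth forms.
 * Returns 1 if they match, 0 otherwise. */
int equal_folded(const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++)
        if (fold_fullwidth(a[i]) != fold_fullwidth(b[i]))
            return 0;
    return 1;
}
```

With this, U+FF21 folds to U+0041 and compares equal to a plain A, while GREEK CAPITAL LETTER ALPHA (U+0391) is left untouched and still compares as different.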
Johnny (Yes, I do not like Unicode...)
Johnny Billquist         || "I'm on a bus
                         ||  on a psychedelic trip
email: bqt at softjar.se || Reading murder books
pdp is alive!            || tryin' to stay hip" - B. Idol