On 2016-03-28 01:30, John Cowan wrote:
Johnny Billquist scripsit:
>> Haha. Yes... Except that you now have multiple representations of each
>> character within one character set. So what has improved???
Mojibake, though not unknown, is now much less common, and the number
of documents on the web that are in UTF-8 (including its ASCII subset)
is at 85% and rising.
> In the Good Old Days, characters were all the same size, and you could
> do nice, simple things like
>
> while (*c && *c++ != ' ');
That particular piece of code still works if the encoding is UTF-8.
Fundamentally, Unicode is complicated because human writing systems
are complicated.
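A minimal sketch of why that loop survives UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 or above), so a byte-wise scan for an ASCII space (0x20) can never stop in the middle of a character. The helper name `skip_past_space` is hypothetical and not from this thread.

```c
#include <stddef.h>

/* Hypothetical helper: advance past the first ASCII space in a
 * NUL-terminated string, or to the terminating NUL if there is none.
 * Safe on UTF-8 input because the lead and continuation bytes of
 * multi-byte sequences are all >= 0x80 and never compare equal to ' '. */
const char *skip_past_space(const char *c)
{
    while (*c && *c++ != ' ')
        ;
    return c;
}
```

The scan counts bytes, not characters, so it locates the space correctly but says nothing about how many columns the skipped text occupies.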
While true, I do not agree that Unicode is complicated because of
writing systems. Unicode has surpassed the writing systems...
Another one I noted a while ago was that functions and commands in
Unix, such as lpq, which try to print things in nice columns now
fail, because the code doesn't actually know how many characters have
been output.
Well, if the font isn't fixed-width, you're screwed anyway. But if
it is, there is information in the Unicode tables that tells you which
characters have widths of 0, 1, or 2. Print programs can be modified
to use that information.
(...or 3)
Yeah, you just need to suck in a few gigabytes of Unicode libraries in
your 4K program. I'm not sure I agree that this is an acceptable solution.
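For reference, the width lookup being argued about can be sketched as a range table plus a small UTF-8 decoder. This is only an illustration under loud assumptions: the two tables are a deliberately partial, hand-picked set of ranges (a real table would be generated from the Unicode data files such as EastAsianWidth.txt), the input is assumed to be valid UTF-8, and the names `cp_width` and `utf8_columns` are hypothetical. POSIX also specifies `wcwidth()` for the same job, at the cost of depending on the locale machinery.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive code point range. */
struct cp_range { uint32_t lo, hi; };

/* Deliberately partial tables, for illustration only. */
static const struct cp_range zero_width[] = {
    { 0x0300, 0x036F },   /* Combining Diacritical Marks */
    { 0x200B, 0x200D },   /* zero width space, non-joiner, joiner */
};
static const struct cp_range double_width[] = {
    { 0x1100, 0x115F },   /* Hangul Jamo initial consonants */
    { 0x3041, 0x3096 },   /* Hiragana letters */
    { 0x30A0, 0x30FF },   /* Katakana */
    { 0x4E00, 0x9FFF },   /* CJK Unified Ideographs */
    { 0xAC00, 0xD7A3 },   /* Hangul Syllables */
    { 0xFF01, 0xFF60 },   /* Fullwidth Forms */
};

static int in_table(uint32_t cp, const struct cp_range *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp >= t[i].lo && cp <= t[i].hi)
            return 1;
    return 0;
}

/* Column width of one code point under the partial tables above.
 * Anything not listed (including control characters) counts as 1. */
int cp_width(uint32_t cp)
{
    if (in_table(cp, zero_width, sizeof zero_width / sizeof zero_width[0]))
        return 0;
    if (in_table(cp, double_width, sizeof double_width / sizeof double_width[0]))
        return 2;
    return 1;
}

/* Columns occupied by a NUL-terminated string, assumed to be valid UTF-8. */
int utf8_columns(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int cols = 0;

    while (*p) {
        uint32_t cp;
        int extra;                        /* continuation bytes to read */

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;

        /* Fold in the low six bits of each continuation byte. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);

        cols += cp_width(cp);
    }
    return cols;
}
```

A column-printing program would pad each field by its `utf8_columns()` result rather than its byte length. Whether a complete table is an acceptable cost for a small utility is exactly the point in dispute above; the sketch only shows the shape of the lookup.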
And let's not even talk about such wonderful concepts as colors in
the character set definition... Unicode seems to have it all...
Colors are optional.
Really. So how should Green Book (U+1F4D7) be rendered differently from
Blue Book (U+1F4D8), or Orange Book (U+1F4D9)?
Curious minds want to know...
I wonder how many code points exist for 'A'. It's definitely more than
one...
Other than Greek and Cyrillic A letters, there are the math letters, which
are used *in plain text* to designate semantic differences: plain A,
italic A, and bold A mean different things mathematically. Using the
math italics for emphasis or book titles is a Bad Thing.
And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)?
What is the semantic difference in having more whitespace around the
letter? (It has a compatibility decomposition to LATIN CAPITAL LETTER A
(U+0041), so for string comparisons under compatibility normalization it
is equal to A, but it's still a different code point.)
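To make the fullwidth case concrete: the fullwidth forms U+FF01 through U+FF5E sit in the same order as ASCII U+0021 through U+007E, a fixed 0xFEE0 apart, so that one slice of compatibility folding is a single subtraction. The sketch below covers only this slice and is not a substitute for full NFKC/NFKD normalization; `fold_fullwidth` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: map FULLWIDTH forms U+FF01..U+FF5E to their ASCII
 * counterparts U+0021..U+007E. The two ranges are laid out in parallel,
 * 0xFEE0 apart. All other code points pass through unchanged. */
uint32_t fold_fullwidth(uint32_t cp)
{
    if (cp >= 0xFF01 && cp <= 0xFF5E)
        return cp - 0xFEE0;
    return cp;
}
```

With this, `fold_fullwidth(0xFF21)` yields 0x41, while a raw code point comparison of U+FF21 and U+0041 still reports them as different. GREEK CAPITAL LETTER ALPHA (U+0391) is untouched, since it has no decomposition to the Latin letter.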
Johnny (Yes, I do not like Unicode...)
--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt(a)softjar.se           ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol