Re: [TUHS] Character sets

27 Mar 2016

On 2016-03-28 01:30, John Cowan wrote:
...
  Johnny Billquist scripsit:
 >> Haha. Yes... Except that you now have multiple representations of each
 >> character within one character set. So what has improved???
 Mojibake, though not unknown, is now much less common, and the number
 of documents on the web that are in UTF-8 (including its ASCII subset)
 is at 85% and rising.
 > In the Good Old Days, characters were all the same size, and you could
 > do nice, simple things like
>
>    while (*c && *c++ != ' ');
 That particular piece of code still works if the encoding is UTF-8.
 Fundamentally, Unicode is complicated because human writing systems
 are complicated. 
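
A minimal sketch of why that loop survives UTF-8, assuming the input is well-formed: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can be mistaken for an ASCII space (0x20). The function name skip_past_space is made up for illustration.

```c
#include <stddef.h>

/* Return a pointer just past the first ASCII space in s, or a pointer
 * to the terminating NUL if s contains no space.
 *
 * This is the quoted loop with a character constant. It needs no
 * knowledge of UTF-8: lead and continuation bytes of multi-byte
 * sequences are all >= 0x80, so a byte equal to 0x20 is always a
 * real space. */
const char *skip_past_space(const char *s)
{
    while (*s && *s++ != ' ')
        ;
    return s;
}
```

The same reasoning holds for any other ASCII delimiter; it does not hold for encodings such as UTF-16 or Shift JIS, where a byte in the ASCII range can be part of a larger character.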
While it is true that writing systems are complicated, I do not agree
that this is why Unicode is complicated. Unicode has surpassed the
writing systems...
...
 Another one I noted a while ago was that functions and commands in
 Unix, such as lpq, which try to print things in nice columns now
 fail, because the code doesn't actually know how many characters have
 been output.
 Well, if the font isn't fixed-width, you're screwed anyway.  But if
 it is, there is information in the Unicode tables that tells you which
 characters have widths of 0, 1, or 2.  Print programs can be modified
 to use that information. 
(...or 3)
Yeah, you just need to suck in a few gigabytes of Unicode libraries in
your 4K program. I'm not sure I agree that this is an acceptable solution.
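
For scale, here is a rough sketch of what a width calculation looks like when done by hand. It assumes well-formed UTF-8 and uses a deliberately incomplete, hypothetical table (cell_width) covering only a few well-known ranges; it is not the Unicode East Asian Width data, and a real program would more likely call POSIX wcwidth() or a library. All function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from well-formed UTF-8 at *p and advance *p.
 * No validation is done; malformed input gives garbage. */
static uint32_t next_code_point(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }

    s++;
    while (extra-- > 0)
        cp = (cp << 6) | (*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Hypothetical, incomplete width table: a handful of ranges only. */
static int cell_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                       /* combining diacritical marks */
    if ((cp >= 0x4E00 && cp <= 0x9FFF) ||   /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))     /* fullwidth forms */
        return 2;
    return 1;
}

/* Number of terminal cells s would occupy under the table above. */
int display_width(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int width = 0;

    while (*p)
        width += cell_width(next_code_point(&p));
    return width;
}
```

A column-printing program would pad by display_width() of a field rather than by its byte count. How large the table has to grow before the output is right for all input is exactly the point under dispute here.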
...
 And let's not even talk about such wonderful concepts as colors in
 the character set definition... Unicode seems to have it all...
 Colors are optional. 
Really? So how should Green Book (U+1F4D7) be rendered differently from
Blue Book (U+1F4D8) or Orange Book (U+1F4D9)?
Curious minds want to know...
...
 I wonder how many code points exist for 'A'. It's definitely more than
 one...
 Other than Greek and Cyrillic A letters, there are the math letters, which
 are used *in plain text* to designate semantic differences: plain A,
 italic A, and bold A mean different things mathematically.  Using the
 math italics for emphasis or book titles is a Bad Thing. 
And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)?
What is the semantic difference in having more whitespace around the
letter? (It has a compatibility decomposition to LATIN CAPITAL LETTER A
(U+0041), so it compares equal to A under compatibility normalization
(NFKC/NFKD), but it's still a different code point.)
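
To make the "different code point, same letter" situation concrete: the fullwidth forms U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021..U+007E, so this one case can be folded with a subtraction. The sketch below covers only that range and is not Unicode normalization; the names fold_fullwidth and equal_ignoring_width are made up for illustration, and strings are assumed to be already decoded into arrays of code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Map a fullwidth ASCII variant (U+FF01..U+FF5E) to its ordinary
 * ASCII counterpart (U+0021..U+007E); leave everything else alone. */
uint32_t fold_fullwidth(uint32_t cp)
{
    if (cp >= 0xFF01 && cp <= 0xFF5E)
        return cp - 0xFEE0;
    return cp;
}

/* Compare two code point arrays of length n, treating fullwidth
 * ASCII variants as equal to their ordinary forms.
 * Returns 1 if equal under that folding, 0 otherwise. */
int equal_ignoring_width(const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (fold_fullwidth(a[i]) != fold_fullwidth(b[i]))
            return 0;
    return 1;
}
```

A plain code point comparison of U+FF21 and U+0041 reports them as different; only a comparison that applies some such folding first treats them as the same letter.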
        Johnny (Yes, I do not like Unicode...)
--
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt(a)softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol
