Thanks for your support, but C89 didn't specify an encoding. In classic
committee fashion, it refused to take a stand on anything that might
limit adoption. The problem was that the API it offered was clumsy and made
encoding errors hard to ignore. (When grepping a file for a string, do you
really care if there is an irrelevant binary blob in the middle that isn't
kosher UTF-8?) Also, it provided no support for printing "wide" characters.
This is all covered in the paper cited above.*
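(For readers who never met that API: below is a minimal, hedged sketch of
what scanning a buffer with C89's mbtowc looks like. The input string and
the error handling are invented for illustration; under a UTF-8 locale the
0xFF byte is an invalid sequence.)

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: every call to mbtowc can fail, and the caller must
       decide what a bad byte means. That is the clumsiness above. */
    int main(void)
    {
        const char *s = "abc\xFFxyz";       /* 0xFF: invalid in UTF-8 */
        setlocale(LC_CTYPE, "");            /* use the user's locale */
        mbtowc(NULL, NULL, 0);              /* reset any shift state */
        while (*s) {
            wchar_t wc;
            int n = mbtowc(&wc, s, MB_CUR_MAX);
            if (n <= 0) {                   /* invalid sequence: now what? */
                fprintf(stderr, "bad byte 0x%02X\n", (unsigned char)*s);
                s++;                        /* skip it and hope */
                continue;
            }
            s += n;                         /* advance by character width */
        }
        return 0;
    }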
The original UTF was compatible with ASCII but not robust if there was an
alignment problem, and it also used printable ASCII characters in multibyte
sequences. You could find a '/' inside a Cyrillic character's encoding,
which broke Unix badly. That's why FSS-UTF, File System Safe UTF, was the
name given to Prosser's variant.
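A minimal C sketch of the property that fixed this (the code point and
helper name are mine, for illustration): in UTF-8 every byte of a multibyte
sequence has its high bit set, so '/' (0x2F), or any other ASCII byte, can
never appear inside one.

    #include <assert.h>
    #include <stdio.h>

    /* Illustrative encoder for code points below U+0800:
       110xxxxx 10xxxxxx - both bytes have the high bit set. */
    static void encode2(unsigned cp, unsigned char out[2])
    {
        out[0] = 0xC0 | (cp >> 6);          /* leading byte */
        out[1] = 0x80 | (cp & 0x3F);        /* continuation byte */
    }

    int main(void)
    {
        unsigned char b[2];
        int i;
        encode2(0x0414, b);                 /* Cyrillic capital De */
        for (i = 0; i < 2; i++) {
            assert(b[i] >= 0x80);           /* never '/', never any ASCII */
            printf("%02X ", b[i]);          /* prints: D0 94 */
        }
        printf("\n");
        return 0;
    }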
It's wrong to give us credit for properties we didn't introduce. But UTF-8
is more regular, simpler to encode and decode, and more robust than its
predecessors. Most important, it did introduce the self-synchronization
property, which was the key that opened the door for us at X/Open.
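A hedged sketch of what self-synchronization buys a decoder (the function
name is mine; assume well-formed UTF-8): continuation bytes all match
10xxxxxx, so from any byte offset you can find the next character boundary
by skipping at most a few bytes.

    #include <stdio.h>
    #include <string.h>

    /* Skip continuation bytes (10xxxxxx) to reach the start of
       the next character - resynchronizing from any offset. */
    static const char *resync(const char *p, const char *end)
    {
        while (p < end && (*(const unsigned char *)p & 0xC0) == 0x80)
            p++;
        return p;
    }

    int main(void)
    {
        const char *s = "a\xD0\x94z";       /* 'a', U+0414, 'z' */
        const char *mid = s + 2;            /* land mid-character */
        printf("%s\n", resync(mid, s + strlen(s)));  /* prints: z */
        return 0;
    }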
-rob
* In a classic Usenix whoops, the paper had an appendix that described
UTF-8's encoding rigorously, but that was dropped when it was published in
the conference proceedings. Perhaps that's why the RFC got in the mix and
started some of the confusion about its origin.
On Wed, Mar 22, 2023 at 1:25 PM Larry McVoy <lm@mcvoy.com> wrote:
The brilliance of UTF-8 was to encode ASCII as is. That seems obvious in
retrospect, but as Rob says, the multibyte crud in C89 was just awful,
and that was the answer at the time. Fitting ASCII in as is meant
that all of the Unix utilities, sed, grep, awk, etc., had close to no
performance hit if you were processing ASCII. That's pretty cool when
you get that and can still process Japanese et al. as well.
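A small illustration of that point (the strings here are invented): because
ASCII code points encode as themselves and never appear inside multibyte
sequences, a plain byte search is already correct on UTF-8 text, so a
byte-oriented tool needs no decoding step at all.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* A line mixing UTF-8 Cyrillic (D0 94) with plain ASCII. */
        const char *line = "grep \xD0\x94 still finds /etc/passwd";
        const char *hit = strchr(line, '/');    /* plain byte search */
        if (hit)
            printf("%s\n", hit);                /* prints: /etc/passwd */
        return 0;
    }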
I kind of cringe when I say it is brilliant to not break what exists
already; to me, that's just part of what you do as an engineer. But
history has shown that not breaking stuff, fitting the new into the
old, is brilliant. So kudos to Rob and Ken for doing that (but truth
be told, I'd be stunned if they didn't; they are great engineers).
On Mon, Mar 20, 2023 at 07:27:34AM +1100, Rob Pike wrote:
As my mail quoted in
https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt says,
Ken worked out a new packing that avoided all the problems with the
existing ones. He didn't alter Prosser's encoding. UTF-8, as it was later
called, was not based on anything, but it was deeply informed by a couple
of years of work coming to grips with the problem of programming with
multibyte characters. What Prosser did do, and what we - all of us - are
very grateful for, is start the conversation about replacing UTF with
something practical.
(Speaking of design by committee, the multibyte stuff in C89 was
atrocious, and I heard it was done in committee to get someone, perhaps
the Japanese, to sign off.)
Regarding Windows, Nathan Myhrvold visited Bell Labs around this time,
and we tried to talk to him about this, but he wasn't interested,
claiming they had it all worked out. We later learned what he meant, and
lamented. Not the only time someone wasn't open to hearing an idea that
might be worth hearing, but an educational one.
It's important historically to understand how all the forces came
together that day. The world was ready for a solution to international
text; the proposed character set was acceptable to most, but the ASCII
compatibility issues were unbearable; the proposed solution to that was
noxious; various committees were starting to solve the problem in
committee, leading to technical briefs of varying quality, none right;
and somehow a phone call was made one afternoon to a couple of people who
had been thinking about and working on these issues for ages, one of whom
was a genius. And it all worked out, which is truly unusual.
-rob
--
---
Larry McVoy            Retired to fishing           http://www.mcvoy.com/lm/boat