So a project I'm working on has recently come to include a need to store UTF-8 Japanese kana
text in source files for readability, but then to process those source files through tools
that are only guaranteed to support single-byte code points, with something mapping the UTF-8
code points to single-byte code points in the destination execution environment. After a bit of
futzing, I've landed on the definition of iconv(1) provided by the Single UNIX
Specification to push this character mapping concern to the tip of my pipelines. It is
working well thus far and insulates the utilities down-pipe from needing multi-byte
support (I'm looking at you, Apple).
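For a sense of the shape of things, here is a minimal sketch of one of these pipelines (the
charmap and tool names are made up; if I'm reading the spec right, a -f or -t option-argument
containing a slash is treated as the pathname of a charmap file rather than as a codeset name):

    # hypothetical file/tool names; -f and -t point at charmap files per SUS
    iconv -f ./utf8-kana.cmap -t ./sbcs-kana.cmap kana_source.txt | some_tool > out

so everything down-pipe of iconv only ever sees one byte per character.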
I started thumbing through my old manuals and noted that iconv(1) is not a historic
utility; rather, SUS picked it up from HP-UX along the way.
Was there any older utility or set of practices for converting files between character
encodings besides the ASCII/EBCDIC stuff in dd(1)? As I understand it, iconv(1) is just
recognizing sequences of bytes, mapping each to a symbolic name via one charmap file, then
emitting the sequence of bytes assigned to that same symbolic name in a second charmap file.
This sounds like a simple filter operation that could be done in a few other ways.
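To make that concrete, the pair of charmaps I have in mind looks roughly like this (the
symbolic names and the single-byte values are my own inventions for illustration; the
multi-byte values are the UTF-8 encodings of katakana A and I):

    # utf8-kana.cmap -- multi-byte side (illustrative names)
    <code_set_name> UTF8-KANA
    <mb_cur_max>    3
    CHARMAP
    <A-KANA>        \xe3\x82\xa2
    <I-KANA>        \xe3\x82\xa4
    END CHARMAP

    # sbcs-kana.cmap -- single-byte side; same symbolic names, byte values picked by me
    <code_set_name> SBCS-KANA
    <mb_cur_max>    1
    CHARMAP
    <A-KANA>        \xb1
    <I-KANA>        \xb2
    END CHARMAP

Running iconv with the first as -f and the second as -t should then rewrite each three-byte
UTF-8 sequence into the corresponding single byte, purely by matching up the symbolic names.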
I'm curious whether any particular approach was relatively ubiquitous, or whether this was an
exercise largely left to the individual, with solutions accordingly wide and varied. My tool
chain doesn't need to work on historic UNIX, but it would be cool to understand how
to make it work on the least common denominator.
- Matt G.