Re: CJK combining components

Seán Ó Séaghdha wrote:

> I'm not sure what you mean by 'distinguish' here, but from following this
> thread from the Unicode list I don't think it should be an issue. The idea
> as I understand it was to make font creation easier and font size smaller by
> composing the *glyphs* from components, not to represent the encoded
> characters as components.

As I understand it, the discussion is about both things -- but a
hypothetical "decomposed encoding" is, more often than not, the focus.

What I said on the Unicode List was that such an *encoding* could have a few
benefits, one of which is simplifying the design of fonts.

David Starner, on the Unicode List, said that the design of fonts can indeed
benefit from such a decomposition technique, but this doesn't necessarily
imply that the decomposition must be at the encoding level: it could be a
thing internal to the font.

I agreed with this, but I also said that it does not imply the opposite
either: i.e. that a decomposed encoding is impossible or undesirable.

> Each character would still have a Unicode codepoint (for backwards
> compatibility with Unicode & national standards if nothing else) and so
> would be easily distinguishable. Doing the encoding by components seems to
> me to be introducing unnecessary complexity,

This is the main objection to this idea. I tend to agree with it, but only
from a pragmatic point of view: breaking the status quo is always going to
cause trouble -- and often one concludes that this is not worth doing.

But if we ignore compatibility with existing standards (or we pretend that
we were back in 1950, when no standards existed), then the theoretical idea
of encoding ideographs by their components becomes more appealing.

It is exactly because of the theoretical and speculative character of this
discussion that I proposed to move it from the Unicode List to Qalam.

Discussing "Unicode as it could have been if it were not as it is" could be
quite dangerous on a mailing list where people come to seek practical and
precise advice on Unicode "as it actually is", but it may be interesting in
a less pragmatic environment, such as Qalam.

> for instance wouldn't you then need a non-spacing character separator to
> show where a character ended?

In the hypothetical scenario that I have been describing, some sort of
hanzi-terminator would definitely be necessary. I am not sure whether this
is also true for the scenarios that other people have in mind, however.

What I have in mind is that there could be a set of "combining radicals" and
"invisible operators" that combine to encode hanzi. Any character not in
this set (punctuation, alphabetic letters, digits, whatever) would not take
part in this combination process, and would act as a terminator for
component sequences.

When two hanzi immediately follow each other, with no intervening
punctuation (etc.), a special invisible separator character would be
inserted between them. In Unicode terms, this would probably be a
"zero-width space".
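
To make the terminator rule concrete, here is a minimal Python sketch of how
a reader of such a stream might split it into hanzi. Everything in it is an
assumption made for illustration only: the function names are hypothetical,
the existing Kangxi Radicals block (U+2F00..U+2FD5) stands in for the
"combining radicals", and the Ideographic Description Characters
(U+2FF0..U+2FFB), which are visible in real Unicode, stand in for the
"invisible operators". It only finds the boundaries; it does not try to
interpret the operators.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, assumed here to be the hanzi separator


def is_component(ch):
    """Hypothetical test: does ch belong to the combining set?"""
    cp = ord(ch)
    return (
        0x2F00 <= cp <= 0x2FD5       # Kangxi Radicals: stand-in for radicals
        or 0x2FF0 <= cp <= 0x2FFB    # IDCs: stand-in for the operators
    )


def split_hanzi(text):
    """Split text into ("hanzi", components) and ("other", char) tokens."""
    tokens = []
    current = []  # components of the hanzi being collected
    for ch in text:
        if is_component(ch):
            current.append(ch)
            continue
        # Any character outside the set terminates the pending sequence.
        if current:
            tokens.append(("hanzi", "".join(current)))
            current = []
        # The separator only marks a boundary; it is not content itself.
        if ch != ZWSP:
            tokens.append(("other", ch))
    if current:
        tokens.append(("hanzi", "".join(current)))
    return tokens


# Two adjacent hanzi (operator + two radicals each), then a full stop.
sample = "\u2ff0\u2f08\u2f4a" + ZWSP + "\u2ff0\u2f25\u2f26" + "."
for kind, value in split_hanzi(sample):
    print(kind, [f"U+{ord(c):04X}" for c in value])
```

Note that the full stop ends the second hanzi by itself, so the separator is
only needed between the two adjacent hanzi.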

There are also a few details that I should add sooner or later; they have to
do with, let's say, the "associativity and precedence" of components and of
operators.

_ Marco