Thomas Chan wrote:
> On the sci.lang newsgroup, someone gave an even better
> answer[1] to Jon
> Babcock's query. U+54E1 and U+5504, U+6688 and U+6689, and U+4F1A and
> U+4F1D were given as three examples where: 1) the components were the
> same, but 2) meanings different, as well as 3) writing sequence was
> identical--the third restriction not among the criteria that Jon asked
> for, nor present in my own examples. (I don't completely agree with
> the third pair, though.)

(( To put my comments below in context: I am still wondering about a
hypothetical computer encoding for Chinese characters whose units (the
"characters") are components rather than complete hanzi. Sorry if I am
misunderstanding Jon's question and/or Thomas's reply... I'm afraid that
this change of list rather broke the thread of the discussion. ))

Cases like the pair U+54E1 and U+5504 (員 and 唄) are what made me say before
that positioning "operators" cannot be totally avoided in our hypothetical
encoding.

However, it is possible to structure the encoding so that these operators
are only used sparingly, and are totally avoided in the majority of hanzi.

In the context of an encoding system, U+5504 (唄) could be taken as the
"normal" combination, on the basis that the first component 口 ("mouth") is
in its most frequent position. So, this hanzi could be encoded by a plain
sequence of its two components: 口 + 貝 ("mouth" component + "shell"
component).

The computer would be able to draw a 2-component hanzi following this
algorithm:

1. Set the drawing area to the whole hanzi square. Set the current component
to the first component.

2. Draw the "combining" version of the current component in the drawing
area (in the example, a small 口 "mouth" that occupies slightly less than
the left half of the hanzi square).

3. Set the drawing area to the component's specific "free area" (in this
case, slightly more than the right half of the hanzi square).

4. Move on to the next component. If it is not the last one, go back to
step 2 above.

5. Draw the "final" (or "non combining") version of the current component
(in this case, a 貝 "shell") in the drawing area.

6. End.
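
The steps above can be sketched in a few lines of Python. This is only an
illustration under my own assumptions: the Rect type, the LAYOUT table and
the 0.45/0.55 split are hypothetical stand-ins for "slightly less than half"
and "slightly more than half", and nothing is actually drawn; the function
just reports which version of each component goes into which rectangle.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A rectangle in hanzi-square coordinates (0..1 on both axes)."""
    x: float
    y: float
    w: float
    h: float

    def sub(self, rel: "Rect") -> "Rect":
        """Map a rectangle given relative to this one into absolute coordinates."""
        return Rect(self.x + rel.x * self.w, self.y + rel.y * self.h,
                    rel.w * self.w, rel.h * self.h)


# Hypothetical per-component data: (box of the "combining" version, "free area"),
# both relative to the current drawing area. The numbers are assumptions.
LAYOUT = {
    "口": (Rect(0.0, 0.0, 0.45, 1.0), Rect(0.45, 0.0, 0.55, 1.0)),
}
# Assumed fallback for components missing from the table: plain left/right halves.
DEFAULT = (Rect(0.0, 0.0, 0.5, 1.0), Rect(0.5, 0.0, 0.5, 1.0))


def layout(components):
    """Return (component, version, rectangle) triples for a component sequence."""
    if not components:
        raise ValueError("at least one component is required")
    area = Rect(0.0, 0.0, 1.0, 1.0)            # step 1: the whole hanzi square
    placed = []
    for comp in components[:-1]:               # steps 2-4: all but the last
        box, free = LAYOUT.get(comp, DEFAULT)
        placed.append((comp, "combining", area.sub(box)))
        area = area.sub(free)                  # step 3: shrink to the free area
    placed.append((components[-1], "final", area))  # step 5: the last component
    return placed


for comp, version, rect in layout(["口", "貝"]):
    print(comp, version, rect)
```

Because each free area is expressed relative to the current drawing area, the
same loop would also handle sequences of three or more components.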

The case of U+54E1 員 would consequently be taken as an "exceptional"
combination, on the basis that the first component is *not* in its most
frequent position. The same sequence as above would be used in the encoding,
with the addition of a placement "operator". These operators would be a
small set of invisible control characters that influence the placement of
components. So, our example would be: 口 + TOP + 貝 ("mouth" component + "on
TOP of the next" operator + "shell" component).

The same algorithm as above would then be used to draw this other sequence,
but the sequence 口 + TOP ("mouth" component + "on TOP of the next" operator)
would be treated as if it were a single component.

The "combining" form of this on-the-fly component is identical to the
"final" form of 口 ("mouth"), but squeezed into the top half of the drawing
area, and its "free area" (see step 3 above) is the bottom half of the
area.
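
To make the role of the operator concrete, here is a rough Python sketch of
how a component followed by TOP could be fused into one on-the-fly unit
before layout. Everything in it is an assumption made for illustration: the
"<TOP>" token stands in for the invisible control character, the Rect type
and the 0.45/0.55 side-by-side split are hypothetical, and only the
top-half/bottom-half measures come from the description above.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A rectangle in hanzi-square coordinates (0..1 on both axes)."""
    x: float
    y: float
    w: float
    h: float

    def sub(self, rel: "Rect") -> "Rect":
        """Map a rectangle given relative to this one into absolute coordinates."""
        return Rect(self.x + rel.x * self.w, self.y + rel.y * self.h,
                    rel.w * self.w, rel.h * self.h)


TOP = "<TOP>"  # hypothetical placement operator, written here as a visible token

# (box of the drawn component, free area left for the rest), relative to the area.
SIDE_SPLIT = (Rect(0.0, 0.0, 0.45, 1.0), Rect(0.45, 0.0, 0.55, 1.0))  # assumed default
TOP_SPLIT = (Rect(0.0, 0.0, 1.0, 0.5), Rect(0.0, 0.5, 1.0, 0.5))      # fixed halves


def group(tokens):
    """Fuse every 'component, TOP' pair into a single (component, on_top) unit."""
    units = []
    for tok in tokens:
        if tok == TOP:
            if not units or units[-1][1]:
                raise ValueError("TOP must directly follow a component")
            units[-1] = (units[-1][0], True)
        else:
            units.append((tok, False))
    return units


def layout(tokens):
    """Return (component, version, rectangle) triples for an encoded sequence."""
    units = group(tokens)
    if not units or units[-1][1]:
        raise ValueError("the sequence must end with a component")
    area = Rect(0.0, 0.0, 1.0, 1.0)
    placed = []
    for comp, on_top in units[:-1]:
        box, free = TOP_SPLIT if on_top else SIDE_SPLIT
        # With TOP, the "final" shape is used, squeezed into the top half.
        version = "final, squeezed" if on_top else "combining"
        placed.append((comp, version, area.sub(box)))
        area = area.sub(free)
    placed.append((units[-1][0], "final", area))
    return placed


print(layout(["口", "貝"]))        # the "normal" combination, as in 唄
print(layout(["口", TOP, "貝"]))   # the "exceptional" combination, as in 員
```

The drawing loop itself is unchanged; only the grouping pass knows about the
operator, which is why the operators can stay a small set.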

The fixed measures determined by operators (top *half*, bottom *half*) are
not always optimal (they are not in this case, for instance), so
high-quality fonts would contain a large set of "ligature glyphs" that do
not correspond to single encoding components.

I will skip the other cases mentioned by Thomas and go directly to the
Japanese example: U+4F1A and U+4F1D (会 and 伝).

The same process described above could also be used in this case, the only
difference being that the "combining" and "final" versions of the 人
("person") component differ much more in appearance than the two versions
of the 口 ("mouth") component.

However, in this case, it could be a better idea to consider the two shapes
as *different* components, and call them, e.g., "side person" and "top
person". I ask the linguists on the list please not to lynch me for this
blasphemy. (There are already the Unicoders who would lynch me and others
for what we are discussing here. :-)

I insisted on this example because I would like to stress that a system
such as the one we are talking about doesn't necessarily have to be correct
from the etymological point of view, nor does it necessarily have to take
into account the actual "meaning" or "sound" of the components.

Rather, it has to be economical and understandable. The 人 ("person")
component is so common in both positions, and the two shapes are so
different, that artificially separating the two saves "operators" and
avoids misunderstanding.

Similarly, I would often unify two or more components that share an
identical shape, even if they are historically unrelated to each other.

_ Marco

P.S.: People who are not familiar with Unicode may be quite puzzled by
references like "U+54E1", so allow me to explain that this is a notation used
to refer to a Unicode character. "U+" is just a prefix that says "Unicode
character", and "54E1" is the relevant character code (here, a 4-digit
hexadecimal number). This practice is used because it is still quite
complicated to use real Chinese characters in e-mails. See
http://www.unicode.org/charts/unihan.html for instructions on how to see
on-line images of these characters.
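
For those who prefer to see it done, the notation can be converted with a
couple of lines of Python. The two function names are my own and purely
illustrative; chr(), ord() and int() are ordinary Python built-ins.

```python
def from_u_notation(ref: str) -> str:
    """Turn a reference such as 'U+54E1' into the character it names."""
    if not ref.upper().startswith("U+"):
        raise ValueError(f"not a U+ reference: {ref!r}")
    return chr(int(ref[2:], 16))  # the digits after 'U+' are hexadecimal


def to_u_notation(ch: str) -> str:
    """Turn a single character into its 'U+XXXX' reference."""
    return f"U+{ord(ch):04X}"     # at least four uppercase hexadecimal digits


print(from_u_notation("U+54E1"))  # 員
print(to_u_notation("会"))         # U+4F1A
```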
