Re: number of Chinese components (was Digest Number 442)

Stage Linguistique wrote:

> Marco wrote:
> > not to mention the 70,000+ Chinese "characters"
> > which could have been avoided with a "composing
> > radicals" approach).
>
> A Chinese character is written as <radical>+<something
> else>

Sorry, I used the word "radical" in a loose sense: I meant a "component" of
any kind: signific, phonetic, whatever.

> The number of <radical>'s is a limited set (214),
> but the number of <something else>'s is not.

Well, the existing hanzi *are* a closed set (although a very large one), so
how could the set of components from which hanzi are built not be a closed
set?

For that to be true, there would have to exist at least one hanzi composed
of an infinite number of components... :-)

> What you are envisioning would be either (a) impossible
> to implement

Why?

Components can combine within a hanzi in only a handful of ways, applied
recursively, such as:

- side by side
- stacked on top of each other
- enclosed in each other

Even Unicode's overabundant "Ideographic Description Characters" set
(http://www.unicode.org/charts/PDF/U2FF0.pdf) doesn't exceed a dozen
operators.
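
To make the "recursive" part concrete, here is a minimal sketch in Python of
reading such a description in prefix notation. It assumes the twelve operators
at U+2FF0..U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest
take two; the names parse_ids and count_components are mine, purely for
illustration, and anything that is not an operator is treated as an atomic
component.

```python
# Arity of each Ideographic Description Character in U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the other ten take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # left / middle / right
IDC_ARITY["\u2FF3"] = 3  # top / middle / bottom


def parse_ids(text, pos=0):
    """Parse one prefix-notation description starting at text[pos].

    Returns (tree, next_pos). A tree is either a single component
    character or a tuple (operator, operand, operand, ...).
    """
    if pos >= len(text):
        raise ValueError("description ends before all operands are given")
    ch = text[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos  # an atomic component
    operands = []
    for _ in range(IDC_ARITY[ch]):
        operand, pos = parse_ids(text, pos)
        operands.append(operand)
    return (ch, *operands), pos


def count_components(tree):
    """Count the atomic components (leaves) of a parsed description."""
    if isinstance(tree, str):
        return 1
    return sum(count_components(sub) for sub in tree[1:])


# "water" (U+6C35) to the left of ["old" (U+53E4) to the left of "moon" (U+6708)]
tree, _ = parse_ids("\u2FF0\u6C35\u2FF0\u53E4\u6708")
print(count_components(tree))
```

The whole grammar fits in one small recursive function, which is the point:
nesting adds depth to the tree, not new rules.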

Moreover, for many components you can define a default composition with
the next component, which would make it redundant to indicate the
composition type explicitly. E.g., the 4-stroke "dog radical" is always on
the left side of the following component; the "roof radical" is always on
top of the following component, etc.
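
A rough sketch of how such defaults could work, in Python. The table
DEFAULT_OPERATOR and the function make_explicit are hypothetical names of my
own, and the table holds only the two examples just mentioned (using the side
form U+72AD for "dog" and U+5B80 for "roof"); a real table would of course be
much longer.

```python
LEFT_RIGHT = "\u2FF0"   # operator: first operand to the left of the second
ABOVE_BELOW = "\u2FF1"  # operator: first operand above the second

# Hypothetical table of default placements, keyed by component.
DEFAULT_OPERATOR = {
    "\u72AD": LEFT_RIGHT,   # side form of the "dog" radical
    "\u5B80": ABOVE_BELOW,  # "roof" radical
}


def make_explicit(components):
    """Expand an operator-free component sequence into prefix notation.

    Every component except the last must have a default placement
    relative to whatever follows it; otherwise the sequence is ambiguous
    and an explicit operator would be needed.
    """
    if not components:
        raise ValueError("empty component sequence")
    head, rest = components[0], components[1:]
    if not rest:
        return head
    try:
        operator = DEFAULT_OPERATOR[head]
    except KeyError:
        raise ValueError(f"no default placement for U+{ord(head):04X}") from None
    return operator + head + make_explicit(rest)


# "dog" side form followed by U+53E5: the left-right operator is implied.
explicit = make_explicit("\u72AD\u53E5")
print([f"U+{ord(c):04X}" for c in explicit])
```

Components without a default would simply keep their explicit operator, so
the shortcut costs nothing where it does not apply.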

All in all, Chinese composition is not much more complex than Hangul
composition, and definitely simpler than the composing rules of several
Indic scripts.

As for the number of atomic components, the "significs" (or "radicals")
number a bit more than the classic 214 Kang Xi radicals, but they probably
don't exceed 300.

"Phonetics" and other components probably don't exceed 1,000 items (Wieger
counted only around 800 of them), but notice that many of them are already
counted as "significs", or are themselves decomposable.

> or (b) far too complex to use by end-users
> (how many Chinese use CangJie?)

Input methods are an entirely separate issue.

First of all, there is no reason why current hanzi-based input methods
could not also be adapted to a "decomposed encoding". The only difference
would be that a certain pinyin (cangjie, wubi, etc.) combination would
convert into a sequence of codes, rather than into a single code.
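
As an illustration, in Python: the lookup logic is the same either way, and
only the strings stored in the table change. Both tables below are made up for
the example, and the decomposed form assumes an "operator + components"
notation that is just one possible design.

```python
# Hypothetical candidate tables for the pinyin syllable "gou".
PRECOMPOSED = {"gou": ["\u72D7"]}             # a single code ("dog")
DECOMPOSED = {"gou": ["\u2FF0\u72AD\u53E5"]}  # operator + two components


def convert(keystrokes, table, choice=0):
    """Return the chosen candidate string for a keystroke sequence."""
    candidates = table.get(keystrokes)
    if not candidates:
        raise KeyError(f"no candidate for {keystrokes!r}")
    return candidates[choice]


for table in (PRECOMPOSED, DECOMPOSED):
    result = convert("gou", table)
    print(len(result), [f"U+{ord(c):04X}" for c in result])
```

The user types the same keys and picks from the same candidate list; the
length of what gets stored is invisible at the keyboard.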

(Notice that there are also word-based and sentence-based input methods,
although of course no word-based or sentence-based encoding for Chinese or
Japanese has ever existed. Also notice that, on cell phones, Chinese-like
input methods are also used for entering words in Western languages
although, of course, these words are encoded as sequences of letters.)

But, of course, you could also come up with a component-based input method.
Assuming a 25-key system such as cangjie, two-key sequences should be
quite enough to index most components unambiguously.
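
A quick check of the arithmetic, in Python, using only the 25-key figure from
this message; code_capacity is just a helper name for the example.

```python
def code_capacity(keys, max_length):
    """Number of distinct key sequences of length 1..max_length."""
    return sum(keys ** n for n in range(1, max_length + 1))


KEYS = 25
exactly_two = KEYS ** 2                 # two-key sequences only
up_to_two = code_capacity(KEYS, 2)      # one-key and two-key sequences
up_to_three = code_capacity(KEYS, 3)    # allowing a third key as well

print(exactly_two, up_to_two, up_to_three)  # 625 650 16275
```

So two keys give 625 codes (650 if single keys count too). Whether that covers
everything depends on how many distinct components remain once the overlap
between significs and phonetics and the decomposable ones are removed; the
remainder could take a third key, which leaves plenty of room.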

The benefit of a component-based encoding would be that it allows typing
unanticipated rare hanzi. The downside is, of course, that it also allows
the accidental typing of nonexistent hanzi.

_ Marco