--- In qalam@yahoogroups.com, "Nicholas Bodley" <nbodley@...> wrote:
> That Web site <http://acharya.iitm.ac.in/index.html> defines an akshara,
> and as I understand it, starting from sounds of aksharas to render a given
> script works better than starting from Unicode elements and combining
> them. (Please hang on, for a bit!) They propose a large set of code
> points that permit rendering aksharas more directly, rather than by
> combining their elements. They state, iirc, that rendering text from
> Unicode elements can work well, but the text files used to create that
> text are likely to be almost useless for such processes as sorting,
> searching, and the like. The reason this site's ideas seem worthy of
> attention is that they propose a multilingual basis for creating text
> files that can be searched and sorted with reasonably-simple routines.
Many of the arguments presented are weak. While I can't comment on
ISCII, they are clearly not aware of how half-forms and mandatory
virama are represented in Unicode (by ZWJ and ZWNJ respectively, placed
after the virama). They imply that
<<t>><<e>><<r>><<virama>><<ind. vowel .r>><<t>><<i>> would be rendered
with a repha over the independent vowel. I can't force that
behaviour, and even in Khmer, where the corresponding mark is encoded
explicitly, I can only force a simulacrum with the consonant qa. (In
Khmer the independent vowels are so consonant-like that they can be
subscripts in conjuncts.) Even if the behaviour could occur, it could
always be prevented by inserting ZWNJ after the virama.
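
To make the ZWNJ point concrete, here is a minimal sketch in Python. It assumes Devanagari code points purely for illustration (the original argument does not name a script), and the helper name block_conjunct is my own invention. It only builds the character sequence; whether a given font and shaping engine then shows an explicit virama is outside what the code can demonstrate.

```python
import unicodedata

# Assumption: Devanagari is used here only as an illustrative script.
RA = "\u0930"          # DEVANAGARI LETTER RA
VIRAMA = "\u094D"      # DEVANAGARI SIGN VIRAMA
VOCALIC_R = "\u090B"   # DEVANAGARI LETTER VOCALIC R (independent vowel)
ZWNJ = "\u200C"        # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"         # ZERO WIDTH JOINER


def block_conjunct(text, virama=VIRAMA):
    """Insert ZWNJ after each virama that is followed by another character
    and is not already followed by ZWJ or ZWNJ (hypothetical helper)."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if ch == virama and i + 1 < len(text) and text[i + 1] not in (ZWJ, ZWNJ):
            out.append(ZWNJ)
    return "".join(out)


sequence = RA + VIRAMA + VOCALIC_R
guarded = block_conjunct(sequence)
for ch in guarded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Running the helper twice changes nothing the second time, so it is safe to apply to text that already contains joiners.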
The discussion on Tamil is almost totally invalidated because of a
belief that /hoo/ is encoded <<ee>><<h>><<aa>>, i.e. in visual order. It
is actually encoded in logical order, consonant first: <<h>><<ee>><<aa>>
(decomposed) or <<h>><<oo>> (composed). In particular, the primary
Unicode representations have been designed so that context can be
*ignored* when searching.
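
The Tamil encoding can be checked with Python's standard unicodedata module. This is only a sketch: the variable names and the contains() helper are mine, and it assumes that normalising both strings to NFC before comparing is an acceptable stand-in for a real search routine.

```python
import unicodedata

HA = "\u0BB9"  # TAMIL LETTER HA
EE = "\u0BC7"  # TAMIL VOWEL SIGN EE (drawn to the left of the consonant)
AA = "\u0BBE"  # TAMIL VOWEL SIGN AA (drawn to the right of the consonant)
OO = "\u0BCB"  # TAMIL VOWEL SIGN OO (canonically equivalent to EE + AA)

decomposed = HA + EE + AA
composed = HA + OO


def contains(haystack, needle):
    """Substring test after normalising both sides to NFC (hypothetical helper)."""
    return (unicodedata.normalize("NFC", needle)
            in unicodedata.normalize("NFC", haystack))


# The two spellings are canonically equivalent.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# The consonant is stored first, so a search for it needs no knowledge
# of the vowel sign that is displayed before it.
print(composed.index(HA))
print(contains(decomposed, composed))
```

Because the consonant always comes first in the stored text, a plain substring search for it works whichever vowel sign follows.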
The sorting order is not defined by the numerical order of code points;
it has to be tailored for each _language_ - for example English, Welsh
and Spanish have different sorting orders for combinations of lower
case unaccented letters! (For example, Welsh and Spanish both treat
double 'l' as a separate letter.) Having said that, the default
collation table for Unicode (see
http://www.unicode.org/reports/tr10/
for the concepts and
http://www.unicode.org/Public/UCA/latest/allkeys.txt for the details)
ought, I believe, to adjust the collation sequence of the Tamil
letters rather than just following the code points. It's been done
for the supplementary Canadian Aboriginal Syllabics. I don't think
there's any reason why it can't be done for Tamil.
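
As a rough illustration of language-specific tailoring, the Python sketch below sorts by Welsh alphabet position instead of by code point. It is not the Unicode Collation Algorithm: it has a single weight level, it greedily treats every matching letter pair as a digraph (which real Welsh text does not always justify), and the names WELSH_ALPHABET and welsh_key are my own.

```python
# Welsh alphabet in its conventional order; digraphs count as single letters.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i",
    "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th",
    "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}


def welsh_key(word):
    """Return a list of alphabet positions to use as a sort key.

    Assumption: greedy matching, so any two adjacent letters that spell a
    digraph are treated as that digraph.
    """
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])
            i += 2
        else:
            # Unknown characters sort after every Welsh letter.
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key


words = ["llan", "lwc", "lamp"]
by_code_point = sorted(words)
by_welsh_order = sorted(words, key=welsh_key)
print(by_code_point)
print(by_welsh_order)
```

Code-point order puts "llan" between the other two words, while the tailored key puts it last, because 'll' ranks after every word beginning with a plain 'l'.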
> As to the number of aksharas they propose, it could be maybe 13,000, but
> that seems essentially unnecessary; a practical set seems to be more like a
> few hundred.
They've tallied 800 conjunct forms, which is not unreasonable for
languages with 30-odd consonants. Multiply that by the number of
vowels - they say about 16 - and you reach roughly 13,000
(800 x 16 = 12,800). Is their encoding defined anywhere? It could
start getting very unwieldy if they have to add extra ligatures. I
wonder how amenable it is to searching for fractions of aksharas.
Richard.