Browsing through the Unicode mailing list, I noticed a remark by Peter
Constable that the Thai collation order had been changed by removing
its dependency on how a word was pronounced. I'd like to hear more
about this change, e.g. its date.

The Thai script is a South Indian script, and like others, e.g. Tamil,
it has many vowel symbols that are written before the consonant. When
sorting Thai, one implicitly swaps these 'preposed' vowels with the
following consonant, and then, ignoring tone marks and the like, does
a 'lexicographic' sort as for English. (The consonants come before
the vowels.)
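As a rough illustration of this simple rule, here is a Python sketch.
It is only a sketch under my own assumptions: the name simple_thai_key
is made up, 'tone marks and the like' is taken to mean U+0E47..U+0E4C,
and plain code point order stands in for the consonant and vowel order
(the Thai consonants do precede the vowels in Unicode).

```python
# Preposed vowels: SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI.
PREPOSED_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")

# Assumption: marks ignored at this level are MAITAIKHU (U+0E47),
# the four tone marks (U+0E48..U+0E4B) and THANTHAKHAT (U+0E4C).
IGNORED_MARKS = set("\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c")


def simple_thai_key(word):
    """Hypothetical sort key: drop ignored marks, then swap each
    preposed vowel with the character that follows it."""
    chars = [c for c in word if c not in IGNORED_MARKS]
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED_VOWELS:
            # Move the vowel to after the following consonant.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


words = [
    "\u0e02\u0e32",        # KHO KHAI + SARA AA
    "\u0e44\u0e01\u0e48",  # SARA AI MAIMALAI + KO KAI + MAI EK
    "\u0e41\u0e01",        # SARA AE + KO KAI
    "\u0e01\u0e32",        # KO KAI + SARA AA
]
ordered = sorted(words, key=simple_thai_key)
```

With this key the three words beginning with KO KAI sort together,
ahead of the one beginning with KHO KHAI, whereas a plain code point
sort would put both words written with a preposed vowel last.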

I have seen hints of an algorithmically more complex system, whereby
one swaps the preposed vowel to after the whole of the initial
consonant cluster (ignoring any 'anaptyctic' vowel), and compares
syllable by syllable. I think the comparison of syllables proceeds by
comparing initial consonants, vowel cluster, final consonant, and then
tone mark, but I am not sure of the details. This can't be done
without a knowledge of the pronunciation of the word, for clusters are
not marked in Thai. Thus <ae><h><n> can be /hE:n_R/ or /nE:_R/, and
more importantly, <e><ph><l><a:> can be /phlau_M/ 'axle' or /phe:_M
la:_M/ 'time'. There are also examples where consonant clusters in
the middle of a word cannot be correctly split between syllables
without knowing the pronunciation of the word. This more complex
system is, I presume, the one that was replaced. My description may
be wrong - I have had to reconstruct it from hints and misdescriptions
of the current system.
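
To show why such a scheme needs the pronunciation, here is a Python
sketch of the syllable-by-syllable comparison as I have guessed it.
Everything in it is an assumption: the Syllable type and syllable_key
are invented names, the syllabification is supplied by hand, and the
romanised pieces are compared as plain ASCII strings, which is not the
real Thai order.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    # Hypothetical fields, in the comparison order guessed above.
    initial: str  # initial consonant or cluster
    vowel: str    # vowel cluster
    final: str    # final consonant, "" if none
    tone: str     # tone, compared last


def syllable_key(reading):
    """Hypothetical key: the list of syllable tuples, so words are
    compared syllable by syllable and field by field."""
    return [tuple(s) for s in reading]


# The two readings of the one spelling <e><ph><l><a:>.
axle = [Syllable("phl", "au", "", "M")]
time = [Syllable("ph", "e:", "", "M"), Syllable("l", "a:", "", "M")]

ordered = sorted([("axle", axle), ("time", time)],
                 key=lambda pair: syllable_key(pair[1]))
```

The two readings get different keys although they are spelt the same,
so a sort of this kind cannot be computed from the written form alone.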