Re: Theory of transliteration?

--- In qalam@yahoogroups.com, "Peter T. Daniels" <grammatim@...> wrote:

> (preposing chevrons requires inserting hard line breaks as well)

Matter of taste. I'm happy if the paragraphs are left intact. A lot
depends on the mailer.

I've also included my reply to
http://tech.groups.yahoo.com/group/qalam/message/6695 here.

> Peter T. Daniels grammatim@...

>> From: Richard Wordingham <richard@...>

>> --- In qalam@... com, "Peter T. Daniels" <grammatim@ ..> wrote:

>> It did raise one interesting question. How are casing scripts
>> transliterated into non-casing scripts?

> MAYBE WITH A NOTE THAT SENTENCES AND PROPER NAMES BEGIN WITH A
> CAPITAL? FOR ENGLISH, PROPER ADJECTIVES ALSO, FOR GERMAN COMMON NOUNS
> ALSO

That doesn't enable you to recover the original capitalisation of
'diesel' or 'doppler'. There are also ambiguous cases like 'the
Queen' and the third person pronouns for the divinity.
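To make the loss concrete, here is a minimal sketch in Python. Case folding sends 'Diesel' and 'diesel' to the same string, so no general rule can bring the capital back, whereas an explicit marker can. The '^' marker and both function names are invented for this illustration, and the sketch assumes the source text contains no '^' and only letters whose upper and lower forms round-trip cleanly.

```python
MARK = "^"  # hypothetical "next letter was a capital" marker


def to_caseless(text):
    """Fold to lower case, recording each capital with an explicit marker."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(MARK + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


def from_caseless(text):
    """Undo to_caseless: a marker capitalises the letter that follows it."""
    out = []
    capital_next = False
    for ch in text:
        if ch == MARK:
            capital_next = True
        elif capital_next:
            out.append(ch.upper())
            capital_next = False
        else:
            out.append(ch)
    return "".join(out)


# Plain folding is lossy: both spellings collapse to one form.
print("Diesel".lower() == "diesel".lower())

# The marked form keeps them apart and can be reversed.
sample = "the Queen and a Diesel engine"
print(to_caseless(sample))
print(from_caseless(to_caseless(sample)) == sample)
```

The marker is, of course, exactly the sort of extra symbol that a note about sentences and proper names was meant to avoid.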

> WHY WOULD YOU WANT TO EXCLUDE THE WORLD'S HANDFUL OF SAMARITANS FROM
> READING SOMETHING IN ENGLISH? SAMARITAN IS MERELY A GRAPHIC VARIANT OF
> HEBREW, SO THE CONVENTIONS ALREADY EXIST.

I was excluding freemasons who are native English speakers.

> A _TRANSCRIPTION_ TELLS YOU EXACTLY HOW TO PRONOUNCE IT.

Is there a technical term (as opposed to a derogatory term) for
something that resembles a transcription but drops tone marks, vowel
length, and merges a pair of vowels and a pair of consonants?

> A _TRANSLITERATION_ TELLS YOU EXACTLY HOW TO SPELL IT.

>> > A transliteration is a 1-to-1 correspondence between the characters
>> > of one script and the characters of another script.

>> I thought the key feature of a transcription was that it was
>> reversible, i.e. you could get back to the original. (I would allow
>> tagging used for 'context-sensitive' information to be lost.)

> ABSOLUTELY NOT.

> A TRANSLITERATION, HOWEVER, IS PERFECTLY REVERSIBLE.

Agreed. I mistakenly wrote 'transcription' for
'transliteration'.
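Reversibility in this sense can be checked mechanically. The following is a minimal sketch in Python, using a made-up table that tags the Thai kh letters with numeric indices. The roman tokens and the function names are assumptions for the example, not a published standard.

```python
# Hypothetical table: Thai kh-like consonants tagged with numeric indices.
TABLE = {
    "\u0e02": "kh1",  # kho khai
    "\u0e03": "kh2",  # kho khuat
    "\u0e04": "kh3",  # kho khwai
    "\u0e05": "kh4",  # kho khon
    "\u0e06": "kh5",  # kho rakhang
}

# A lossy variant that writes every one of them as plain "kh".
LOSSY = {ch: "kh" for ch in TABLE}


def is_reversible(table):
    """True if no two source symbols share a roman token."""
    return len(set(table.values())) == len(table)


def transliterate(text, table):
    """Map each source symbol to its token, keeping one token per symbol."""
    return [table[ch] for ch in text]


def restore(tokens, table):
    """Invert a reversible table to get the original symbols back."""
    inverse = {tok: ch for ch, tok in table.items()}
    return "".join(inverse[tok] for tok in tokens)


word = "\u0e02\u0e04"
print(is_reversible(TABLE), is_reversible(LOSSY))
print(restore(transliterate(word, TABLE), TABLE) == word)
```

Keeping the output as a list of tokens sidesteps the separate question of how to segment a run of roman letters back into tokens.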

>> But there are implicit [syllable-boundary] markers! A preposed
>> vowel only occurs at the start of a syllable and sara a only at the
>> end. If one moves the preposed vowel to its phonetic position, one
>> needs a way of recovering its position. If one does not have one,
>> one has neither a transliteration nor a faithful transcription.

> IF THAT PREPOSED VOWEL CAN ONLY OCCUR AT THE START OF A SYLLABLE,
> THEN THERE IS NO AMBIGUITY. IF THE SAME MARK HAS A DIFFERENT FUNCTION
> IF IT DOESN'T APPEAR IN SYLLABLE-INITIAL POSITION, THEN IT WILL HAVE A
> DISTINCTIVE TRANSLITERATION (SEE ABOVE).

You may have misunderstood me. In Thai (and, to a large extent, in
Lao) the consonants of an akshara are simply written in sequence with
no explicit indication of the end of what in Brahmi would be a
consonant stack. There may be an implicit indication, for tone marks,
the vowel marks that go above, below or to the right, and the
shortness mark (mai taikhu) all associate with the final consonant of
the akshara. Similarly, the start is only indicated by its being
preceded by a preposed vowel. A syllable may be composed of one or
two aksharas; the second has no vowel or tone marks, but may have a
silencing mark on the final consonant. (In loans from English, it may
also be on the first consonant.) A final complication is that the
last consonant of a syllable may also be the first consonant of the
next syllable - 'double action'.

In most Indic scripts, an akshara may straddle a syllable boundary -
this does not happen in Thai unless you count double acting consonants.

I am ignoring what may be regarded as anaptyctic vowels - they give
Thai an element of the sesquisyllabic structure found in Mon and Khmer
and currently vanishing from Cham.

Splitting Thai text into syllables can generally be done by natural
intelligence, but it is a computationally insoluble problem, at least
if the input is no more than isolated words. 90% accuracy is not too
difficult to achieve without recourse to dictionaries.

>> a 1-to-1 correspondence of characters raises the question of what you
>> do with preposed vowels and multi-part vowels in Indic scripts. There

> IN INDIC, THERE IS ONLY ONE POSSIBLE PLACEMENT OF EACH VOWEL MARK,...

Thank you for that clarification. I'd been wondering whether CVC and
CCV aksharas could be written differently when the vowel went above
and the second consonant ascended to the baseline from which the
consonants hang. I know of a Khmer font where they are displayed
differently, but I wasn't sure whether that was a design flaw.

I have seen a difference in anusvara placement in a CVCV akshara.
I've taken this as unwelcome evidence that in that writing style the
sole such combination of anusvara and final vowel was a vowel symbol
in its own right.

> ... EVEN THOUGH THE VOWEL DESIGNATED BY SUCH A VOWEL MARK ALWAYS
> GOES AFTER THE LAST CONSONANT IN ITS GROUP. HENCE NO AMBIGUITY IN
> EITHER DIRECTION.

The problem most clearly lies in the Indic 'o' vowel. In the Bengali
script, South Indian scripts and many SE Asian scripts, it is composed
of the preposed symbol for 'e' and the postposed symbol for 'a:'
(basically the length mark). An extra mark above, variously
interpreted, turns it into the 'au' vowel. For the two-part compound
vowel, do we have one symbol or two symbols? Unicode gives different
answers for different scripts:

one character: Khmer, Limbu
two characters: Myanmar, Thai, (Lao)
up to you: Bengali, Kannada, Malayalam, Oriya, Sinhala, Balinese

I've bracketed Lao because that only has a three part symbol.
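The difference can be seen in the Unicode character data. As a small sketch in Python (the helper name `parts` is mine), the standard `unicodedata` module reports a canonical decomposition for the Bengali sign and none for the Khmer one, while Thai simply has no single character to ask about.

```python
import unicodedata

SIGNS = {
    "Bengali": "\u09cb",  # BENGALI VOWEL SIGN O
    "Khmer": "\u17c4",    # KHMER VOWEL SIGN OO
}


def parts(ch):
    """Return the canonical decomposition of ch as a list of characters."""
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return []  # nothing, or only a compatibility decomposition
    return [chr(int(cp, 16)) for cp in field.split()]


for script, ch in SIGNS.items():
    names = [unicodedata.name(p) for p in parts(ch)]
    print(script, unicodedata.name(ch), names or "no decomposition")

# Thai: the vowel is always spelt as two characters around the consonant.
thai = "\u0e40\u0e01\u0e32"  # sara e, ko kai, sara aa
print([unicodedata.name(c) for c in thai])
```

For the 'up to you' scripts the two spellings are canonically equivalent, so normalisation settles which one is stored.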

>> positioning for Thai (e.g. homographs such as _peelaa_ 'appointed
>> time' and _plao_ 'axle').

(Whoops! Should have remembered the aspiration - 'ph', not 'p'.
Incidentally, it's written with Indic <b>.)

> IF THEY'RE HOMOGRAPHS IN THAI, THEN THEIR TRANSLITERATIONS WILL BE
> IDENTICAL.

This is where an additional problem arises in Thai. In _pheelaa_ the
two vowel symbols belong to different aksharas; in _phlao_ they
together form this 'o' vowel, which is normally transcribed as 'ao' for
Thai, and _phlao_ is a single akshara.
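For concreteness, here is the written form as a sketch in Python. The roman tokens are a hypothetical symbol-for-symbol table, chosen only for the example. Both words are the same four characters in the same order, so a strictly symbol-by-symbol transliteration has nothing with which to tell them apart.

```python
import unicodedata

WORD = "\u0e40\u0e1e\u0e25\u0e32"  # the spelling shared by both words

# Hypothetical one-symbol-to-one-token table, for illustration only.
ROMAN = {
    "\u0e40": "e",   # sara e (preposed)
    "\u0e1e": "ph",  # pho phan
    "\u0e25": "l",   # lo ling
    "\u0e32": "aa",  # sara aa
}


def transliterate(text):
    """Replace each symbol by its token, keeping the written order."""
    return [ROMAN[ch] for ch in text]


for ch in WORD:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(transliterate(WORD))
```

Whether the first and last symbols form one vowel or belong to two aksharas is exactly the information this output does not carry.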

However, are you prohibiting the reordering of symbols? That's a bit
difficult when some scripts advance characters in two directions,
possibly sometimes even with branching.

> IF IN THAI THE SAME SYMBOL MEANS DIFFERENT THINGS WHEN (SAY) BEFORE
> OR AFTER AN AKSHARA, THEN IT WILL COUNT AS DIFFERENT SYMBOLS AND HAVE
> DIFFERENT TRANSLITERATIONS.

No such issue.

>> > You seem to have been asking about a transcription of Thai (and
>> > above of English), which yields the pronunciation of the language --
>> > typically, in phonemic terms.
>
>> > For transliterating Thai into roman (say), you could use a variety
>> > of diacritics to distinguish the khs, or you could go historical and
>> > use both gh and kh, or you could use numeric indices ...
>
>> I like the historical approach, but I feel uneasy about using <'b> and
>> <'d> for bo bai mai and do dek. It feels silly to write the
>
> HEY, I DIDN'T SAY ANYTHING ABOUT APOSTROPHES, THAT'S YOUR CONTRIBUTION!
>
> UNLESS, OF COURSE, YOU'RE REFERRING TO A STANDARD TRANSLITERATION
> THAT'S BEEN IN USE FOR A CENTURY, IN WHICH CASE, WHY IS THERE A
> QUESTION?

The problem is that Thai has added a fifth stop consonant (originally
preglottalised, now largely just voiced) to three of the vargas. I
can't find a standard way of handling this, and some of the older
Indic loans in Thai have replaced the original initial voiceless stop
by this fifth stop.

>> apostrophe when they are the final consonants of native words, even
>> though they are preglottalised as in much British English. (Not in
>> Australian English, though.) Also, writing <v> for fo fan is likely
>> to be misunderstood. I've got Griswold's book on order from the
>> library.

>
> IT COULD ONLY BE MISUNDERSTOOD IF MORE THAN ONE THAI SYMBOL IS
> TRANSLITERATED WITH <v>,

That's the problem. Thai wo wan (= Devanagari 'v') is sometimes
transliterated with 'w', sometimes with 'v'.

> TRANSLITERATIONS SPECIFICALLY DO _NOT_ CONVEY "IMPLICIT" INFORMATION.

>> transliterating unpointed Arabic or syllabification and phonetic vowel
>
> UNPOINTED ARABIC IS TRANSLITERATED WITHOUT SHORT VOWELS. IF YOU
> WANTED TO CONFUSE, YOU COULD TRANSLITERATE ALIF, WAAW, AND YAA' WITH a
> u i, BUT IT WOULD BE HARDER TO RECOGNIZE THE WORDS IF YOU DID SO.

>
>> For English a transliteration might choose
>> to differentiate homographs such as 'sow', 'lead' and 'read'. A
>
> THEN IT'S NOT A TRANSLITERATION.

Do you have a name for it?

>> secondary point is that it does allow one to reject sequences of
>> characters that do not appear in any way to be English. Practical

>> ?

>> examples from Thai include unrecognised combinations of vowel symbols
>> and misplaced tonemarks.

It may be argued that various combinations of Thai vowel marks
constitute single symbols. However, arbitrary combinations are not
permitted by Thai orthography - one may wish to define the domain of a
transliteration to exclude these combinations. Similarly certain
sequences of characters are 'incorrect' and will not work even in
sophisticated string matching operations, even though visually there
may appear to be nothing wrong with them. (Some rendering systems do
insert an error indicator - typically the dreaded dotted circles,
though for Thai I've seen an obliterating black square.)
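As a rough sketch of what restricting the domain might look like, the Python below flags two kinds of sequence: an above or below vowel mark with no consonant immediately before it, and a tone mark that follows neither a consonant nor such a vowel mark. These two rules are a deliberate simplification of my own, well short of the full orthography, and the names are invented for the example.

```python
# Letters U+0E01..U+0E2E (ko kai .. ho nokhuk).
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}
# Mai han-akat plus sara i .. sara uu: marks written above or below.
VOWEL_MARKS = {"\u0e31"} | {chr(c) for c in range(0x0E34, 0x0E3A)}
# Mai ek .. mai chattawa.
TONE_MARKS = {chr(c) for c in range(0x0E48, 0x0E4C)}


def problems(text):
    """Return (index, reason) pairs for marks with nothing valid to sit on."""
    found = []
    prev = ""
    for i, ch in enumerate(text):
        if ch in VOWEL_MARKS and prev not in CONSONANTS:
            found.append((i, "vowel mark without a consonant"))
        elif ch in TONE_MARKS and prev not in CONSONANTS | VOWEL_MARKS:
            found.append((i, "misplaced tone mark"))
        prev = ch
    return found


print(problems("\u0e01\u0e34\u0e48"))  # consonant, sara i, mai ek
print(problems("\u0e01\u0e48\u0e34"))  # same marks in the other order
print(problems("\u0e01\u0e34\u0e35"))  # two vowel marks on one consonant
```

The second string may well look the same as the first on screen, which is the point about sequences that are visually unobjectionable but still 'incorrect'.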

> THERE ARE EXACTLY AS MANY SYMBOLS IN THE TRANSLITERATION AS THERE
> ARE IN THE ORIGINAL (OF COURSE ONE SYMBOL MIGHT COMPRISE MORE THAN ONE
> UNIT, AS IN A TRANSLITERATION OF CHINESE SUCH AS PINYIN, WHERE MOST
> CHARACTERS ARE TRANSLITERATED WITH THREE LETTERS AND A NUMBER OR A
> DIACRITIC).

Deciding what is one symbol is not always so easy. Traditionally,
several of the Thai superscript vowels were not regarded as atomic:

nikkhahit (= anusvara) is atomic.
sara i (short high front vowel) is atomic.

sara ii (long high front vowel) was regarded as sara i plus a mark
whose name I forget.

sara ue (short high back unrounded vowel) was regarded as sara i plus
nikkhahit, and I have actually seen it used for this combination. The
position now is that in Unicode, one uses sara i and nikkhahit for
Pali, where it represents /ing/, and sara ue for Thai, where it is a
pure vowel.

sara uue (long high back unrounded vowel) was regarded as sara i plus
'rat's teeth'. Now this is interesting, for 'rat's teeth' is an
independent symbol in Khmer.
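The Unicode position on sara ue can be checked directly. In this small Python sketch (the constant and function names are mine), the single character and the sequence of sara i plus nikkhahit stay distinct under every normalisation form, so nothing in the encoding treats the one as a spelling of the other.

```python
import unicodedata

SARA_UE = "\u0e36"                 # THAI CHARACTER SARA UE
SARA_I_NIKHAHIT = "\u0e34\u0e4d"   # SARA I followed by NIKHAHIT


def same_after(form):
    """Compare the two spellings after the given normalisation form."""
    a = unicodedata.normalize(form, SARA_UE)
    b = unicodedata.normalize(form, SARA_I_NIKHAHIT)
    return a == b


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, same_after(form))

# No decomposition is recorded for sara ue at all.
print(repr(unicodedata.decomposition(SARA_UE)))
```

So a transliteration built on the encoded text sees one symbol for Thai and two for Pali, whatever the traditional analysis says.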

Now in Khmer, we have the same vowel symbol as Thai sara uue, and it
may well be that the Cambodians see it the same way as did the Thais.
Rat's teeth, on their own, convert a consonant from series 2
(originally voiced) to series 1 (originally voiceless). Now, the
sequence <rat's teeth, i> does not occur in Khmer, for it is
automatically replaced by, to use a historical transliteration, <kpias
krom, i>. 'Kpias krom' is a symbol identical to the vowel symbol for
Indic /u/, and substitutes for treisap (which converts from series 1
to series 2) in the same situation. Apart from one rare possibility
for which I lack data, one can always determine whether the apparent
vowel symbol represents a vowel, conversion from series 1 to series 2,
or conversion from series 2 to series 1. If you insist on one-to-one
conversion of symbols for transliteration, does this mean that the
significance cannot be resolved in transliteration, but must be
resolved by the reader? FWIW, Unicode requires the use of whichever
of the three characters is appropriate.

There is another, nastier case of 'dictionary-based' analysis in
Khmer. In Khmer, the subscript forms of Indic <t> (Series 1 /t/ in
Khmer) and Indic <.t> (Series 1 (implosive) /d/ in Khmer) are
identical. Unicode requires that they be encoded according to the
pronunciation. Now, may strict transliteration preserve this
distinction? (For computing, of course, one would want to preserve
this distinction, as the point is that one may wish to recover the
original encoding.)
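In encoded form the two subscripts are the coeng sign followed by different letters, as this short Python sketch shows (the variable names are mine). The strings differ even though the rendered subscripts do not, so a transliteration working from the encoded text could keep them apart, while one working from the page could not.

```python
import unicodedata

COENG = "\u17d2"           # KHMER SIGN COENG
SUB_TA = COENG + "\u178f"  # subscript form of KHMER LETTER TA
SUB_DA = COENG + "\u178a"  # subscript form of KHMER LETTER DA

for label, seq in (("ta", SUB_TA), ("da", SUB_DA)):
    print(label, [unicodedata.name(ch) for ch in seq])

# Distinct as data, and normalisation does not merge them.
print(SUB_TA == SUB_DA)
print(unicodedata.normalize("NFC", SUB_TA) == unicodedata.normalize("NFC", SUB_DA))
```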

The conservation of symbol count looks distinctly dodgy in
transliterating Indic scripts. The word 'Devanagari' is 8 symbols in
Devanagari - <d><e:><va><n><a:><ga><r><i:>. Are we seriously claiming
that 'va' and 'ga' constitute single symbols in the transliteration to
the Roman alphabet? And how does the transliteration of final <t,
virama> as 't' preserve the number of symbols?
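The counts are easy to verify if, purely as an assumption for this sketch, one Unicode code point is taken to stand for one symbol. In Python:

```python
import unicodedata

WORD = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"  # 'Devanagari'
ROMAN = "devan\u0101gar\u012b"  # precomposed a-macron and i-macron

for ch in WORD:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Eight symbols in the original, ten letters in the romanisation:
# the inherent vowels of 'va' and 'ga' have to be written out.
print(len(WORD), len(ROMAN))

# Final <t, virama> goes the other way: two symbols become one letter.
FINAL_T = "\u0924\u094d"
print(len(FINAL_T), len("t"))
```

Counting the macron letters as single symbols is generous to the roman side, and the totals still do not match.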

Matters get even worse if we consider the issues of forced half forms
and internal halants. If such are to be represented, as far as the
glyphs are concerned, we are using extra symbols to record the forced
shaping.

For your Chinese example, do you mean more than the pinyin
transcription? You would also need a disambiguating number for
homophones. Are there not variant readings even for Chinese? Would
you select one 'arbitrarily'?

What would be the Japanese analogue? Many (most?) kanji have variant
readings.

Richard.