Re: Automatic clustering of languages

From: Francesco Brighenti
Message: 48200
Date: 2007-04-02

--- In cybalist@yahoogroups.com, "mkelkar2003" <swatimkelkar@...>
wrote:

> Agreed that Sanskrit coding is not that good.

No, it's the sources for the lexical materials these Slovenian
researchers have used that are flawed, and this prevents them from
achieving any significant results by means of their "automatic
clustering" game. The key passage in their paper is, in my opinion,
the following one:

"The data were provided from a variety of sources such as native
speakers and dictionaries. However, ***transliterations were not
checked*** [emphasis mine -- FB]. The translations were not given by
experts; hence it is quite likely that there are several
inconsistencies present both in translations and in
transliterations. Obviously the choice of a particular method of
transliteration and translation may influence the outcome."

Since I am not a linguist, I will confine myself to pointing out to
the members of this List the flawed transliterations and choices of
terms in languages/dialects spoken in Italy (my own country) and
northern India (whose languages I know better than other IE
languages).

"BAD"
Bengali/Oriya <kharap>, Hindi/Rajasthani <kharab>, are Perso-Arabic
loans; why do the authors of the study regard them as native Indo-
Aryan words?

"BLACK"
Italian Venetii dialect "caif" is a nonexistent word (I am from that
part of Italy); I'd like to see the authors' source for this word!
Sanskrit <ka:la> 'black' is attested much later than <kr.s.n.a>,
which also means black; had the authors chosen the older term, they
would have seen that it nicely clusters with its Slavic cognates
such as <crn> etc.

"DRINK"
Oriya "pieeba" is a very bad spelling; the correct form should be
<piiba:> (there's a double /i/, not a long /i/ here; the spelling is
wrong because the authors' informant transliterated the second /i/
as "ee" in the English fashion!).
Sanskrit "peena" is also nonexistent; this verb is from Hindi, and
should be transliterated as <pi:na:>! Same for Rajasthani "peeno".

"EYE"
Oriya "ahkee" should be transliterated as <a:khi>!
Moreover, "aankh" is not Sanskrit, it's Hindi! (Same for "kaan",
which is Hindi, not Sanskrit, for 'ear'.) Couldn't these Slovenians
check a Sanskrit dictionary (even online!) before making these
attribution mistakes? And note that the Hindi forms they give for
the same meanings are transliterated, respectively, as "ankh"
and "kan" (as if they had a different vowel quantity from that of
the supposedly corresponding -- but actually wrong -- "Sanskrit"
forms! :^).
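
To make the point concrete, here is a minimal sketch, assuming the
clustering compares transliterated strings with some edit-type
distance (the measure actually used in the paper may differ). The
function name levenshtein is my own illustration and is not taken
from the paper. It only shows that two spellings of one and the same
Hindi word already come out as "different" words:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The same Hindi words as they appear under two language labels
# in the word lists discussed above (assumption: raw strings are compared).
pairs = [("aankh", "ankh"), ("kaan", "kan")]
distances = {pair: levenshtein(*pair) for pair in pairs}
```

Any nonzero value in distances is due to the spelling convention
alone, not to a real difference between the languages.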

"FIVE"
Why is the Italian Northern Lombardy dialectal term spelt as "chinq"
when the corresponding Italian form is correctly spelt as <cinque>?
The /c/ is palatal in either case...

"FOOT"
Sanskrit "pea'r" is a nonexistent word. I don't know where the
authors drew it from -- it appears to be the wrong transliteration
of some New Indo-Aryan word meaning 'foot'.

I am sure that, if the words from the other languages included in
this "Slovenian experiment" are scrutinized the same way, much of
their lexical data will be proved flawed.

But... Kelkar doesn't care about this... because his only aim is to
show that "Indic is not related to Iranian"!

Cheers,
Francesco