Re: Automatic clustering of languages

From: Richard Wordingham
Message: 48168
Date: 2007-04-01

--- In cybalist@yahoogroups.com, "mkelkar2003" <swatimkelkar@...> wrote:

> --- In cybalist@yahoogroups.com, "Richard Wordingham" <richard@> wrote:

> They claim that the results are similar to the more well known study by
> Kruskal, Dyen and Black.
>
> "We can mention that clusters we found with cluster analysis are very
> close to the
> language families established in linguistics (Kruskal, Dyen, and Black
> 1971)."

The clusters they found were Slavic, Germanic, Romance and Indic.
They say nothing about the clusters they didn't find. They should
have been disappointed by the failure to cluster Arabic and Maltese.
That puts the failure to pick up Indo-Iranian into context. Note also
that their performance with Celtic (Welsh and Irish) is not good.

> Tamil and Kannada are clustering with Hindi and Sanskrit! (rather
> than Persian)

Take another look! Look at the trees, not just at the orders in which
the languages are listed in the results. The Dravidian languages are
in the 'others' group. With the exception of the anomalous behaviour
of Telugu under the insertion/deletion/substitution metric, the
Dravidian languages, Tamil, Kannada, Telugu and Malayalam, form a
subcluster within the 'others' group.
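
For anyone unsure what an insertion/deletion/substitution metric measures, here is a minimal sketch of the usual edit distance between two word forms. It assumes unit cost for each operation and no normalisation by word length; the study's actual costs and normalisation are not given in this thread, so the function name and the example words are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (unit costs assumed)."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Illustrative pair only, not taken from the study's word lists.
print(edit_distance("night", "nacht"))
```

A clustering program would compute such distances over whole word lists and build a tree from the resulting language-to-language distance matrix, which is why the tree, not the listing order, is what shows the grouping.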

Incidentally, I'm not sure that it is reasonable to say that Sanskrit
was included in their list. Their Sanskrit list has several Hindi
forms in it, which results in the similarity of Hindi and Sanskrit
being overstated.

Did you notice that English comes out as a North Germanic language?

Richard.