Re: Automatic clustering of languages

--- In cybalist@yahoogroups.com, "Daniel J. Milton" <dmilt1896@...> wrote:

> OK, but accepting the words just as they find them, I still don't
> see how they got their results.
> The first cluster in table B is Maori-Persian-Finnish.

> That's only half the list, but the other eight don't look any better.
> Would someone explain why their computer finds this a cluster under
> its insertion-deletion rules?

Ward's method! At least, I can't find any evidence the method's been
misprogrammed, but there are some weird effects.

Most of us have been looking at the quality of the data and of the
comparisons. We've overlooked how the entries are grouped.

The clustering algorithms work by successively associating pairs of
clusters, starting with clusters of a single element. When two
clusters have been associated, the 'distance' of the combined cluster
from the other clusters has to be calculated, and there are many ways
of doing this.
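
As a rough illustration of that loop, here is a minimal sketch in Python.
The function name `agglomerate`, the nested-dict distance table and the
toy labels A, B and C are all made up for the example. The update rule
passed in (the plain mean of the two old distances) is only a placeholder
for one of the many possible choices.

```python
from itertools import combinations


def agglomerate(names, dist, update):
    """Repeatedly merge the two closest clusters.

    names  -- list of element labels
    dist   -- dist[a][b] is the distance for each pair a, b (a before b in names)
    update -- update(d_ik, d_jk, d_ij, n_i, n_j, n_k) gives the distance
              from the merged cluster i+j to another cluster k
    Returns the merges in order as (cluster_i, cluster_j, distance).
    """
    # Every element starts as a cluster of its own, stored as a tuple of members.
    clusters = [(n,) for n in names]
    d = {frozenset([(a,), (b,)]): dist[a][b] for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        d_ij = d[frozenset((i, j))]
        merged = i + j
        rest = [k for k in clusters if k not in (i, j)]
        # Recalculate the distance from the combined cluster to all the others.
        for k in rest:
            d[frozenset((merged, k))] = update(
                d[frozenset((i, k))], d[frozenset((j, k))],
                d_ij, len(i), len(j), len(k))
        clusters = rest + [merged]
        merges.append((i, j, d_ij))
    return merges


def mean_of_two(d_ik, d_jk, d_ij, n_i, n_j, n_k):
    # Placeholder update rule: ignore sizes and the merge distance.
    return (d_ik + d_jk) / 2


toy = {"A": {"B": 1, "C": 4}, "B": {"C": 6}}  # invented distances
merges = agglomerate(["A", "B", "C"], toy, mean_of_two)
print(merges)
```

Swapping in a different `update` function changes the clustering method
without touching the loop itself.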

The chosen clustering method, Ward's method, has a strange effect. As
nearby points are collected into a single cluster, the distance of the
cluster from more distant points increases. One can think of it as
'warping' the distances. It's rather like the illustrations of
general relativity in which masses increase distances in their
vicinity compared to Newtonian distances.
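
The effect can be read off the usual Lance-Williams recurrence for
Ward's method. This is a sketch under assumptions: the recurrence is
normally stated for squared Euclidean distances, and I am assuming the
software applies it to the insertion/deletion distances as they stand.
The symbols n_i, n_j, n_k (cluster sizes), D and d are mine.

```latex
% Clusters i and j (sizes n_i, n_j) are merged; k (size n_k) is any other cluster.
\[
d(i \cup j,\, k) =
  \frac{(n_i + n_k)\, d(i,k) + (n_j + n_k)\, d(j,k) - n_k\, d(i,j)}
       {n_i + n_j + n_k}
\]

% Special case: k is equally far (D) from i and j, which are close (d < D).
\[
d(i \cup j,\, k) = D + \frac{n_k\,(D - d)}{n_i + n_j + n_k} > D
\]
```

So merging two nearby clusters pushes the combined cluster further away
from a third cluster than either constituent was.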

For example, removing Indonesian from the languages while still using
Ward's method suddenly allows Malay and Maori to be associated!
Austronesian re-established! It has the knock-on effect of removing
the clustering of Germanic and Romance.

Using Ward's method, Maori, Persian and Finnish all stayed isolated
until most of the linguistically valid clusters had formed. By then
they were among the few languages still at their original distances,
which Ward's method's warping of the other distances suddenly made
seem relatively close.
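
A toy calculation shows how two isolates can end up paired. The
distances and the labels P, Q, X and Y are invented, not taken from the
paper, and `ward_update` assumes the usual Lance-Williams form of Ward's
update applied directly to the distances.

```python
def ward_update(d_ik, d_jk, d_ij, n_i, n_j, n_k):
    """Distance from the merged cluster i+j to k (Lance-Williams form of Ward)."""
    total = n_i + n_j + n_k
    return ((n_i + n_k) * d_ik + (n_j + n_k) * d_jk - n_k * d_ij) / total


# Invented distances: P and Q are close relatives; X and Y are isolates,
# each nearer to P and Q (10) than to each other (11).
d_PQ, d_far, d_XY = 1.0, 10.0, 11.0

# Step 1: P and Q merge first, since 1 is the smallest distance.
# Step 2: recalculate the distance from the merged cluster PQ to each isolate.
d_PQ_X = ward_update(d_far, d_far, d_PQ, 1, 1, 1)
d_PQ_Y = ward_update(d_far, d_far, d_PQ, 1, 1, 1)

# Step 3: the next merge is whichever candidate pair is now closest.
candidates = {"PQ+X": d_PQ_X, "PQ+Y": d_PQ_Y, "X+Y": d_XY}
next_merge = min(candidates, key=candidates.get)
print(candidates, next_merge)
```

The distance from PQ to each isolate rises from 10 to 13, so X and Y,
still at their original 11, are merged with each other instead.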

I've looked at a number of other incremental clustering methods, just
using the insertion/deletion metric. The two that gave the best
results were the 'average link' - distance of two clusters is the
average pairwise distance between an element of one cluster and an
element of the other - and 'McQuitty's method' - when two clusters are
merged, the distance from the merged cluster is the average of the
distances from the two constituents. 'Average link' reduces the
ill effects of an aberrant family member,
while McQuitty's method is little affected by adding a dialect to the
data.
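
The difference between the two update rules can be sketched as follows.
The function names and all the numbers are invented for illustration;
K stands for some other cluster whose distance is being recalculated.

```python
def average_link(d_ik, d_jk, n_i, n_j):
    """Size-weighted mean, which keeps the average over all element pairs."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)


def mcquitty(d_ik, d_jk):
    """Unweighted mean of the two constituents, whatever their sizes."""
    return (d_ik + d_jk) / 2


# Aberrant member: a family of 4 at distance 10 from K absorbs one odd
# member lying at distance 2 from K.
avg_aberrant = average_link(10.0, 2.0, 4, 1)
mcq_aberrant = mcquitty(10.0, 2.0)

# Added dialect: language X (distance 6 from K) merges with language Y
# (distance 10 from K), first on its own and then with a dialect beside it.
avg_plain = average_link(6.0, 10.0, 1, 1)
avg_dialect = average_link(6.0, 10.0, 2, 1)
mcq_plain = mcquitty(6.0, 10.0)
mcq_dialect = mcquitty(6.0, 10.0)  # cluster sizes play no part

print(avg_aberrant, mcq_aberrant)
print(avg_plain, avg_dialect, mcq_plain, mcq_dialect)
```

With average link the odd member moves the family much less than with
McQuitty's rule, while the extra dialect shifts the average-link
distance and leaves McQuitty's unchanged.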

With 'average link', some further families appear - Austronesian
(Malay, Indonesian and Maori), Indo-Iranian (Persian plus the Indic
languages), and possibly Balto-Slavonic. The last is not very
clear, for it manages to include Hungarian! At the next level
Balto-Slavonic unites with Romance, but that may have no significance.

With McQuitty's method, matters are even better. Hungarian drops out
of ?Balto-Slavonic - to be replaced by Greek! (Hungarian had clustered
with Slavonic - Greek clusters with Baltic.) We see the following
other families not visible with Ward's method:

Celto-Italic, decomposing to Celtic plus Romance.
Indo-Iranian, decomposing to Persian plus Indic.
Austronesian
Turanian! (= Hungarian, Turkish & Japanese)

Germanic comes out oddly - the highest level split is Bavarian versus
the rest!

The 'diameter method' was not too bad. In this method, the distance
between two clusters is the maximum distance between any pair of
elements, one taken from each cluster. Interlopers appear in some of
the groups, though, and Celto-Italic vanishes - Celtic becomes an
independent cluster. ?Balto-Slavonic acquires Albanian, clustering
with Baltic. Indo-Iranian looks suspect, for Swahili joins the Iranian
branch!
Germanic still has the Bavarian v. non-Bavarian as its highest level
split.
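
For comparison with the other rules, the diameter method can be
sketched in the same style. The names and distances below are invented;
the point is only that taking the larger of the two old distances at
each merge gives the same answer as the pairwise definition.

```python
def diameter_distance(cluster_a, cluster_b, dist):
    """Largest distance between an element of one cluster and one of the other."""
    return max(dist[frozenset((a, b))] for a in cluster_a for b in cluster_b)


def diameter_update(d_ik, d_jk):
    """Incremental form: distance from the merged cluster i+j to k."""
    return max(d_ik, d_jk)


# Invented distances between three elements a, b and c.
dist = {
    frozenset(("a", "b")): 2.0,
    frozenset(("a", "c")): 5.0,
    frozenset(("b", "c")): 9.0,
}

# Distance from the merged cluster (a, b) to c, computed both ways.
direct = diameter_distance(("a", "b"), ("c",), dist)
incremental = diameter_update(dist[frozenset(("a", "c"))],
                              dist[frozenset(("b", "c"))])
print(direct, incremental)
```

A single distant member sets the distance for its whole cluster under
this rule.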

I don't know whether all this comes under the heading of having a poor
mathematical model of the system. It does suggest that the paper's
authors had a poor grasp of what different clustering methods would do
for them.

Richard.