Re: [tied] Re: Automatic clustering of languages

From: Rick McCallister
Message: 48270
Date: 2007-04-05

Aren't Malay and Indonesian the same language?
Malaysians and Indonesians tell me they are.

--- Richard Wordingham <richard@...>
wrote:

> --- In cybalist@yahoogroups.com, "Daniel J. Milton"
> <dmilt1896@...> wrote:
>
> > OK, but accepting the words just as they find
> them, I still don't
> > see how they got their results.
> > The first cluster in table B is
> Maori-Persian-Finnish.
>
> > That's only half the list, but the other eight
> don't look any better.
> > Would someone explain why their computer finds
> this a cluster under
> > its insertion-deletion rules?
>
> Ward's method! At least, I can't find any evidence
> the method's been
> misprogrammed, but there are some weird effects.
>
> Most of us have been looking at the quality of the
> data and of the
> comparisons. We've overlooked how the entries are
> grouped.
>
> The clustering algorithms work by successively
> associating pairs of
> clusters, starting with clusters of a single
> element. When two
> clusters have been associated, the 'distance' of the
> combined cluster
> from the other clusters has to be calculated, and
> there are many ways
> of doing this.
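
To make the merge-and-recompute loop described above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the four labels and their distances are invented toy data, not figures from the paper, and `agglomerate` and `complete_link` are hypothetical names. The update rule shown is just one of the "many ways" (the maximum of the two old distances); any other rule can be passed in its place.

```python
from itertools import combinations


def agglomerate(names, dist, update):
    """Merge the two closest clusters until one remains; return the merges.

    dist maps frozenset({a, b}) -> distance between cluster labels a and b.
    update(d_ki, d_kj, d_ij, n_i, n_j, n_k) returns the distance from an
    outside cluster k to the cluster formed by merging i and j.
    """
    size = {name: 1 for name in names}
    d = dict(dist)
    merges = []
    while len(size) > 1:
        # Pick the closest pair among the current clusters.
        i, j = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        merged = f"({i},{j})"
        d_ij = d[frozenset((i, j))]
        # Recompute the distance from every other cluster to the merged one.
        for k in size:
            if k not in (i, j):
                d[frozenset((k, merged))] = update(
                    d[frozenset((k, i))], d[frozenset((k, j))], d_ij,
                    size[i], size[j], size[k])
        size[merged] = size.pop(i) + size.pop(j)
        merges.append((i, j, d_ij))
    return merges


def complete_link(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    """One possible update rule: keep the larger of the two old distances."""
    return max(d_ki, d_kj)


# Invented toy distances between four hypothetical languages A-D.
toy = {
    frozenset("AB"): 1, frozenset("CD"): 2, frozenset("AC"): 6,
    frozenset("AD"): 7, frozenset("BC"): 5, frozenset("BD"): 8,
}
merges = agglomerate("ABCD", toy, complete_link)
for step in merges:
    print(step)
```

Only the `update` argument changes from one clustering method to another, which is why the choice of method can reshuffle the groups without any change to the underlying data.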
>
> The chosen clustering method, Ward's method, has a
> strange effect. As
> nearby points are collected into a single cluster,
> the distance of the
> cluster from more distant points increases. One can
> think of it as
> 'warping' the distances. It's rather like the
> illustrations of
> general relativity in which masses increase
> distances in their
> vicinity compared to Newtonian distances.
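
The 'warping' can be seen in a small calculation. The sketch below assumes the usual Lance-Williams update for Ward's method in its squared-distance form; the software used for the paper may have applied a variant, so treat the numbers as illustrative only. The names `ward_update` and `distance_after_absorbing` are hypothetical. An outsider k sits at distance d from a point, and the point's cluster then absorbs further points that coincide with it, so nothing has actually moved.

```python
import math


def ward_update(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    """Lance-Williams update for Ward's method (squared-distance form)."""
    total = n_i + n_j + n_k
    squared = ((n_i + n_k) * d_ki ** 2
               + (n_j + n_k) * d_kj ** 2
               - n_k * d_ij ** 2) / total
    return math.sqrt(squared)


def distance_after_absorbing(extra_points, d=1.0):
    """Distance from outsider k to a cluster as it absorbs coincident points.

    Each absorbed point lies at distance d from k and at distance 0 from the
    cluster, so any growth in the result comes from the update rule alone.
    """
    current, n = d, 1
    history = [current]
    for _ in range(extra_points):
        current = ward_update(current, d, 0.0, n, 1, 1)
        n += 1
        history.append(current)
    return history


warped = distance_after_absorbing(4)
for members, dist in enumerate(warped, start=1):
    print(members, round(dist, 4))
```

Under these assumptions the distance from k grows with every absorbed point and approaches d times the square root of 2, while two untouched singletons keep their original distance. That is the mechanism by which late, still-isolated entries can come to look relatively close to one another.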
>
> For example, removing Indonesian from the languages
> while still using
> Ward's method suddenly allows Malay and Maori to be
> associated!
> Austronesian re-established! It has the knock-on
> effect of removing
> the clustering of Germanic and Romance.
>
> Using Ward's method, Maori, Persian and Finnish were
> all isolated
> until, with most of the linguistically valid
> clusters formed, they
> were some of the few languages with the original
> distances, which
> Ward's method's warping of distances suddenly made
> seem relatively close.
>
> I've looked at a number of other incremental
> clustering methods, just
> using the insertion/deletion metric. The two that
> gave the best
> results were the 'average link' - distance of two
> clusters is the
> average pairwise distance between elements of the
> two clusters - and
> 'McQuitty's method' - when two clusters are merged,
> the distance from
> the merged cluster is the average of the distances from the
> two constituents.
> 'Average link' reduces the ill effects of an
> aberrant family member,
> while McQuitty's method is little affected by adding
> a dialect to the
> data.
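
The difference between the two rules quoted above shows up as soon as one constituent has more members than the other. This is a rough sketch with invented distances; `average_link_update` and `mcquitty_update` are hypothetical names for the two update rules as described in the quoted text.

```python
def average_link_update(d_ki, d_kj, n_i, n_j):
    """Average link: weight each old distance by its cluster's size."""
    return (n_i * d_ki + n_j * d_kj) / (n_i + n_j)


def mcquitty_update(d_ki, d_kj):
    """McQuitty's method: plain mean of the two old distances."""
    return (d_ki + d_kj) / 2


# Invented distances from an outside language k to constituents i and j.
d_ki, d_kj = 4.0, 8.0

# Constituent i as a single language, then as a language plus two dialects.
single = average_link_update(d_ki, d_kj, n_i=1, n_j=1)
with_dialects = average_link_update(d_ki, d_kj, n_i=3, n_j=1)
unweighted = mcquitty_update(d_ki, d_kj)

print(single, with_dialects, unweighted)
```

With equal sizes the two rules agree. Once i carries extra dialects, average link pulls the result towards i's distance, whereas McQuitty's rule ignores the sizes altogether, which fits the remark that it is little affected by adding a dialect.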
>
> With 'average link', some further families appear -
> Austronesian
> (Malay, Indonesian and Maori), Indo-Iranian (Persian
> plus the Indic
> languages), and possibly Balto-Slavonic. The latter
> is not very
> clear, for it manages to include Hungarian! At the
> next level
> Balto-Slavonic unites with Romance, but that may
> have no significance.
>
> With McQuitty's method, matters are even better.
> Hungarian drops out
> of ?Balto-Slavonic - to be replaced by Greek!
> (Hungarian had clustered
> with Slavonic - Greek clusters with Baltic.) We see
> the following
> other families not visible with Ward's method:
>
> Celto-Italic, decomposing to Celtic plus Romance.
> Indo-Iranian, decomposing to Persian plus Indic.
> Austronesian
> Turanian! (= Hungarian, Turkish & Japanese)
>
> Germanic comes out oddly - the highest level split
> is Bavarian versus
> the rest!
>
> The 'diameter method' was not too bad. In this
> method, the distance
> between two clusters is the maximum distance between
> any pair of
> elements. Interlopers do appear in some of the
> groups, though, and
> Celto-Italic vanishes - Celtic becomes an
> independent cluster.
> ?Balto-Slavonic acquires Albanian, clustering with
> Baltic.
> Indo-Iranian looks suspect, for Swahili joins the
> Iranian branch!
> Germanic still has the Bavarian v. non-Bavarian as
> its highest level
> split.
>
> I don't know whether all this comes under the
> heading of having a poor
> mathematical model of the system. It does suggest
> that the paper's
> authors had a poor grasp of what different
> clustering methods would do
> for them.
>
> Richard.
>
>



