Re: Automatic clustering of languages

From: Richard Wordingham
Message: 48217
Date: 2007-04-03

--- In cybalist@yahoogroups.com, "Francesco Brighenti" <frabrig@...>
wrote:

> > If you are just doing similarity comparisons on languages, there is
> > no justification for excluding loan words. The question is rather
> > whether they are now the 'typical' words for the meaning.

> Excuse my ignorance, but why should loan words, even those which have
> in course of time become typical for a given meaning, be included in
> these Swadesh-like lists created for the sake of historical
> comparison? As minimum, one should in this case consider carefully
> the time depth at which the loan took place (e.g.: was it in the
> prehistoric period? in the early historical period? three centuries
> ago? etc.).

If you are using Swadesh lists for glottochronology, then the answer
is simple - they are examples of word replacement and help date the
notional splits.
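
As a rough illustration of why every replacement matters for dating,
borrowed or not, here is a minimal sketch of the usual
glottochronological arithmetic, t = ln(c) / (2 ln(r)). The function
name is hypothetical, and the retention rate of 0.86 per millennium is
the figure conventionally quoted for the 100-word list, taken here as
an assumption rather than an established constant.

```python
import math


def divergence_time_millennia(shared_fraction, retention_rate=0.86):
    """Estimate time since the split, in millennia.

    shared_fraction: proportion of list meanings still expressed by
        cognate words in both languages (0 < c <= 1).
    retention_rate: assumed proportion of the list each language keeps
        per millennium (0.86 is an assumption, not a measured value).
    """
    if not 0 < shared_fraction <= 1:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    # Both languages lose vocabulary independently, hence the factor 2.
    return math.log(shared_fraction) / (2 * math.log(retention_rate))


# Each replaced word lowers the shared fraction and so pushes the
# estimated split further back, whether the replacement is a loan or not.
for cognates in (80, 74, 70):
    print(cognates, round(divergence_time_millennia(cognates / 100), 2))
```

Dropping loanwords from the count would raise the shared fraction and
so make the notional split look more recent than the method intends.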

For lexicostatistics, I would again say that they reflect the nature
of the vocabulary. For example, the English words chosen to define
the 100-word list include several North Germanic and Romance words -
the following at least:

North Germanic: bark, skin, egg, give
Romance: person, mountain, round

The verb 'die' might be of native origin, or a native word revitalised
by Danish influence. The IE collection of word lists (Dyen et al.)
marks 'bird' as a loanword; I don't know why. From its history and
phonetics, 'big' looks North Germanic, but the North Germanic cognates
are lacking. In my usage, native 'belly' survives only in set phrases
- I would normally use a word of Greek origin, 'tummy' or 'stomach', or
even Latin 'abdomen'. 'Breasts' is not the normal word in most Britons'
speech - Romance(?) 'tits' is the usual (plural) word, though a lot of
substitutes are used. Pushing things further back, 'path' is a
Scythian loan, and 'long' looks like a Celtic loan.

This is a fair reflection of the fact that English vocabulary has been
heavily influenced by North Germanic and Romance. If one wants to
exclude such features for some reason, it is generally better to use
an older form of the language.

> If one includes loan words like these in Swadesh-like lists such as
> those used by "our" Slovenian researchers, will English and Arabic
> cluster close to each other after they have put the data into their
> shaker? :^)

Well, 'die' and 'egg' probably helped English come out as aberrant
North Germanic rather than aberrant West Germanic. Other contributory
factors may have been the use of 'to' in the infinitives, possibly
better matching the 'att' etc. of the Scandinavian forms, as opposed
to the zero of the Dutch and German forms. The High German consonant
shift may also have helped with this misclassification. Note that in
the schemes presented, regular correspondences get no discount - each
word pays the full cost of the sound change!
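
The 'no discount' point can be sketched as follows. This is only a
minimal illustration, not the scheme the researchers actually used: it
assumes a plain Levenshtein distance on spellings, and the helper
`discount_initial_z` is a hypothetical stand-in for 'knowing' the
regular English t : German z correspondence produced by the High German
consonant shift.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def discount_initial_z(german_word):
    """Hypothetical helper: undo word-initial z (from earlier t)."""
    if german_word.startswith("z"):
        return "t" + german_word[1:]
    return german_word


# English / German spellings showing the same regular correspondence.
pairs = [("ten", "zehn"), ("two", "zwei"), ("tongue", "zunge")]

raw_total = sum(edit_distance(e, g) for e, g in pairs)
discounted_total = sum(
    edit_distance(e, discount_initial_z(g)) for e, g in pairs
)

# The single sound change is charged once per word in the raw total.
print(raw_total, discounted_total)
```

With word-by-word scoring, one regular change inflates the distance in
every word it touches, so a pervasive shift can make close relatives
look more distant than languages that merely share loans.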

Using spelling similarity with differences assessed word by word to
identify genetic relationships is an attempt to automate naïve mass
comparison. Need one say more?

Richard.