Re[2] Computational Historical Linguistics

I'm not an expert in computational linguistics, but I am in computational economics, and I think there are some useful parallels.

Up until the 1960s, the science of economics had amassed a huge volume of folklore. Most of this was developed on the basis of anecdotal evidence or by extrapolation from existing trends, often by laborious processes involving piles of newspapers and slide rules. The problem is that it's easy, even for an expert, to "spot" spurious trends in random data.

In about 1960, the statisticians got going, with newly invented computational techniques. They found that many supposedly sure-fire ways of making profits were not in fact the money machines they had been made out to be. On the face of it, this rigorous approach had discredited much of what was previously thought to be gospel. Of course, there was plenty of fair criticism of the statisticians. Their computer models were, initially, unsophisticated, and missed out a whole raft of extra information that an investment professional would have at their fingertips.

Now, 40 years later, there is something of a meeting of the minds. The computer models are clever enough to take into account most of what economists thought was important. That rigorous statistical analysis of large data sets gives the last word for testing hypotheses is now something nobody disputes.

So what has this to do with linguistics? There are similarities between some linguistics texts (for example, Beekes) and 1950s economics literature. Numerous patterns are observed, and these are then organised into laws by finding ways of dismissing exceptions (e.g. invoking supposed borrowings, or sub-laws followed by sub-sub-laws to cover the exceptions). Hypotheses are often based on anecdotal observations, and we have no way of carrying out an exhaustive search for exceptions that might disprove the hypothesis. Some remote language connections (e.g. the Nostratic hypothesis) seem more speculative, and there are genuine differences between experts on whether apparent similarities are due to chance or not.

All of this seems ripe for a more structured statistical analysis. I assume nobody is fluent in all living Indo-European languages, let alone in the many languages where we only have fragments. That is why human discovery of linguistic laws has been a bit hit and miss. But the data exists, and such data-rich problems are ideal for a more statistical approach. A computer might detect laws of which we are not yet aware, and may also reveal that some questionable relationships are probably due to chance.
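
As a rough illustration of what "probably due to chance" could mean in practice, here is a minimal sketch of a permutation test in Python. It is not anyone's published method. The word lists are invented placeholders, and the similarity measure (same first letter) is a deliberately crude assumption; the function names are hypothetical too. The idea is only that shuffling one list breaks the link between meanings, so the shuffled scores show how many matches arise by accident.

```python
import random

# Invented placeholder word lists, aligned by meaning slot.
# They are NOT real data from any language.
LIST_A = ["pata", "tiru", "kemo", "sula", "nami", "loke", "mura", "widu"]
LIST_B = ["pено".replace("ено", "odu"), "teka", "kila", "somi", "rabu", "duni", "mesa", "gola"]


def initial_matches(words_a, words_b):
    """Count meaning slots where both (non-empty) words share a first letter."""
    return sum(a[0] == b[0] for a, b in zip(words_a, words_b))


def permutation_p_value(words_a, words_b, trials=10_000, seed=0):
    """Estimate how often shuffled pairings match at least as well as the real one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = initial_matches(words_a, words_b)
    shuffled = list(words_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the meaning-by-meaning alignment
        if initial_matches(words_a, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    observed, p = permutation_p_value(LIST_A, LIST_B)
    print(f"observed matches: {observed}, estimated p-value: {p:.4f}")
```

A small p-value would say only that the observed matches are unlikely under random pairing. It would not distinguish inheritance from borrowing, and a realistic test would need a far better similarity measure than first letters.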

It seems to me that the UPenn group is falling into some of the traps which fooled early computational economists. For example, they seem overly confident in simple models. But over time, models will get better. The UPenn restriction to trees is an obvious first step, but there is no inherent computational restriction against investigating more complex structures. I expect that 30 years from now, linguistic questions will be routinely addressed by computer-aided statistical tests, just as economic questions are today. This is likely to be less error-prone than today's anecdotal approach. The plausibility of hypotheses such as the Nostratic hypothesis should then be amenable to statistical quantification.

Andrew Smith

-----Original Message-----
From: Piotr Gasiorowski <gpiotr@...>
To: cybalist@egroups.com <cybalist@egroups.com>
Date: 04 April 2000 17:57
Subject: Re: [cybalist] Re: Computational Historical Linguistics

----- Original Message -----
From: Gregory L. Eyink <eyink@...>
To: <cybalist@egroups.com>
Sent: Monday, April 03, 2000 11:28 AM
Subject: [cybalist] Re: Computational Historical Linguistics

Dear Greg,

I absolutely agree with your criticism. It seems to me the UPenn group apply the methods of computational taxonomy mechanically and blindly, largely ignoring the crucial differences between biological and linguistic evolution.

Most obviously, areal effects, bilingualism and creolisation play a more serious role in historical linguistics than horizontal gene transfer, symbiosis and hybridisation do in evolutionary biology, especially if one takes into account the different rates of change involved. The tree model has some validity in linguistics, but the tree is certainly "fuzzier" than in biology and more difficult to reduce to a bifurcating structure. If we were dealing with infinite "Lebensraum" in which languages may spread forever, each drifting away from its sister after the split, there would be no further interaction. But Eurasia is not spacious enough, and there HAS been a lot of secondary interaction. Unlike organisms, even distantly related languages may freely influence each other; there is no reproductive barrier to prevent them. If one wants to conduct a cladistic analysis of a group of languages, some PRINCIPLED way of predicting and recognising the effects of language contact should be proposed. It may be the case that phonological and morphological features (being less borrowable) should be trusted more than lexical "characters", but one has to bear in mind that many similarities in these domains are coincidental, typological or due to parallel development rather than common origin.

In biology, it's more or less clear what counts as a genomically determined character and what character states may be assumed to represent synapomorphies. In linguistics, it would seem, anything goes. Unique and trivial sound changes and morphological traits are lumped together as "characters". I share your feeling that proposing a retrograde process of de-satemisation for Germanic (with the simultaneous re-labialisation of former labiovelars!) is absurd. No mechanism of linguistic change that I'm familiar with would work that way. It would be like a modern species of bird growing clawed fingers on its forelimbs, a long tail and a set of teeth. Of course synapomorphic features may be secondarily lost, like hairs in the Cetacea, but it's a NEW SYNAPOMORPHY of that clade, not a return to proto-synapsid hairlessness (which was, anatomically, something very different from cetacean hairlessness). The Romance languages have developed conditioned fricative and/or affricate reflexes of Latin velars and labiovelars, but French is not a satem language though it has [sa~] for *k@...óm.

The "findings" of the project are hardly spectacular. They are partly trivial ("confirming" what everybody knows), partly too obviously fallacious (such as the "Graeco-Armenian" grouping, resulting no doubt from a traditionalist bias in the choice of characters). As for Germanic, the bottom line is that they simply have no idea how it fits anywhere. Inflated claims such as quoted in your posting raise serious doubts as to the researchers' ability to be self-critical and don't invite an informed reader to take their model seriously. Which is a pity, in a way, since as you rightly point out the method, if applied with more caution, could be of some legitimate use in analysing genetic relationship.

Piotr

Greg wrote [responding to Mark]:

> [...] I understand that their algorithm produces rootless trees. What I am
> really questioning is the value and meaning of such a model, HOWEVER
> it is produced. If an evolutionary biologist has a theory of bird
> origins which postulates that the maniraptorian clade divided into
> aves, dromaeosaurs, troodontids, therizinosaurs, and oviraptors, then
> he may or may not be correct, but I know exactly what he means.
> According to the papers at the UPenn website they propose that the
> tree model should apply to linguistic families which are
> geographically spreading, so that separating members should no longer
> interact. Fine. However, when the UPenn tree shows that
> Germano-Balto-Slavic and Indo-Iranian divided from a common ancestor,
> what does this really mean? If I take their model literally, then one
> must assume that either (i) Indo-Iranian and Balto-Slavic
> independently underwent a satemic shift or (ii) the satem shift
> occurred in the common proto-language and Germanic underwent a
> retrograde centum shift! (Very unlikely phonologically.) It is not
> just an issue with the placement of Germanic either. For example,
> Armenian also has satemic features. The proper conclusion really seems
> to be that the tree model is just not a valid representation of
> linguistic facts for the IE family. I still don't see the value
> of using fancy mathematics to "correctly" produce a bad model!
> This linguistic theory produces contradictions whose explanation
> requires going outside the tree model itself, e.g. invoking areal
> change, or wave theory.
>
> >
> > Essentially, all they are saying is that OE (which is NOT a satem
> > language) is nonetheless best placed inside the group which did
> > undergo satemization. The literature I've read says there are
> > incompletely explained peculiarities in Germanic which largely
> > disappear if you posit a strong genetic (but pre-satemic) relationship
> > with the B-S and I-I branches.
> >
>
> The UPenn group certainly emphasizes the anomalous position of
> Germanic in construction of the optimal tree. For example, they have
> found that if they removed Germanic from the tree construction, then
> they obtained a "perfect phylogeny", i.e. a tree for which all
> linguistic characters employed were compatible. But does this mean
> that this "perfect" tree should be regarded as having established
> validity of the tree concept, and representing an historic fact?
> There are many aspects of the "perfect" tree that are still very
> controversial. For example, it supports the existence of a
> Greco-Armenian proto-language. Yet very credible arguments have been
> presented against such an hypothesis, e.g. by James Clackson in his
> 1994 Oxford monograph, "The Linguistic Relationship Between Armenian
> and Greek." There is a lot hidden in the UPenn results in terms
> of the linguistic characters employed and the values assigned to
> compatibilities.
>
>
> > Much of what they are doing seems to be 'tinkering'. They are
> > attempting to find those linguistic features which can accurately
> > predict known relationships, and then apply the same methodology to
> > unknown relationships. One can only wish them success.
>
> I wish any honest scientist well. I certainly did not intend my
> remarks to be "nasty", just an honest criticism. However, my negative
> remarks are a reaction against what I see as the UPenn group's
> tendency to "oversell" their method, or to "intimidate" with
> sophisticated mathematics. An example: In their IRCS report they claim
> that one of the important results of their methodology is "the ability
> to detect and handle loanwords that are not distinguishable from
> cognates by traditional methods." This sounds really wonderful, right?
> Isn't it great that they can feed well-known linguistic data into
> their miraculous mathematical machine and get such striking
> conclusions as the output? However, an examination of their work shows
> otherwise. The above claim is based upon their difficulty in fitting
> Germanic into the tree. They found that using linguistic characters
> based upon phonology and morphology gave a tree in which Germanic was
> grouped with Indo-Iranian and Balto-Slavic. However, if they used
> characters based upon vocabulary, then Germanic was best grouped with
> Italo-Celtic. To explain the discrepancy, they THEORIZED a non-tree
> effect: that Germanic at an early stage had borrowed much of the
> distinctive common Western vocabulary (e.g. Goth. `fisks', Lat.
> `piscis', OIr. `iasc') from Italo-Celtic. This is not an automatic
> output of their mathematical apparatus, but an independent speculation
> on their part. It is also not the only possible explanation
> (e.g. the items in question could be independent borrowings
> from a western-European pre-IE substrate.) What the example really
> shows, again, is that the tree model breaks down. If the underlying
> linguistic theory of separate development were correct, then it
> wouldn't matter which set of linguistic characters were employed
> (phonological-morphological vs. vocabulary) and the same tree would
> result. The fact that it doesn't just means that the tree model is
> insufficient.
>
> I can see some value in using the UPenn algorithms as a way of testing
> the limits of validity of the tree model. I see a lot of their
> conclusions as being not so different from what traditional linguistic
> methods have produced using the same data, but perhaps better
> quantified. For example, it could be useful to have "compatibility
> scores" for different possible trees, or to see that different trees
> result from different linguistic characters. This would all be
> valuable if used correctly. However, this is not the UPenn attitude.
> They take it as a CRITICISM of lexicostatistics that the
> "best-informed mathematical linguist who attempted such work makes
> notably modest and reserved claims for the method." Instead, they make
> very arrogant and overblown claims for theirs. They claim to "resolve
> longstanding open problems" such as the Indo-Hittite and Italo-Celtic
> hypotheses. They boast that their method "has been able to construct
> a robust evolutionary tree of the IE languages" whereas "traditional
> methods failed." This is just not honest.
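
The "perfect phylogeny" and "compatibility" ideas in the quoted message can be made concrete with a small Python sketch. It rests on strong simplifying assumptions and is not the UPenn algorithm: characters are reduced to two states (0/1), and the taxa and character values below are invented toy data. For two-state characters, a pair can sit on one tree without repeated or reversed changes unless all four combinations 00, 01, 10 and 11 occur among the taxa. If every pair passes, a perfect phylogeny exists for that character set.

```python
from itertools import combinations

# Invented toy data: six hypothetical taxa and two hypothetical
# two-state characters. Taxon "G" plays the role of a misfit.
TAXA = ["A", "B", "C", "D", "E", "G"]
CHARACTERS = {
    "phon": [1, 1, 0, 0, 0, 1],  # G patterns with A and B
    "lex":  [0, 0, 1, 1, 0, 1],  # G patterns with C and D
}


def compatible(char_a, char_b):
    """True unless all four state pairs (0,0), (0,1), (1,0), (1,1) occur."""
    return len(set(zip(char_a, char_b))) < 4


def incompatible_pairs(characters):
    """List the names of every pair of characters that cannot share a tree."""
    return [
        (name_a, name_b)
        for (name_a, states_a), (name_b, states_b)
        in combinations(characters.items(), 2)
        if not compatible(states_a, states_b)
    ]


def drop_taxon(characters, index):
    """Return a copy of the character table with one taxon column removed."""
    return {
        name: states[:index] + states[index + 1:]
        for name, states in characters.items()
    }


if __name__ == "__main__":
    print("with G:   ", incompatible_pairs(CHARACTERS))
    without_g = drop_taxon(CHARACTERS, TAXA.index("G"))
    print("without G:", incompatible_pairs(without_g))
```

In this toy table the two characters conflict only because of taxon G, and the conflict disappears once G is removed. That mirrors the situation described above for Germanic, but it does not show which character (if either) reflects borrowing. That step remains a separate linguistic judgment, as the quoted message argues.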
