[cybalist] Re[2] Computational Historical Linguistics

Andrew D. Smith writes:

I'm not an expert in computational linguistics, but I am in
computational economics, and I think there are some useful parallels.

Up until the 1960s, the science of economics had amassed a huge volume
of folklore. Most of this was developed on the basis of anecdotal
evidence or by extrapolation from existing trends, often by laborious
processes involving piles of newspapers and slide rules. The problem is
that it's easy, even for an expert, to "spot" spurious trends in random
data.
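
The point about spurious trends can be illustrated with a minimal Python
sketch. It is only an illustration under stated assumptions: the function
names, the walk length of 100 and the 0.5 correlation threshold are all
arbitrary choices, not taken from any economics study. It generates pure
random walks, which have no trend by construction, and counts how many
nonetheless correlate strongly with time.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(length, rng):
    """Cumulative sum of independent Gaussian steps: no true trend."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def fraction_with_apparent_trend(n_series, length, threshold, rng):
    """Share of random walks whose |correlation with time| exceeds threshold."""
    time = list(range(length))
    hits = sum(
        abs(pearson_r(time, random_walk(length, rng))) > threshold
        for _ in range(n_series)
    )
    return hits / n_series


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    share = fraction_with_apparent_trend(1000, 100, 0.5, rng)
    print(f"{share:.0%} of trendless series look trended")
```

A large share of these series would look "trended" to the eye, even
though every one of them is noise accumulated step by step.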

In about 1960, the statisticians got going, with newly invented
computational techniques. They found that many supposed sure-fire ways
of making profits were not in fact the money machines they were
previously made out to be. On the face of it, this rigorous approach had
discredited much of what was previously thought to be gospel. Of course,
there was plenty of fair criticism of the statisticians. Their computer
models were, initially, unsophisticated, and missed out a whole raft of
extra information that an investment professional would have at their
fingertips.

Now, 40 years later, there is something of a meeting of the minds. The
computer models are clever enough to take into account most of what
economists thought was important. Nobody now disputes that rigorous
statistical analysis of large data sets has the last word in testing
hypotheses.

So what has this to do with linguistics? There are similarities between
some linguistics texts (for example, Beekes) and 1950s economics
literature. Numerous patterns are observed, and these are then organised
into laws by finding ways of dismissing exceptions (e.g. invoking supposed
borrowings, or sub-laws followed by sub-sub-laws to cover the
exceptions). Hypotheses are often based on anecdotal observations, and
we have no way of carrying out an exhaustive search for exceptions that
might disprove the hypothesis. Some remote language connections (e.g.
the Nostratic hypothesis) seem more speculative, and there are genuine
differences between experts on whether apparent similarities are due to
chance or not.

All of this seems ripe for a more structured statistical analysis. I
assume nobody is fluent in all living Indo-European languages, let alone
in the many languages where we only have fragments. That is why human
discovery of linguistic laws has been a bit hit-and-miss. But the data
exists, and such data-rich problems are ideal for a more statistical
approach. A computer might detect laws of which we are not yet aware,
or might instead reveal that some questionable relationships are
probably due to chance.
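
One way to put a number on "probably due to chance" is a permutation
test. The following Python sketch is a toy under loud assumptions: the
word lists are invented strings, not data from any real language, the
function names are hypothetical, and comparing initial letters is a
crude stand-in for a proper model of sound correspondences. It is not
the UPenn method. It asks how often a random re-pairing of meanings
produces as many matches as the real pairing does.

```python
import random


def initial_matches(words_a, words_b):
    """Count meaning slots where both words start with the same letter."""
    return sum(a[0] == b[0] for a, b in zip(words_a, words_b))


def permutation_p_value(words_a, words_b, n_shuffles, rng):
    """Estimate how often shuffled pairings match at least as well as the real one."""
    observed = initial_matches(words_a, words_b)
    shuffled = list(words_b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the link between meaning and form
        if initial_matches(words_a, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    # Invented toy word lists, aligned by (imaginary) meaning.
    list_a = ["pata", "mino", "tiru", "kesa", "nolu", "sema", "lopi", "wanu"]
    list_b = ["pelo", "masu", "tona", "kiri", "nube", "sati", "lemu", "wiko"]
    rng = random.Random(0)
    p = permutation_p_value(list_a, list_b, 2000, rng)
    print(f"estimated p-value: {p:.4f}")
```

A small p-value says the observed matches are hard to get by shuffling
alone; a large one says the resemblance is what chance would produce.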

It seems to me that the UPenn groups are falling into some of the traps
which fooled early computational economists. For example, they seem
overly confident in simple models. But over time, models will get
better. The UPenn restriction to trees is an obvious first step, but
there is no inherent computational restriction against investigating
more complex structures. I expect that 30 years from now, linguistic
questions will be routinely addressed by computer-aided statistical
tests, just as economic questions are today. This is likely to be less
error-prone than today's anecdotal approach. The plausibility of
hypotheses such as the Nostratic hypothesis should then be amenable to
statistical quantification.
-----------------------------

Gerry: Your economics background has certainly helped me understand the
structuring of Beekes' text, the future of Nostratic, and the merits of
UPenn's work, and has allowed me to place the past 30 years in vivid
perspective. Thank you.

Gerry
--

Gerald Reinhart
Independent Scholar
(650) 321-7378
waluk@...
http://www.alekseevmanuscript.com