Trask's rebuttal of 'Celtic Found to Have Ancient Roots'

Date: Mon, 07 Jul 2003 15:26:02 +0100
From: Larry Trask <larryt@...>
Subject: Re: 14.1825, Media: NYT: Celtic Found to Have Ancient Roots

Last week Anthony Aristar drew our attention to a story in the New
York Times about a recent article in the Proceedings of the National
Academy of Sciences of the USA (Linguist 14.1825). The article
proposes a new way of drawing family trees in historical linguistics,
and it presents a (partial) tree for Indo-European, with particular
emphasis on Celtic. The authors' conclusions are surprising in two
respects. First, they conclude, contrary to the usual view, that the
Insular Celtic languages form a single taxon within Celtic. Second,
they claim to be able to estimate absolute dates for the splits in
their tree, and they propose dates that are vastly earlier than those
commonly accepted: ca. 8100 BC for the break-up of PIE, and ca. 3200
BC for the split of Insular from Continental Celtic.

Anthony notes that one of the authors is quoted as making some
insulting remarks in the Times article, implying that we linguists are
too dumb to understand his important work. Anthony asks if anyone has
had a look at the article, which is this:

Peter Forster and Alfred Toth. 2003. 'Toward a phylogenetic chronology
of ancient Gaulish, Celtic, and Indo-European'. PNAS. Available on
line: www.pnas.org/cgi/doi/10.1073/pnas.1331158100 .

Well, I've now worked through the article very carefully, and I have a
great deal to say about it. Depending on taste, you may find my
account shocking, depressing, or just funny.

1. The linguistic background

This is an attempt at drawing genetic trees using character-states. A
character is a linguistic "slot", a meaning or a function that must be
provided by some linguistic material. A state is a piece of
linguistic material filling that slot.

For example, the meaning 'man' is a character. If -- as is commonly
done -- we take cognation (common ancestry) as the basis for assigning
states, then Latin <vir> is one state, English 'man' and German <Mann>
are a second state, French <homme>, Spanish <hombre> and Italian
<uomo> are a third state, modern Greek <andras> is a fourth state,
Basque <gizon> is a fifth state, and so on.
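
As a minimal sketch of this coding scheme, the 'man' example can be
written out in Python. The cognate-set labels and the helper name
assign_states are illustrative assumptions, not anything drawn from
the article under review; the only data are the forms just listed.

```python
# Illustrative only: code the character 'man' by cognation.
# The cognate-set labels (e.g. "HOMO") are hypothetical names for this sketch.
forms = {
    "Latin": ("vir", "VIR"),
    "English": ("man", "MAN"),
    "German": ("Mann", "MAN"),
    "French": ("homme", "HOMO"),
    "Spanish": ("hombre", "HOMO"),
    "Italian": ("uomo", "HOMO"),
    "Greek": ("andras", "ANER"),
    "Basque": ("gizon", "GIZON"),
}

def assign_states(forms):
    """Give each cognate set a state number in order of first appearance."""
    state_of_set = {}
    states = {}
    for language, (_word, cognate_set) in forms.items():
        if cognate_set not in state_of_set:
            state_of_set[cognate_set] = len(state_of_set) + 1
        states[language] = state_of_set[cognate_set]
    return states

states = assign_states(forms)
for language, state in states.items():
    print(f"{language:8s} state {state}")
```

Languages that share a cognate set receive the same state number, so
the three Romance forms end up together and Basque stands alone.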

Once determined, these character-states are used to construct trees
according to criteria selected by the investigators. One criterion
commonly used is this: each innovation should appear at only one
branching point in a tree. Another is this: the tree should be robust
-- that is, it should be little altered by a different choice of
characters.
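
The first criterion can be made concrete for the simplest case. For
two binary characters, a standard result from phylogenetics (the
four-gamete test, offered here as an illustration and not as the
procedure of any group discussed in this posting) says that both can
be placed on a single tree, with each innovation arising only once,
exactly when the four combinations 00, 01, 10 and 11 do not all occur
among the languages. The language names and state vectors below are
invented for the sketch.

```python
def compatible(char_a, char_b):
    """Four-gamete test for two binary characters.

    Each argument maps language -> 0 or 1. The characters fit one tree
    with a single change each unless all four state pairs are observed.
    """
    observed = {(char_a[lang], char_b[lang]) for lang in char_a}
    return len(observed) < 4

# Hypothetical data for four invented languages A-D.
x = {"A": 1, "B": 1, "C": 0, "D": 0}
y = {"A": 1, "B": 0, "C": 0, "D": 0}   # nested inside x: compatible
z = {"A": 1, "B": 0, "C": 1, "D": 0}   # cuts across x: conflict

print(compatible(x, y))
print(compatible(x, z))
```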

Now, linguists have been working on methods of this sort for some
time. Most prominently, the group based at the University of
Pennsylvania has been developing such methods for quite a few years
now, and it has published a number of reports, including at least one
presenting a tree for IE. But the present authors do not cite any of
this work, and they seem to be unaware of its existence. They appear
to believe that they are the first people ever to try such an approach
in linguistics. This is not a good sign.

The authors, who are geneticists, propose an approach taken from
genetics, where they say it has been very successful, and they believe
it should be just as successful in our field. We'll see.

The authors also claim that their method does not force trees, but
that it is consistent with the presence of networked structures
(reticulations), in which branches of the tree are cross-connected.
They tell us, in fact, that their method marries the tree model and
its rigid binary branching to Johannes Schmidt's wave model.

However, it is clear that they do not understand what the wave model
is. Their account of it is badly confused, and they describe it
repeatedly as no more than an account of borrowing between distinct
speech varieties. But this misses the point altogether. Wave theory
is an alternative to binary branching, in which innovations spread out
from any number of different centers, producing a dialect continuum,
rather than a tree-like arrangement with sharply distinct varieties.
Again, this bad misunderstanding is not encouraging.

2. Why in PNAS?

Why have the authors published in PNAS? PNAS is not a journal of
linguistics, and few linguists read it. Since the only possible
readership for the article is historical linguists, why did they not
submit it to a journal of linguistics, where it would be seen?

There's more. PNAS enforces strict limitations on space. As a
result, the writing style of this article is uncomfortably terse and
compressed. In a number of cases I found myself wanting more
information, more detail -- but I didn't get it, because of the space
limitation.

Even worse, some critical information on the authors' procedure is
absent altogether from the article, and has been relegated to a
Website, which the reader must consult in order to find out why the
authors have made some important decisions. This is infuriating.
That information is central to the authors' case, and it really ought
to be in the article.

PNAS was a very bad idea.

3. Choosing the characters

The languages chosen include 13 IE languages, ancient and modern, plus
Basque, described as a "negative control".

The authors have a special interest in Celtic, including the extinct
and sparsely recorded Gaulish. They therefore expressly choose
characters which they believe are well represented in Gaulish, and
more particularly in Gaulish-Latin bilinguals. But the characters
they choose are very odd.

Their first choice is the difference between Subject-Verb word order
and Verb-Subject order. Now, I have never before seen it suggested
that SV versus VS is of any great linguistic interest, and the authors
seem to have chosen this odd item expressly because it separates the
VSO Insular Celtic languages from all the others. But it is clearly
out of order to choose eccentric characters solely in order to single
out groupings you hope to pick out (see below).

Another character is the presence or absence of the cluster /ps/,
described as present in Greek, Latin and English but absent elsewhere.
But this is absurd. The authors appear to believe that /ps/ was
present in PIE, and that it survived in these three languages but was
lost elsewhere. However, I know of no evidence for a cluster /ps/ in
PIE. The Latin and Greek instances, I think, are all acquired, not
inherited. Some instances arise by borrowing; others arise at
morpheme boundaries; and still others arise by various phonological
changes. And the native English examples all occur at morpheme
boundaries, as in 'cups', 'slaps' and 'upside-down'. This character
could hardly be more pointless.

Some of their other "characters" are no such thing. They choose
entire phrases, like 'and to men'. But this is not a character: it's
a whole cluster of lexical and grammatical characters. What happens
when two languages match in some respects but differ in others? How
can states be sensibly assigned?

All in all, the list of characters is poorly thought out. And there
are further problems with this list, as we'll see later.

4. Assigning the states

It's at this point that the authors' method falls completely to
pieces.

States are usually assigned on the basis of cognation. But the
authors reject this approach. Why? Because, they say, appealing to
cognation automatically implies a particular tree, and a tree is what
they're trying to find, so working with cognates is "circular".

But this is nonsense. Establishing cognation requires no appeal to
trees at all. In fact, we don't even try to draw any trees until we
have first established enough cognates to give us material to work
with. The authors could not be more confused than they are.

Anyway, having rejected cognation, the authors now require some other
criteria for assigning states. What criteria do they come up with?

Nothing. Nothing at all. They offer *no* criteria. Instead, they
make it up as they go along.

What they do is to appeal to an unexplained and wholly subjective
notion of "similarity". Two items are assigned to the same state if
the authors judge them to be similar, but to different states if the
authors judge them to be dissimilar. Let's see what that means in
practice.

Latin <filia> 'daughter' and its Spanish descendant <hija> are
assigned to different states, because the authors judge them to be
dissimilar. But the Gaulish inflected form <teuo-> 'to gods' and the
Scottish Gaelic prepositional phrase <do dhiadhan> are assigned to the
same state, because the authors judge them to be similar. Why are
they similar?

Breton <forn> 'oven' is assigned to the same state as Spanish <horno>,
but to a different state from Irish <sorn>. Italian <e> 'and' is
assigned to a different state from its Spanish cognate <y>, but to the
same state as the unrelated Basque <eta>. (Spanish <y> has a
positional variant <e>, but apparently that doesn't matter.) On the
other hand, the Gaulish genitive suffix <-i> is assigned to the same
state as Greek <-ou>. So, /i/ resembles /u/ but not /e/. How do the
authors come by these remarkable insights?

Normally, an overt suffix is counted as different from zero
suffix. However, Latin feminine <-a> is assigned to the same state as
French <-e>, even though in Parisian French that orthographic <-e> is
purely decorative, and the suffix is zero.

I could go on in this vein, but you get the idea. There is no rhyme
or reason in the assignment of states, and the authors' procedure is
as capricious as it is unexplained.
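
To see how hard it is to rescue these decisions with any explicit
criterion, consider a rough sketch using plain edit distance on the
spellings. This measure is assumed purely for illustration: the
authors state no such measure, and spelling is only a crude proxy
for sound. Even so, the pair they call dissimilar (<forn>/<sorn>)
comes out closer than the pair they call similar (<forn>/<horno>), so
no distance cut-off can reproduce both judgements.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# (form 1, form 2, the authors' judgement as reported above)
pairs = [
    ("forn", "horno", "same"),
    ("forn", "sorn", "different"),
]

for a, b, verdict in pairs:
    print(a, b, edit_distance(a, b), verdict)

# A cut-off needs every "same" pair to be closer than every "different" pair.
max_same = max(edit_distance(a, b) for a, b, v in pairs if v == "same")
min_diff = min(edit_distance(a, b) for a, b, v in pairs if v == "different")
print("consistent cut-off exists:", max_same < min_diff)
```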

At this point, the work under discussion abandons the discipline of
linguistics altogether, and in fact it ceases to be anything
recognizable as serious scholarship. Linguistics cannot be done in
terms of subjective notions of similarity. This is the kind of sludge
we see in those lurid articles claiming to have reconstructed
"Proto-World", and in those delightful Websites announcing "Latvian --
the key to all languages".

Whatever other virtues the authors' method may have, this shocking
procedure is enough to reduce their proposal to worthlessness.

An observation. Since <forn> and <sorn> are assigned to different
states, even though they differ only in their initial segments, it
appears to me that our authors must, if they want to maintain any
consistency at all, assign the Chicago and London pronunciations of
the word 'herb' to different states. After all, these have nothing in
common beyond their final /b/: Chicago has /Vrb/, with /r/ but no /h/,
while London has /hVb/, with /h/ but no /r/. Of course, this outcome
is ridiculous, but that's what happens when you make it up as you go
along.

Out of curiosity, I tried to apply the authors' method to English and
German. But I couldn't, because I had no way of knowing what should
be counted as similar. Are 'first' and <erste> similar, like <forn>
and <horno>, or dissimilar, like <forn> and <sorn>? What about 'day'
and <Tag>? These words have no segments in common at all. That makes
them more different than <forn> and <sorn>, but then recall that
<teuo-> and <do dhiadhan> are counted as similar. It appears that I
can't apply the authors' method unless I have the authors looking over
my shoulder and telling me what to count as similar. And this is
supposed to be science?

There is much more. The authors assign to the same state practically
all of the words for 'crane', including Latin <grus>, Irish <corr
mhóna>, Breton <garan>, and Basque <kurrillo> -- but *not* Welsh
<crychedd>, for some reason. Now, I have seen it seriously suggested
that at least some of these names are imitative in origin, reflecting
the bird's distinctive cry. I don't know if that's true or not, but
clearly the authors have taken no steps to exclude imitative forms --
a serious shortcoming in comparative work.

Further, the authors mysteriously assign Welsh <mam> 'mother' to a
different state from Latin <mater>, but to the same state as Basque
<ama>. However, the Welsh and Basque words are mama/papa words, and
every linguist knows that mama/papa words are useless in comparison.
But our authors don't know this, and they solemnly report these items
and assign them to states, as though they were doing something
sensible.

The authors assign English 'day' and Latin <dies> to the same state,
but these words are unrelated, and they resemble each other purely by
chance. Like everybody who tries to work with similarities, the
authors are helpless to exclude chance resemblances.

So, imitative words, mama/papa words, chance resemblances -- the
authors have committed every schoolboy howler I can think of. They
badly need a course in historical linguistics.

I'm not done yet. The Basque data presented here contain a number of
errors, some of them very serious. For example, they report a
"nominative singular suffix" <-a> for Basque, and they assign this to
the same state as the nominative singular feminine <-a> of Latin and
some other languages. But Basque, with its wholly ergative morphology,
doesn't even have a category of nominative, let alone a nominative
ending, and what the authors are reporting is merely the definite
article <-a>.

What the authors report as the Basque "dative" suffix is in fact the
benefactive suffix (and even this is given wrongly). The real dative
ending, <-i>, happens to occur twice in the phrases 'to gods' and 'and
to men', but the authors fail to notice this.

Of course, Basque is only the control language here, but the errors in
the Basque data are so numerous and so serious that I have to wonder
whether similar errors might be lurking in the data for the IE
languages I don't know, like Occitan and Scottish Gaelic.

5. Drawing the tree

The authors draw their tree by hand. Their first step is to throw
away all the characters which, in their opinion, produce results
that are too messy -- that is, insufficiently tree-like. Of
their 35 characters, they throw away seven for this reason, and those
characters are not used at all. Well, I'm exceeding my competence
here, but this doesn't look very principled to me. Is it really OK to
throw away all the data that give you results you don't like?

By the way, having jettisoned seven of their 35 characters, the
authors announce that they have 29 left. This is a trivial point, of
course, but it does nothing to instill confidence in the care and
attentiveness of the authors.

The authors go on to begin their tree with the binary characters
(those with only two states, which include the really silly ones like
SV and /ps/). Then they decompose the ternary characters into binary
segments. Any character which fails to give a sufficiently tree-like
graph is postponed for later use. In short, they do everything they
can to force binary branching and thus a conventional tree. Only at
the last is a little reticulation admitted.

Only one tree is drawn. There is no searching of tree space, and so
this is not a "best tree" method. The authors do not check their tree
for robustness, by testing it with different characters. One tree is
all we get.
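
For a sense of what searching tree space would involve: the number of
distinct unrooted, strictly binary trees on n labelled tips is the
double factorial (2n-5)!!, a standard combinatorial result. The
sketch below evaluates it for the 13 IE languages in the sample; the
function name is an assumption made for this illustration.

```python
def unrooted_binary_trees(n_tips):
    """Number of unrooted binary trees on n labelled tips: (2n-5)!!."""
    if n_tips < 3:
        return 1
    count = 1
    for k in range(3, 2 * n_tips - 4, 2):   # 3, 5, ..., 2n-5
        count *= k
    return count

print(unrooted_binary_trees(13))   # 13749310575
```

A single hand-drawn tree therefore leaves billions of alternatives
unexamined, which is why tree-search methods normally fall back on
heuristics.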

The tree is rootless, but the authors insert a root, representing PIE,
at a point of their choosing.

I'll have to leave it to someone more competent to pass judgement on
this tree-drawing procedure. But it looks fishy to me.

By the way, I don't understand the function of the control
language. Having reported the Basque data (badly), and having solemnly
assigned states, the authors then forget all about the language.
Since it is reported as sharing a few character-states with some of
the other languages, why is it not included in the tree? Earlier, the
authors rejected the use of cognation on the ground that it supposedly
implies something about the tree which they are trying to draw. But
they seem to have omitted Basque from their tree for no reason apart
from an *a priori* belief that Basque shouldn't be in the tree. Is
this consistent?

6. The Celtic results

The authors report that their tree shows that the Insular Celtic
languages form a unit, separated from the rest of Celtic, represented
here by Gaulish. But this is a *big* blunder.

Apart from a single character which is unique to Gaulish and so
irrelevant, the Insular Celtic languages as a group are separated from
Gaulish by only three character-states. Let's look at these.

The first is that VS word order. But I complained earlier that this
character was introduced *ad hoc* merely to do the job of singling out
Insular Celtic. This is out of order.

But it's much worse than that. Gaulish is recorded many centuries
earlier than our first records of the Insular languages. Since
Proto-Celtic is widely thought to have had SOV order, it is quite
possible, and even likely, that VSO order had not yet developed in the
Insular languages at the time when Gaulish was recorded. (Note that
our earliest records of Irish show some evidence of SOV order.) It is
also possible that Gaulish itself would have developed VSO order if it
had survived until the time when our records of Insular Celtic begin.
Note that, while Gaulish is predominately SVO, it exhibits VSO order
in certain constructions. Very likely these constructions represent
the first steps in the process which led eventually to the
introduction of VSO order in the Insular languages, many centuries
later.

The other two character-states separating the Insular languages from
Gaulish are both ancestral case-suffixes which survived in Gaulish but
are gone in the Insular languages. But, again, note the huge time
difference between Gaulish and Insular Celtic. Very likely, those
suffixes were still present in the Insular languages, too, at the time
when Gaulish was recorded. And quite possibly they would have
disappeared in Gaulish as well, if Gaulish had survived to the time
when Irish and Welsh were recorded.

The authors' case does not stand up. They have in fact presented no
evidence at all for an Insular Celtic unity. All they have done is to
note the passage of time and the linguistic changes which go with it.
I think I am not going too far when I say that their claim for an
Insular unity is foolish and linguistically naive.

One more point, involving that silly /ps/ character. Noting the
absence of /ps/ from Celtic, and noting that Latin /ps/ has
disappeared in the Romance languages, the authors attribute this
disappearance to a Celtic substrate! Er -- a Celtic substrate in
Tuscany? I don't think so.

Anyway, the authors, focusing on their beloved /ps/, have failed to
notice that quite a number of original and secondary Latin clusters
were eliminated in Romance, including at least one -- medial /kl/ --
which is present in Gaulish. So much for a Celtic substrate. Gad.

7. The dating

The authors claim to be able to assign moderately reliable absolute
dates to branching events in their tree, and they do this, producing
the astoundingly early dates given earlier.

But it is clear that they have merely re-invented glottochronology.
They claim that the rate of replacements is approximately constant --
a position known to be false. Drawing a parallel with genetics, they
insist that any fluctuations in the rate of replacement will average
out successfully over time. Well, this may be true in genetics, where
the time depths in question are millions or tens of millions of years.
But they have given us no reason to suppose that it is true in
linguistics, where the time spans are only a few thousand years.
Their dating claims are based upon unsupported assertions, and
assertions which are extremely unlikely to be true.
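
For readers who have not met it, the classical glottochronological
calculation is t = ln(c) / (2 ln(r)), where c is the proportion of
items two languages still share and r is the assumed retention rate
per millennium. The sketch below is that textbook formula, not the
authors' own computation, and the sample values of c and r are
hypothetical, chosen only to show how strongly the date depends on the
assumed rate.

```python
import math

def divergence_time(shared, retention):
    """Millennia since two languages split, assuming each retains the
    fraction `retention` of the word list per millennium."""
    return math.log(shared) / (2 * math.log(retention))

# Hypothetical inputs: 30% shared items, three assumed retention rates.
for r in (0.80, 0.86, 0.90):
    print(f"r = {r:.2f}: {divergence_time(0.30, r):.1f} millennia")
```

A modest change in the assumed rate moves the estimated split by
thousands of years, so everything hangs on whether the rate really is
constant.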

Anyway, recall that all this rests upon what has gone before. At
least glottochronology operates with a principled and rigorous
definition of replacement. But our authors don't: their notion of a
replacement is wholly unprincipled and capricious. They are therefore
claiming that the rate of occurrence of the events they capriciously
choose to call "replacements" is constant. Is there any point in
taking this seriously?

8. Summing up

This paper is a disaster. There is no reason for any linguist to pay
any attention to it. The procedure described is capricious,
unprincipled, arbitrary, *ad hoc* and subjective from beginning to
end.

One of the authors remarks about this work, in the Times article, "To
be honest, [linguists] don't understand it, most of them. They don't
even know what I'm talking about." Well, for my part, I don't
understand how this mess made it into print. I can't believe that a
competent referee would allow this stuff to pass. But, interestingly,
PNAS is not peer-reviewed. Hmmm. This is not the first time that
PNAS has published some extremely dubious work in historical
linguistics.

In this case, I'm afraid, our scientific colleagues have nothing to
teach us. Instead, they have a great deal to learn from us, about the
use of principled and rigorous procedures, and even, it would seem,
about collecting accurate data. And they could certainly do with a
few lessons in linguistics.

Larry Trask
COGS
University of Sussex
Brighton BN1 9QH
UK

larryt@...

---------------------------------------------------------------------------
LINGUIST List: Vol-14-1876