Re: Genetic Tree for Language Matching

From: x99lynx@...
Message: 13726
Date: 2002-05-16

Way, way back, I wrote:
<<These are cladistic tree (at least they was generated by cladistic
phylogenic software).>>

"Michal Milewski" <milewski@...> replied (Thu, May 2, 2002 6:57 pm):
<<Sorry, but what you posted is in no way a classical cladistic tree. When you
draw a cladistic tree it cannot be based on arbitraly selected similarities
between analysed groups, but exclusively on the position of their common
ancestors.>> 

I did not get a chance to reply to Michal when this note was sent, but I
thought I would just drop this quick reply in case he is reading or anyone
else is interested.

First, I just want to note that the tree I posted (under this topic name last
month) is from one of the articles (Underhill's) that Michal cited and that
he sent to me.

That tree, or rather network, makes "Central Asia" an offshoot rather than a
main branch, as a directional and quantitative vector tabulated from the
genetic data and groupings in the Underhill research, which I think is one of
Michal's problems with it, given his understanding of the data. Otherwise I
don't think the tree looks very unconventional, and it seems to mainly match
up with what many of us might have guessed the general direction of the
Out-of-Africa spread to have been, given what climatic conditions might
have been at any point circa 97,000-37,000 BC (ice age and all that). I
personally don't think the Underhill study is much help with historical
linguistics, because I think genes and language are a bit of a false mix,
one that has been misused time and time again on the strength of whatever is
the "latest" genetic science.

The diagram I posted is described as a "Maximum likelihood network inferred
from the haplotype frequencies reported in Table 1."

Michal says this is not a cladistic tree. I'm just going to point out that a
"maximum likelihood" tree or network is one form of cladistic tree. The tree
Michal described in his post is one form, based on the criterion of
"parsimony"; the other cladistic criterion is "maximum likelihood."
Technically, both trees are cladistic. See, e.g.,
http://www.med.nyu.edu/rcr/rcr/course/phylo-cladistic.html.

A maximum likelihood tree maximises the statistical likelihood that the
specified evolutionary model produced the observed character-state data; the
model specifies the probabilities of character-state changes through
evolutionary time. A parsimonious tree is the optimal tree in the sense that
it requires the fewest evolutionary character-state changes to explain the
observed distribution of states across taxa.
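
To make the difference between the two criteria concrete, here is a small
sketch in Python. It is not the software or the data used in the Underhill
paper. The four taxa, the three binary characters, the symmetric two-state
model and the per-branch change probability p are all assumptions made up for
illustration. It scores two candidate trees once by parsimony (Fitch's
change-counting algorithm) and once by likelihood (Felsenstein's pruning
algorithm).

```python
import math

# Hypothetical rooted binary trees as nested tuples; leaves are taxon names.
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

# Made-up binary characters: one dict of taxon -> state per character.
CHARACTERS = [
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 0, "B": 1, "C": 0, "D": 1},
]


def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no extra change is needed.
        return common, left_changes + right_changes
    # The children disagree, so at least one change occurs below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, characters):
    """Total number of state changes the tree requires (lower is better)."""
    return sum(fitch(tree, ch)[1] for ch in characters)


def partial_likelihoods(tree, states, p):
    """Probability of the data below a node, given each state (0, 1) at it."""
    if isinstance(tree, str):
        return [1.0 if states[tree] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child in tree:
        below = partial_likelihoods(child, states, p)
        for s in (0, 1):
            # Assumed model: every branch flips the state with probability p.
            result[s] *= (1 - p) * below[s] + p * below[1 - s]
    return result


def log_likelihood(tree, characters, p=0.1):
    """Log-likelihood of all characters under the assumed model (higher is better)."""
    total = 0.0
    for ch in characters:
        part = partial_likelihoods(tree, ch, p)
        total += math.log(0.5 * part[0] + 0.5 * part[1])  # uniform root prior
    return total


print(parsimony_score(TREE_1, CHARACTERS))  # 4
print(parsimony_score(TREE_2, CHARACTERS))  # 5
print(log_likelihood(TREE_1, CHARACTERS) > log_likelihood(TREE_2, CHARACTERS))
```

On this toy data both criteria prefer the same tree, but they get there
differently: parsimony only counts changes, while likelihood weighs the data
against an explicit model of change.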

Underhill's tree, which used software that I think is normally used in
complex DNA-sequence phylogenetics, essentially calculated the most probable
ancestral sequence for one "region" at a time, based on the haplotypes in
those regions. The optimized "maximum likelihood" tree portrays the overall
"distances" between the regions in terms of most probable immediate ancestry
at each node. The tree did not capture much of the individual variance (I
think 18%), but the nature of the phylogeny-building analysis means that the
resulting tree was the "best match" for the data, within the evolutionary
sequences described in the original Y-Chromosome tree.

Michal wrote:
<<No classical cladistic tree can be formed, if the groups you try to
correlate include members that are more closely related (phylogeneticaly) to
some members of other groups than to themselves, as is obviously the case
when you try to compare selected geographical regions (these are groups)
containing different Y chromosome haplotypes (these are members).... what is
the "defining mutation" that defines the branch called "Europe" on fig.2 of
Underhill? And what about the branches called "Mideast" and "America"? And
also, what is the "defining mutation" for the whole branch that includes
subbranches "Mideast", "Morocco", "Basque" and "Europe"?>>

The way the software works, I believe, is that it assumes the sequences in
the original tree and then applies them to the "regional" clades based on the
distribution in those clades, so that each "member" = haplogroup (defining
mutation) + region. Given the total optimized "tree" of the various
haplotypes in a region, that region finds a place in a sequence of regions.
This happens in a backwards direction until a single most probable ancestral
sequence is calculated at the root. The result was checked against a number
of "bootstrap" (alternate) networks for probability of sequencing. Once
again, this is not an exact match of the data; it is the best match
available.
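
For anyone unfamiliar with the "bootstrap" check, here is a rough sketch of
the general idea in Python. It is not the procedure from the Underhill paper:
to keep it short it picks the best tree by parsimony rather than maximum
likelihood, and the four taxa, the ten characters, the replicate count and
the seed are all made up for illustration. The characters are resampled with
replacement, the best tree is re-chosen for each resample, and the share of
resamples that recover a given grouping is reported as its support.

```python
import random

# The three ways to pair up four hypothetical taxa, as nested tuples.
TOPOLOGIES = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

# Made-up data: eight characters group A with B, two group A with C.
CHARACTERS = (
    [{"A": 0, "B": 0, "C": 1, "D": 1}] * 8
    + [{"A": 0, "B": 1, "C": 0, "D": 1}] * 2
)


def fitch_changes(tree, states):
    """Return (candidate ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_changes(tree[0], states)
    right_set, right_changes = fitch_changes(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def best_topology(characters):
    """Name of the topology needing the fewest changes (ties: first listed)."""
    scores = {
        name: sum(fitch_changes(tree, ch)[1] for ch in characters)
        for name, tree in TOPOLOGIES.items()
    }
    return min(scores, key=scores.get)


def bootstrap_support(characters, replicates=200, seed=0):
    """Fraction of resampled data sets in which each topology comes out best."""
    rng = random.Random(seed)
    counts = {name: 0 for name in TOPOLOGIES}
    for _ in range(replicates):
        # Resample the characters with replacement, keeping the same size.
        sample = [rng.choice(characters) for _ in characters]
        counts[best_topology(sample)] += 1
    return {name: n / replicates for name, n in counts.items()}


support = bootstrap_support(CHARACTERS)
for name, share in support.items():
    print(name, share)
```

A grouping that survives in most of the resampled data sets is well supported
by the data as a whole; one that shows up only occasionally depends on a few
characters and should be read with caution.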

One reason that the Central Asian region did so poorly as an "origin" region
was, I think, that most of the value of many of its haplotypes was reduced or
eliminated by the high number of samples from that region. That meant that
more defining haplotypes predominated in other regions, extending Central
Asia's phylogenic "distance" from the root region, Africa, to a sub-branch of
a sub-branch of a sub-branch.

Steve