Mathematical and computational approaches to linguistic phylogeny

<http://www.pims.math.ca/birs/birspages.php?task=displayevent&event_id=06frg044>

May 27 - June 3, 2006
Organizers: Steve Evans (University of California, Berkeley).

Confirmed Participants

Objectives
Quantitative approaches to linguistic phylogeny have been the subject
of considerable recent activity. For example, all of the participants
were invitees to a meeting on "Phylogenetic Methods and the
Prehistory of Languages" that was held in July 2004 at the McDonald
Institute for Archaeological Research, Cambridge, U.K. Ringe, Warnow,
Evans, Nichols and Nicholls are invitees to a meeting on a similar
theme in March 2005 at the Program for Evolutionary Dynamics at
Harvard.

Mathematically, there has been an absence of sensible stochastic
models for the evolution of lexical, phonological and morphological
linguistic characters. "Off-the-shelf" models from biological
sequence evolution are clearly not appropriate, and new models are
required that address issues of effectively infinite state spaces,
lack of reversibility, and differing degrees of homoplasy (that is,
back-mutation or parallel evolution). Recent work by Warnow, Evans
and Ringe has begun to address this issue. Links to work by the group
of Evans, Nakhleh, Ringe, Warnow and their collaborators can be found
at their Computational Phylogenetics in Historical Linguistics
website:

http://www.cs.rice.edu/~nakhleh/CPHL/
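
To make the modelling issues above concrete, the following is a minimal
sketch of one character evolving down a rooted tree. It is not the
model of Warnow, Evans and Ringe; the function name evolve, the
parameters change_prob and homoplasy_prob, and the four-language tree
are all assumptions made for illustration. A change normally creates a
state never seen before (an effectively infinite state space with no
reversibility), and with a small probability it instead reuses a state
that has already arisen, a crude stand-in for homoplasy.

```python
import random


def evolve(tree, change_prob, homoplasy_prob, rng, state=0, used=None):
    """Evolve one character down a rooted tree given as nested tuples.

    Leaves are strings (hypothetical language names). On each edge the
    state changes with probability change_prob. A change creates a
    brand-new state unless, with probability homoplasy_prob, it copies a
    state that has already arisen earlier in the traversal.
    Returns a dict mapping leaf name -> state.
    """
    if used is None:
        used = [state]
    if isinstance(tree, str):
        return {tree: state}
    leaves = {}
    for child in tree:
        child_state = state
        if rng.random() < change_prob:
            others = [s for s in used if s != state]
            if others and rng.random() < homoplasy_prob:
                # Homoplasy: reuse a state that already exists.
                child_state = rng.choice(others)
            else:
                # Innovation: a state that has never occurred before.
                child_state = max(used) + 1
                used.append(child_state)
        leaves.update(
            evolve(child, change_prob, homoplasy_prob, rng, child_state, used)
        )
    return leaves


# Hypothetical four-language tree and parameter values.
tree = (("A", "B"), ("C", "D"))
rng = random.Random(0)
no_change = evolve(tree, 0.0, 0.0, rng)   # every leaf keeps the root state
all_new = evolve(tree, 1.0, 0.0, rng)     # every edge innovates, no reuse
sample = evolve(tree, 0.3, 0.1, rng)      # a mixed, random outcome
```

Setting homoplasy_prob to zero gives a homoplasy-free character;
letting it differ between characters is one simple way to express
"differing degrees of homoplasy" in a simulation.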

Statistically, linguistic phylogeny raises problems with the quality
of data (the data has gone through considerable human pre-processing,
so there are important sampling questions to be addressed) and its
heterogeneity (the evolutionary processes for different linguistic
characters are probably quite different and so there are difficult
issues to resolve around how one can accommodate such variation
without introducing models that are too parameter-rich for adequate
inference). The Evans, Nakhleh, Ringe and Warnow group has also made
some progress in this direction.

Computationally, the number of taxa (that is, languages) involved in
most data sets of interest is sufficiently great that naive
approaches to model fitting and inference by exact maximum likelihood
or Bayesian methods are infeasible. This is even the case for
non-statistical reconstruction procedures such as maximum parsimony and
maximum compatibility. There is thus a need for clever heuristic
divide-and-conquer strategies for the optimizations inherent in
maximum parsimony, maximum compatibility and maximum likelihood, and
for appropriate Markov chain Monte Carlo (MCMC) techniques in
Bayesian analysis. Warnow has been the main developer of the family
of disk covering methods (DCMs), the most competitive divide-and-
conquer algorithms. Nicholls is a major figure in the field of MCMC
applied to Bayesian inference, particularly with respect to dating
problems.
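
As a small illustration of why these optimizations are hard, the sketch
below scores a single character on one fixed tree using Fitch's
algorithm, the standard way to count the minimum number of state
changes for an unordered character on a rooted binary tree. Scoring
one tree is cheap; the difficulty referred to above lies in searching
over the enormous number of possible trees. The function name
fitch_score and the example data are hypothetical.

```python
def fitch_score(tree, leaf_states):
    """Minimum number of state changes for one character on a fixed tree.

    tree: rooted binary tree as nested 2-tuples with leaf names (strings).
    leaf_states: dict mapping leaf name -> observed character state.
    """
    def visit(node):
        # Returns (set of optimal states at this node, changes below it).
        if isinstance(node, str):
            return {leaf_states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can agree, so no extra change is needed here.
            return common, left_cost + right_cost
        # The children disagree: one change is forced on some edge.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical data: four languages and two lexical characters.
tree = (("A", "B"), ("C", "D"))
fits_tree = {"A": 1, "B": 1, "C": 2, "D": 2}
conflicts = {"A": 1, "B": 2, "C": 1, "D": 2}

score_fit = fitch_score(tree, fits_tree)        # 1 change
score_conflict = fitch_score(tree, conflicts)   # 2 changes
```

The first character needs one change on this tree and is compatible
with it; the second has two states but needs two changes, so it is
incompatible with this tree. Maximum parsimony seeks the tree
minimizing the total number of changes over all characters, and
maximum compatibility the tree with which the most characters are
compatible.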

All of the above quantitative work needs to be performed in close
collaboration with linguists who are not only familiar with the
primary data but are also sufficiently mathematically literate that
they can participate in the development of models and inferential
strategies. Moreover, there need to be several such linguists with
different perspectives -- be they on different language families (for
example, Ringe works on Indo-European languages whereas Poser mainly
studies North American languages) or on "deep time" relationships
between different language families (an interest of Nichols). Poser,
Ringe, Embleton and Nichols have all done major work on the
applications of statistical methodology to linguistic questions and
are extremely well-placed to play such a role. Having a group
balanced between four mathematicians/statisticians/computer
scientists and four quantitatively inclined linguists is the right mix
to make serious inroads into the large number of difficult
outstanding problems in this field.