This question of stratifying the Canon has fascinated me for the last
six months. But I've been frustrated to learn that whatever
methodology scholars have proposed, someone else has been able to
point to flaws in it.
For example, archaic word-endings in the verses were at one time
seized upon as evidence of antiquity. But then, as Peter and Sean
have pointed out to me, these archaisms may have been used either to
fit the meter or because the occasional archaism adds to the
literary quality of verse.
With the prose sections, as we all know, there are stock phrases and
even entire stock sections throughout the nikaaya-s, pointing to
there having been a redaction of earlier material. So linguistic
analysis of the nikaaya-s as we now have them would tell us only
about the final texts as they now appear in the Canon, and not about
the original, underlying material.
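To make the stock-phrase point concrete, here is a minimal Python sketch
of how a program might at least flag word sequences shared between
texts, so they could be set aside before any further analysis. The
function name shared_ngrams, the 4-word window, and the toy English
sentences are all my own assumptions for illustration; nothing here is
real sutta text or an established method.

```python
from collections import defaultdict

def shared_ngrams(texts, n=4, min_texts=2):
    """Map each word n-gram to the set of text names containing it,
    keeping only n-grams found in at least min_texts different texts."""
    seen = defaultdict(set)
    for name, text in texts.items():
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            seen[tuple(words[i:i + n])].add(name)
    return {gram: names for gram, names in seen.items()
            if len(names) >= min_texts}

# Invented English stand-ins (assumption), not actual canonical passages.
toy = {
    "text_a": "thus have i heard at one time the teacher was staying in the grove",
    "text_b": "thus have i heard at one time the teacher was walking near the river",
    "text_c": "a merchant once asked about the path",
}
stock = shared_ngrams(toy, n=4)
```

Even this only finds the repetition; it says nothing about which
occurrence, if any, is the earlier one.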
As for vocabulary, a further problem is that some words
(e.g. "dhamma") change their meaning over time. Your computer program
would have to be sophisticated enough to realize that this has
happened, and not lump all texts that contain the word "dhamma" into
one cluster.
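One very crude way a program might notice that a word is being used
differently is to compare the words that surround it in each text. The
sketch below is only that, a sketch: the helper names context_words and
context_overlap, the two-word window, and the toy sentences are my own
assumptions, and a small overlap score is a hint to look closer, not
proof of a change in meaning.

```python
def context_words(text, target, window=2):
    """Collect the words within `window` positions of each occurrence of target."""
    words = text.lower().split()
    ctx = set()
    for i, w in enumerate(words):
        if w == target:
            ctx.update(words[max(0, i - window):i])
            ctx.update(words[i + 1:i + 1 + window])
    return ctx

def context_overlap(text_a, text_b, target, window=2):
    """Jaccard overlap of the target word's contexts in two texts (0.0 to 1.0)."""
    a = context_words(text_a, target, window)
    b = context_words(text_b, target, window)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented toy sentences (assumption), not quotations from any text.
first = "he taught the dhamma to the monks"
second = "she taught the dhamma to the nuns"
third = "each dhamma arises and each dhamma ceases"

same_usage = context_overlap(first, second, "dhamma")
different_usage = context_overlap(first, third, "dhamma")
```

With real texts the contexts would be far noisier than this, so any
threshold would have to be chosen and checked by hand.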
Another approach people have tried is to look at place names that
occur in the sutta-s. But there are problems with this too. For
example, I think there are four sutta-s in the MN -- can't remember
which ones -- where absolutely identical sutta-s are ascribed to four
different locations.
I have Warder's "Pali Metre" where he attempts to provide a chronology
of the verse material based on changing fashions in meter, but my
Pali isn't yet at the point where I can understand his arguments.
So, I don't want to put you off -- I'd love to see a project like
that succeed, and to know its results -- but I think you would have
to do a lot of thinking about the soundness of your algorithms before
you actually built your program.