Yong Peng wrote: "Programming languages such as Perl and Ruby have strong parsing and text handling capabilities that may lead one to think it is a simple task. However, with my understanding of Pali inflections, I appreciate its simplicity that most verbs and nouns follow simple rules to express usages (case, tense, number, person), but there is a long list of exceptions. Further, identifying grammatical gender (of nouns) and conjugational group (of verbs) is not an easy task. As for Sandhi, it is even more challenging, given the fact that any two arbitrary words can be joined. I find this an interesting idea, and please do let us know when you make progress."

Thank you for all the info everyone gave and sorry for the delay.

1. At first I am aiming to generate language rather than parse it, since generation is a lot easier than parsing.

2. I am aiming to store words and pieces of words in Ruby objects rather than as raw strings. Each type of object (phoneme, morph, word) has a set of methods given by the rules of phonology and grammar. Aspects such as Sandhi that require complicated rules and exceptions should be table driven.

3. The first goal is an online dictionary with a lot more coverage than Buddhadatta's, with a scrolling display that shows a word next to its alphabetical neighbors:

http://www.dicts.info/dictionary.php?k1=1&k2=442

4. How to extend existing dictionaries: starting with the entries of Buddhadatta's Pali-English dictionary, generate possible forms (decline nouns, conjugate verbs, etc.) and then find and check off the generated words in the "list of all Pali words found in the Tipitaka" file.

5. I hope to use the programming as a way of learning about aspects of Pali that I have been too afraid to look under the hood and investigate because of their complicated rules (I have ignored internal sandhi, phonology, and the derivation of basic forms from roots). Stepping through the rules with a computer program gives one an opportunity to grasp the complex nature of the rules and how they generate language.

6. Roderick Bucknell's Sanskrit manual reduces the generation of Sanskrit to operations on tables. That seems like a good approach to emulate, but it is a little short on sandhi. Is there an exhaustive list or study of Pali sandhi somewhere?

7. Search simplification: when searching the Tipitaka corpus, give the root of a word, generate its forms (e.g. decline the noun), then search with all of the forms.
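
To make points 2, 4 and 7 a little more concrete, here is a rough Ruby sketch of what I have in mind. It is only an illustration under several assumptions: the class name Noun, the constant A_STEM_MASC_ENDINGS and the helpers attested_forms and search_corpus are names made up for this example, the ending table holds just a handful of the endings for masculine nouns in -a, and the word list and corpus lines are tiny stand-ins for the real Tipitaka files. Sandhi and exceptions are not handled at all.

```ruby
require "set"

# Hypothetical, incomplete ending table for masculine nouns with stems in -a
# (e.g. "buddha"). A real table would be far larger and carry exceptions.
A_STEM_MASC_ENDINGS = {
  [:nominative,   :singular] => ["o"],
  [:accusative,   :singular] => ["aṃ"],
  [:instrumental, :singular] => ["ena"],
  [:genitive,     :singular] => ["assa"],
  [:locative,     :singular] => ["e", "asmiṃ"],
  [:nominative,   :plural]   => ["ā"],
  [:genitive,     :plural]   => ["ānaṃ"],
  [:locative,     :plural]   => ["esu"]
}.freeze

# Point 2: a noun kept as an object (dictionary form plus a table of endings)
# rather than as a raw string.
class Noun
  attr_reader :lemma

  def initialize(lemma, endings)
    @lemma = lemma
    @base = lemma.chomp("a") # drop the stem vowel: "buddha" -> "buddh"
    @endings = endings
  end

  # All forms for one case and number, looked up in the table.
  def forms(grammatical_case, number)
    @endings.fetch([grammatical_case, number]).map { |ending| @base + ending }
  end

  # Every form the table can generate, without duplicates.
  def all_forms
    @endings.values.flatten.map { |ending| @base + ending }.uniq
  end
end

# Point 4: keep only the generated forms that appear in a list of attested words.
def attested_forms(noun, word_list)
  noun.all_forms.select { |form| word_list.include?(form) }
end

# Point 7: return the lines that contain any generated form of the noun.
# Lines are split on whitespace only; real text would need proper tokenizing.
def search_corpus(noun, lines)
  wanted = noun.all_forms.to_set
  lines.select { |line| line.split.any? { |word| wanted.include?(word) } }
end

buddha = Noun.new("buddha", A_STEM_MASC_ENDINGS)
buddha.forms(:locative, :singular) # => ["buddhe", "buddhasmiṃ"]
```

Keeping the endings in a plain table means that a new declension class is just another table, with no new code, which is the same idea as driving Sandhi from a table of rules.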

Thanks,

Jon Fernquest