Dear Yong Peng,
> Again, anyone keen in helping to complete the entire tipitaka, please
> let me know.
Yes, I would very much like to help out.
But it will have to be slowly, because I work
at the online part of a newspaper, where I publish 65 articles
a month on economics and business in Thailand, so
my brain usually resembles scrambled eggs.
For an interesting example of lemmatization, check out volume I
(nouns.pdf) on the following site at INRIA:
http://sanskrit.inria.fr/
Every entry is a little information-preserving map between
the inflected Sanskrit form and the uninflected form with
the grammatical information that specifies the inflection.
Unfortunately, this doesn't exist for Pali yet.
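To make the idea concrete, here is a minimal sketch in Python of what
such entries might look like for Pali. The table is hand-written and
tiny, and the names (Analysis, LEMMA_MAP, lemmatize) are invented for
illustration; this is not how the INRIA tools are implemented.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    """One reading of an inflected form: its stem plus the grammar tags."""
    stem: str
    case: str
    number: str


# Hypothetical toy table for the masculine a-stem "dhamma".
# A form maps to a list because one spelling can have several readings.
LEMMA_MAP = {
    "dhammo": [Analysis("dhamma", "nominative", "singular")],
    "dhammaṃ": [Analysis("dhamma", "accusative", "singular")],
    "dhammena": [Analysis("dhamma", "instrumental", "singular")],
    "dhammassa": [
        Analysis("dhamma", "genitive", "singular"),
        Analysis("dhamma", "dative", "singular"),
    ],
}


def lemmatize(form: str) -> List[Analysis]:
    """Return every known analysis of `form`, or an empty list if unknown."""
    return LEMMA_MAP.get(form, [])


if __name__ == "__main__":
    for analysis in lemmatize("dhammassa"):
        print(analysis.stem, analysis.case, analysis.number)
```

Nothing is lost in either direction: the stem plus the tags is enough
to rebuild the inflected form, and the inflected form leads back to
the stem and tags.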
Have you seen the following sites?
http://sanskrit.inria.fr/
http://ralyx.inria.fr/2007/Raweb/signes/uid43.html
http://ralyx.inria.fr/2007/Raweb/signes/uid39.html
Author of tools:
http://pauillac.inria.fr/~huet/
Sanskrit has all the necessary tools written already,
albeit in CAML, a functional language,
including a declension engine that adds grammatical inflections
and a lemmatiser that maps inflected forms back to their basic (stem) form.
They don't appear to be open source.
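As a rough sketch of how the two pieces relate, a declension engine
can generate the inflected forms from a stem, and a lemmatiser can be
built by inverting that output. The Python below assumes only a small
subset of the endings for Pali masculine a-stems; the names
(A_STEM_ENDINGS, decline, build_lemmatiser) are hypothetical and have
nothing to do with the actual INRIA code.

```python
from collections import defaultdict

# Illustrative subset of endings for masculine a-stems (e.g. "buddha").
# A real engine would cover all cases, variant endings and other stem types.
A_STEM_ENDINGS = {
    ("nominative", "singular"): "o",
    ("accusative", "singular"): "aṃ",
    ("instrumental", "singular"): "ena",
    ("genitive", "singular"): "assa",
    ("nominative", "plural"): "ā",
    ("accusative", "plural"): "e",
    ("ablative", "singular"): "ā",
    ("locative", "singular"): "e",
}


def decline(stem):
    """Declension engine: map (case, number) to the inflected form."""
    if not stem.endswith("a"):
        raise ValueError("this sketch only handles stems ending in -a")
    base = stem[:-1]  # drop the stem vowel before attaching the ending
    return {tag: base + ending for tag, ending in A_STEM_ENDINGS.items()}


def build_lemmatiser(stems):
    """Lemmatiser: invert decline() into form -> [(stem, case, number)]."""
    index = defaultdict(list)
    for stem in stems:
        for (case, number), form in decline(stem).items():
            index[form].append((stem, case, number))
    return dict(index)


if __name__ == "__main__":
    index = build_lemmatiser(["buddha", "dhamma"])
    for reading in index["buddhe"]:
        print(reading)
```

A form such as "buddhe" comes back with more than one reading
(accusative plural and locative singular), which is why the index
keeps a list per form rather than a single answer.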
Institut national de recherche en informatique et en automatique (INRIA)
(English: National Institute for Research in Computer Science and
Control) is a French national research institution focusing on computer
science, control theory and applied mathematics.
http://en.wikipedia.org/wiki/INRIA
For looking up Pali words, a full dictionary
with fully inflected forms similar to this one
would be highly useful. Right now there are just
too many missing entries:
http://www.dicts.info/dictionary.php?k1=1&k2=442
With metta,
Jon Fernquest