Re: Norse Software: Morphological Parsing and What the List Could Do

Actually, I am only suggesting that we get the Text version of the ON
dictionary completed and that we consider assembling a corpus of Old
Norse texts.

I already have the software tools to generate the Surface form lists
once we have the Corpus. There are also other concordance and text
analysis tools available that are far more sophisticated than my
simple program; KWIC (Keyword in Context) tools are but one example.
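
For anyone unfamiliar with these tools, here is a rough Python sketch of
what a surface form list and a KWIC view amount to. It is an
illustration only, not the program mentioned above; the sample sentence
and the function names are made up for the example.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical sample text; a real corpus would be read from files.
SAMPLE = "Maðr hét Ketill. Ketill átti son, ok hét sá maðr Björn."

def tokenize(text):
    """Return lowercase word tokens (runs of letters) from the text."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)

def surface_form_counts(text):
    """Count how often each surface form occurs."""
    return Counter(tokenize(text))

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

if __name__ == "__main__":
    for form, n in sorted(surface_form_counts(SAMPLE).items()):
        print(n, form)
    for left, kw, right in kwic(SAMPLE, "ketill"):
        print(f"{left:>20} [{kw}] {right}")
```

The sorted output of surface_form_counts is the surface form list; kwic
shows each occurrence of one form with a few words of context.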

As I mentioned to Keth, I can already generate the Word Lists he
proposes once we have the Dictionary in text format and I've imported
it into a database.
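
To give an idea of what that involves, here is a minimal sketch using
Python and SQLite rather than the tools I actually use. The
tab-separated "head form, part of speech, gloss" layout is purely an
assumption about what a text dictionary might look like, and the four
entries are illustrative, not copied from Zoega.

```python
import sqlite3

# Assumed layout: head form <TAB> part of speech <TAB> gloss.
SAMPLE_LINES = [
    "hestr\tm.\thorse",
    "konungr\tm.\tking",
    "maðr\tm.\tman",
    "taka\tv.\tto take",
]

def build_lexicon(lines):
    """Load dictionary lines into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lexicon (head_form TEXT PRIMARY KEY, pos TEXT, gloss TEXT)"
    )
    rows = [tuple(line.split("\t")) for line in lines]
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def word_list(conn, pos=None):
    """Return head forms in sorted order, optionally for one part of speech."""
    if pos is None:
        cur = conn.execute("SELECT head_form FROM lexicon ORDER BY head_form")
    else:
        cur = conn.execute(
            "SELECT head_form FROM lexicon WHERE pos = ? ORDER BY head_form",
            (pos,),
        )
    return [row[0] for row in cur]

if __name__ == "__main__":
    conn = build_lexicon(SAMPLE_LINES)
    print(word_list(conn))        # every head form
    print(word_list(conn, "m."))  # masculine nouns only
```

Once the entries are in a table, any particular word list is just a
different query.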

I'd want to use normalized texts myself but the ideal corpus would
have every text in every available format for any conceivable use.
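
One small piece of normalization is the hooked-o versus umlaut-o issue
raised in my earlier message (quoted below). A sketch of how a text
could be brought in line with a dictionary that prints umlaut-o follows.
Which sources use which character is my guess, as stated in that
message, and the function name is made up for the example.

```python
import unicodedata

def ogonek_to_umlaut(word):
    """Replace hooked-o (o with ogonek) by umlaut-o, leaving the rest as is."""
    # Compose first, so that 'o' + combining ogonek becomes one character.
    word = unicodedata.normalize("NFC", word)
    return word.replace("\u01eb", "\u00f6").replace("\u01ea", "\u00d6")

if __name__ == "__main__":
    print(ogonek_to_umlaut("h\u01ebnd"))
```

Mapping in this direction only touches the hooked-o. Going the other way
would need more care if a text also uses umlaut-o for other sounds.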

The parser was designed as a component of a toolset for building rich
versions of ON texts for students and more casual readers. It
complicates things quite a bit if you try to allow for the variations
in spelling and inflection found in non-normalized texts.
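
For those who skipped the long description quoted below, the two-step
idea can be sketched in a few lines of Python. This is a toy
illustration, not my thesis code: the handful of ending rules covers
only a few forms of strong masculine nouns like hestr and is not taken
from Gordon, and the lexicon and shortcut table are tiny stand-ins.

```python
# Toy rules: (surface ending to strip, head-form ending to add).
RULES = [
    ("", ""),    # the surface form may already be the head form
    ("", "r"),   # acc. sg.  hest   -> hestr
    ("i", "r"),  # dat. sg.  hesti  -> hestr
    ("s", "r"),  # gen. sg.  hests  -> hestr
    ("ar", "r"), # nom. pl.  hestar -> hestr
    ("a", "r"),  # acc./gen. pl. hesta -> hestr
    ("um", "r"), # dat. pl.  hestum -> hestr
]

# Stand-in for the machine readable dictionary (the LEXICON).
LEXICON = {"hestr", "konungr", "armr"}

# Shortcut table: very short words are looked up, not parsed.
SHORT_WORDS = {"at": "at", "á": "á", "inn": "inn", "of": "of"}

def candidates(surface):
    """Step 1: generate every HEAD FORM the rules allow for a SURFACE FORM."""
    out = set()
    for surface_end, head_end in RULES:
        if surface.endswith(surface_end):
            stem = surface[:len(surface) - len(surface_end)]
            if stem:
                out.add(stem + head_end)
    return out

def parse(surface):
    """Step 2: keep only the candidates that occur in the lexicon."""
    surface = surface.lower()
    if surface in SHORT_WORDS:
        return {SHORT_WORDS[surface]}
    return candidates(surface) & LEXICON

if __name__ == "__main__":
    for word in ["hestum", "konungs", "at"]:
        print(word, "->", sorted(parse(word)))
```

Step 1 deliberately over-generates (hestum also yields the non-word
hestumr); the lexicon check in step 2 is what throws the false
candidates away. Stem mutation and spelling variation are exactly the
parts this toy leaves out.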

I've already forwarded all my files to Midnott Sol and thanks to Keth
now know of the existence of the zoega project email list, so I'll
follow up on that.

Regards,

Tom

--- In norse_course@..., "Steven T. Hatton" <hattons@...> wrote:
> Tom,
>
> Wow, that was a lot to think about! If I understand correctly, you would
> like to see a collective effort directed toward completing the proof of
> your OCR of Zoëga's dictionary. Also you seem to be suggesting that we
> create a collection of all SURFACE FORMS appearing in the existing ON
> texts. These SURFACE FORMS could then be linked to the appropriate
> entries in the electronic version of Zoëga's dictionary. You mentioned
> that Midnott Sol are working on a similar task. I've noticed they are
> looking for help in completing. I see no reason not to combine efforts.
> I suggest we contact Midnott Sol and try to coordinate efforts.
>
> If I'm not understanding what you want from us neophytes, please let me
> know. As far as normalizing the ON, I have to admit, I'm not sure what
> is involved. These are the two places I go for the ON originals:
>
> http://www.forn-sed.org/n-text/index.htm
> http://www.snerpa.is/net/fornrit.htm
>
> Are these normalized?
>
> What do we do next? Who will contact Midnott Sol? How can we access the
> Zoëga files you have? How do we keep things coordinated?
>
> Steven
>
> On Tuesday 07 August 2001 12:54, you wrote:
> > I am very interested in your work here Steven.
> >
> > My Master's Thesis was to create an Old Norse Morphological Parser.
> > Specifically, I created a two-step generative Morph. Parser for Old
> > Norse. I'll explain this directly. Sorry if this is old hat for the
> > linguists on the list. I am a better Computer scientist than ON
> > Scholar or linguist...
> >
> > The idea of a Morph Parser is to mimic the process that a student uses
> > in getting through a text of the language. The student encounters
> > words in the text that he or she is unfamiliar with and needs to look
> > them up in the dictionary. These words are known as SURFACE FORMS.
> > The problem is that in Indo-European languages like ON the occurring
> > SURFACE FORM will be some inflected form of the entry in the
> > dictionary which is called the HEAD FORM. (It will have case or verb
> > endings for instance and may have stem mutation.) The student has to
> > learn the inflection rules of the language in order to make a good
> > guess about the HEAD FORM from the SURFACE FORM in order to find it in
> > the dictionary.
> >
> > SURFACE FORM ????----> HEAD FORM
> >
> > My program takes a SURFACE FORM and returns a set of possible HEAD
> > FORMS. I used Gordon's as the source for the inflection rules for
> > my work.
> >
> > In a two-step generative Morph. Parser there is an extra step.
> > The collection of possible HEAD FORMS is filtered by checking it
> > against a machine readable dictionary called a LEXICON. HEAD
> > FORMS that don't occur in the language are rejected reducing the
> > number of false returns.
> >
> > The original work on this was done by a Finnish Computer Scientist,
> > Kimmo Koskenniemi, and there was a DOS shareware implementation called
> > PC-KIMMO that you could use to create Morph Parsers for whatever
> > language you wanted. You had to write the rules for the inflections
> > and create the lexicon.
> >
> > Using this method it is a trade off on how accurate you want it to be
> > and some other factors. Clearly, we could create an electronic
> > corpus of every occurring ON text and generate a Lexicon that had
> > every surface form that occurs in the literature. Then there would be
> > no need to try to encode the inflection rules, one would simply look
> > up the surface forms... (Given the fact that electronic storage is
> > now very cheap, this is not necessarily a bad approach...)
> > 
> > As some on the list pointed out, (and I am sure they are far better
> > linguists than I am), Gordon's is not the definitive description of ON
> > morphology. Any description of same however is simply an observation
> > of how the language works and will require exceptions for irregular
> > words etc.
> >
> > The focus of my interest in this was to create tools that would allow
> > me to generate very rich hyperlinked Norse Texts along the model of
> > the one created by the Aussie professor that was noted in the previous
> > msgs on this topic.
> > 
> > (BTW, I only have one year of College ON and that was taken after I
> > had completed the MS Thesis, so the work which is proof of concept,
> > definitely has its limitations... GRIN.)
> > 
> > What I think would be useful for students of Norse would be to
> > complete the project of building an electronic dictionary in text
> > format for the language and creating a corpus of normalized e-texts.
> > 
> > The current e-dictionaries are not text but consist of scanned images
> > of the pages of Zoega. For my thesis, I OCRd the entire Zoega but
> > the accuracy was only about 85% requiring that they be painstakingly
> > corrected by manual editing. I got the first 160 pages done in MS
> > Word format (about a year's work...) and then wrote a VB program that
> > used Office Automation to open the word files and parse the
> > individual definition entries so they could be stored in an Access
> > table to create the lexicon.
> > 
> > I found that it simplified the writing of the parser rules to create a
> > shortcut lookup table that simply looked up very short words without
> > trying to parse them:
> >
> > at, á, in, inn, of, etc...
> > 
> > I believe that Midnott Sol intends to create a text version of Zoega,
> > but I am not clear as to how organized this is and whether or not
> > there is any timeline for its completion... I certainly appreciate
> > that they have made the dictionary available in the image format.
> > 
> > There are some issues with standardization of the language. I chose
> > to use the normalized ON version that I thought was most likely for a
> > student to encounter and which I believe is to be found in the
> > definitive scholarly collection of the Sagas, the Islensk Fornrit (I'm
> > sure I spelt that wrong...) The texts that one finds on the net
> > range from Modern Icelandic versions to completely un-normalized
> > versions like the Bugge.
> > 
> > There is also (at least) one orthographical issue. That is the
> > hooked-o character. Some normalized texts use this character and
> > others use the umlaut-o to represent it. I think Zoega uses the
> > latter and Gordon the former...
> > 
> > Anyway, a useful goal for us would be the creation/assembling of the
> > corpus and the text version of the dictionary. The latter could be
> > used for many things besides the creation of rich ON hypertexts.
> > 
> > Not to be disrespectful, but why do the Icelanders convert the ON
> > texts into Modern Icelandic? I don't understand that...
> > 
> > Regards, 
> > 
> > Tom Wulf 
> > 
> > 
> >