I am very interested in your work here, Steven.

My Master's Thesis was to create an Old Norse Morphological Parser.
Specifically, I created a two-step generative Morph. Parser for Old Norse.
I'll explain this directly. Sorry if this is old hat for the linguists on
the list. I am a better Computer scientist than ON Scholar or linguist...

The idea of a Morph Parser is to mimic the process that a student uses in
getting through a text of the language. The student encounters words in the
text that he or she is unfamiliar with and needs to look them up in the
dictionary. These words are known as SURFACE FORMS. The problem is that in
inflected Indo-European languages like ON, the SURFACE FORM will usually be
some inflected form of the dictionary entry, which is called the HEAD FORM.
(It will have case or verb endings, for instance, and may have stem mutation.)
The student has to learn the inflection rules of the language in order to
make a good guess about the HEAD FORM from the SURFACE FORM in order to find
it in the dictionary.

SURFACE FORM ????----> HEAD FORM

My program takes a SURFACE FORM and returns a set of possible HEAD FORMS. I
used Gordon's as the source for the inflection rules for my work.

In a two-step generative Morph. Parser there is an extra step. The
collection of possible HEAD FORMS is filtered by checking it against a
machine-readable dictionary called a LEXICON. HEAD FORMS that don't occur
in the language are rejected, reducing the number of false returns.
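
A minimal sketch of the two steps, in Python. The handful of ending rules
and the three-word lexicon are illustrative assumptions only (they are not
the actual rule set taken from Gordon), and stem mutation is ignored:

```python
# Hypothetical ending rules: (surface ending, replacement giving a head-form guess).
# Illustrative only; a real rule set would be far larger and would handle umlaut.
ENDING_RULES = [
    ("s", "r"),    # gen. sg. of a strong masc. noun: hests -> hestr
    ("i", "r"),    # dat. sg.: hesti -> hestr
    ("ar", "r"),   # nom. pl.: hestar -> hestr
    ("um", "r"),   # dat. pl.: hestum -> hestr
    ("a", "r"),    # acc./gen. pl.: hesta -> hestr
    ("", ""),      # the surface form may already be the head form
]

# Toy lexicon standing in for the machine-readable dictionary.
LEXICON = {"hestr", "armr", "konungr"}


def generate_candidates(surface):
    """Step 1: undo every ending rule that matches, giving possible head forms."""
    candidates = set()
    for ending, replacement in ENDING_RULES:
        if surface.endswith(ending):
            stem = surface[: len(surface) - len(ending)]
            candidates.add(stem + replacement)
    return candidates


def parse(surface, lexicon=LEXICON):
    """Step 2: keep only the candidates that occur in the lexicon."""
    return generate_candidates(surface) & lexicon
```

Here generate_candidates("hestum") proposes both "hestr" and "hestum", and
the lexicon check in parse() throws away the second guess.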

The original work on this was done by a Finnish computer scientist,
Kimmo Koskenniemi, and there was a DOS shareware implementation called
PC-KIMMO that you could use to create Morph Parsers for whatever language you
wanted. You had to write the rules for the inflections and create the lexicon.

Using this method, there is a trade-off between how accurate you want the
parser to be and some other factors. Clearly, we could create an electronic
corpus of every extant ON text and generate a Lexicon that had every surface
form that occurs in the literature. Then there would be no need to try to
encode the inflection rules; one would simply look up the surface forms...
(Given that electronic storage is now very cheap, this is not necessarily a
bad approach...)

As some on the list pointed out (and I am sure they are far better
linguists than I am), Gordon's is not the definitive description of ON
morphology. Any such description, however, is simply an observation of how
the language works and will require exceptions for irregular words etc.

The focus of my interest in this was to create tools that would allow me to
generate very rich hyperlinked Norse Texts along the model of the one
created by the Aussie professor that was noted in the previous msgs on this
topic.

(BTW, I only have one year of college ON, and that was taken after I had
completed the MS thesis, so the work, which is a proof of concept, definitely
has its limitations... GRIN.)

What I think would be useful for students of Norse would be to complete the
project of building an electronic dictionary in text format for the language
and creating a corpus of normalized e-texts.

The current e-dictionaries are not text but consist of scanned images of the
pages of Zoega. For my thesis, I OCR'd the entire Zoega, but the accuracy was
only about 85%, so the pages had to be painstakingly corrected by hand. I got
the first 160 pages done in MS Word format (about a year's work...) and then
wrote a VB program that used Office Automation to open the Word files and
parse the individual definition entries so they could be stored in an Access
table to create the lexicon.

I found that it simplified the writing of the parser rules to create a
shortcut lookup table that simply looked up very short words without trying
to parse them:

at, á, in, inn, of, etc...
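
A rough sketch of how such a shortcut might sit in front of the rules, in
Python. The table contents, the function names, and the simplification that
each short word is its own head form are assumptions for illustration:

```python
# Assumed closed list of very short words that bypass the parser entirely.
# Simplification: each one is treated as its own head form.
SHORT_WORDS = {"at", "á", "in", "inn", "of"}


def lookup(surface, parse_with_rules):
    """Return possible head forms, trying the shortcut table before the rules."""
    if surface in SHORT_WORDS:
        return {surface}
    # Anything not in the table goes through the normal rule-based parser.
    return parse_with_rules(surface)
```

The parse_with_rules argument is whatever rule-based function you already
have; the shortcut only decides whether it gets called at all.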

I believe that Midnott Sol intends to create a text version of Zoega, but I
am not clear as to how organized this is and whether or not there is any
timeline for its completion... I certainly appreciate that they have made
the dictionary available in the image format.

There are some issues with standardization of the language. I chose to use
the normalized ON version that I thought a student was most likely to
encounter and which I believe is to be found in the definitive scholarly
collection of the Sagas, the Islensk Fornrit (I'm sure I spelt that
wrong...). The texts that one finds on the net range from Modern Icelandic
versions to completely un-normalized versions like the Bugge.

There is also (at least) one orthographical issue: the hooked-o
character. Some normalized texts use this character and others use the
umlaut-o to represent it. I think Zoega uses the latter and Gordon the
former...
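
For a corpus and a lexicon to match up, one spelling has to be picked. A
small sketch in Python that maps umlaut-o to hooked-o; the direction of the
mapping and the function name are my assumptions, and it is only safe for
texts where umlaut-o stands solely for the hooked-o (in Modern Icelandic
spelling it can also correspond to ON ø):

```python
import unicodedata

# U+00F6/U+00D6 are o-umlaut; U+01EB/U+01EA are o with ogonek (hooked o).
_TO_HOOKED_O = str.maketrans({"\u00f6": "\u01eb", "\u00d6": "\u01ea"})


def normalize_hooked_o(text):
    """Compose combining marks (NFC), then replace o-umlaut with hooked o."""
    return unicodedata.normalize("NFC", text).translate(_TO_HOOKED_O)
```

The NFC step is there so that an "o" followed by a separate combining mark is
treated the same as the single precomposed character.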

Anyway, a useful goal for us would be the creation/assembling of the corpus
and the text version of the dictionary. The latter could be used for many
things besides the creation of rich ON hypertexts.

Not to be disrespectful, but why do the Icelanders convert the ON texts into
Modern Icelandic? I don't understand that...

Regards,

Tom Wulf



-----Original Message-----
From: Steven T. Hatton [mailto:hattons@...]
Sent: Tuesday, August 07, 2001 8:02 AM
To: norse_course@yahoogroups.com
Subject: Re: [norse_course] Database Project

On Tuesday 07 August 2001 06:47, you wrote:
> Hello Steven!
> Your method of learning Old Norse is obviously to create
> software that mimics the morphology. Well, that certainly seems to be
> as valid a way of learning as any other. It also indicates a certain
> paradigm shift: "Understanding something means that you can write
> software that mimics it."

Keth,

Actually all it seems to mean is I can make a computer understand ON. {;-)>
Just because I transcribe the content of Gordon into some kind of computer
code doesn't mean *I* understand it. I will say that my learning style is
different from most people's. What I want is a learning tool that I and
others can use.

Imagine being able to read the Hávamál online and click on a word to get
links to definitions, explanations, cultural information, images, etc. Take
just the example of "sitja á fleti fyrir" from the first verse. This cannot
be correctly understood until the reader has a mental image of 10th century
Scandinavian houses. If one could click on some links and end up with a
description and picture representing "fleti" it would be very helpful, as
well as fun.

I haven't actually used this very much, but I do find it interesting:
http://www.engl.virginia.edu/OE/OEA/
Unfortunately I've had lots of problems using it with Linux, and I try to
avoid Microsoft products as much as I can.

> >That should give you some idea where I'm going. One place that looks like it
> >will be some work to figure out is the adverbs.
>
> But the adverbs aren't declined, are they?
> So they would be the easiest words to deal with.

Gordon does provide some categories and such for adverbs. I'm just not sure
exactly which divisions I should incorporate into the structure of my
outline. I have only glanced at that section. I'm still in the nouns where
I started.

> >I believe they can be broken
> >down in a way similar to the nouns. I'm simply going through Gordon's
> >*Accidence* chapter and trying to map things out until I have a place for all
> >the words. I'm sure things will come to mind as I go along. I would very
> >much like to be able to link to and from a dictionary, but that is way down
> >the road. I would also like to add some descriptive text similar to that
> >found in Gordon. I can't take too much directly from his book lest I commit
> >plagiarism.
>
> I have compared many Old Icelandic grammar books.
> Gordon's is among the shorter ones. It is very
> well done, and gives a very good overview, but
> compared to the others it is a bit on the short side.
>
> Merely from the observation of its brevity, I think
> one can draw certain conclusions about its adequacy,
> and that is that you will discover sooner or later that
> it is not complete. (You will find there are a lot
> of grammar problems that you meet in practical work that
> it fails to answer.)

My feeling is that Gordon's grammar section is very condensed. There's a lot
of information, but a person who is not already skilled in learning languages
will find it difficult to use. What I hope to do is expand it in such a way
that a person can get a better feel for its structure. I also want to
create a complete list of ON words and their grammatical roles and morphology.
I once wrote a program which ran through all the verses in the Eddas
creating a list of all the different words. Believe it or not, the list is
finite!
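
A program of that kind can be quite short. A rough sketch in Python, not the
original program; the function name and the choice to count any run of
letters as a word are assumptions:

```python
import re
import unicodedata


def distinct_words(text):
    """Return the set of distinct lower-cased words found in text."""
    # NFC keeps accented letters as single characters so \w+ does not split them.
    composed = unicodedata.normalize("NFC", text).lower()
    return set(re.findall(r"\w+", composed))
```

Feeding it the text of every verse and taking len() of the result gives the
size of the list.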

> The most complete one I have seen is the one by Adolf Noreen
> (Swedish linguist), but it is of course much more difficult
> than Gordon's book. (btw the "grammar" in Gordon is only
> a small section of his book, which is primarily an ON reader,
> but with a rather well written reference section for the grammar)

It's hard to find good ON resources in the States. Gordon is all I've found
so far.

> I think it is definitely a book that is very well suited for
> course work, where the instructor gives weekly (or daily?)
> assignments, and gives hints about how to use the reference
> section for solving the assigned problems. (for example
> weekly hand in problems would be a good way to run such a course)
> But it is not meant as a book for self study. (unless you
> are already well versed in grammar studies from other contexts)

I fully agree. I spent a lot of time looking for definitions and alternative
explanations for the concepts in Gordon.

My biggest problem is limiting the scope of my interests. I have a vision of
a database system that would hold grammatical information and a complete
vocabulary for all old Germanic languages (a sane person would never even
mention the idea of including all living Germanic languages!) Gordon's work
is obviously based on a model of Proto-Germanic with which I am not familiar.
It would be interesting and helpful to be able to click on a word in an
electronic text and pull up an interactive dictionary which would provide the
capability of exploring the connections the word has in other Germanic
languages.

> Here is a simpler project:
> Create an Old Norse spelling checker.
>
> Funny thing:
> I tried to look at the data file that is used by one of my
> English spell checkers. BUT: I did *not* find the words
> I had entered in the file.

There is probably a different file stored in your home directory with the
list you added.

> Apparently the spell checkers use
> a kind of algorithmic/numeric/binary tree-model for storing
> the data about what words are "valid" words in the language.
> Does any one know what the algorithm is?
>
> Best regards
> Keth

There are some open source spell checkers such as ispell. If you are brave
enough, and have an infinite amount of time to spend on the project, you may
want to crack that open and look at the code. I suspect there are some
highly sophisticated programs which deal with the structure of languages. I
certainly don't have time to explore the subject (unless I get paid for it),
but products such as babelfish and this http://www.translate.ru/eng
probably represent a well developed discipline. My biggest problem is that
the scope of these projects tends to grow exponentially and quickly. I end up
putting a lot of effort into something which doesn't amount to much in the
end. Be careful.


Steven


Sumir hafa kvæði...
...aðrir spakmæli.

- Keth

Homepage: http://www.hi.is/~haukurth/norse/
