Tom,

Wow, that was a lot to think about! If I understand correctly, you would
like to see a collective effort directed toward completing the proof of your
OCR of Zoëga's dictionary. Also you seem to be suggesting that we create a
collection of all SURFACE FORMS appearing in the existing ON texts. These
SURFACE FORMS could then be linked to the appropriate entries in the
electronic version of Zoëga's dictionary. You mentioned that Midnott Sol are
working on a similar task, and I've noticed they are looking for help in
completing it. I see no reason not to combine efforts, so I suggest we
contact Midnott Sol and try to coordinate.

If I'm not understanding what you want from us neophytes, please let me know.
As far as normalizing the ON goes, I have to admit I'm not sure what is
involved. These are the two places I go for the ON originals:

http://www.forn-sed.org/n-text/index.htm
http://www.snerpa.is/net/fornrit.htm

Are these normalized?

What do we do next? Who will contact Midnott Sol? How can we access the
Zoëga files you have? How do we keep things coordinated?

Steven

On Tuesday 07 August 2001 12:54, you wrote:
> I am very interested in your work here, Steven.
>
> My Master's Thesis was to create an Old Norse Morphological Parser.
> Specifically, I created a two-step generative Morph. Parser for Old
> Norse. I'll explain this directly. Sorry if this is old hat for the
> linguists on the list. I am a better computer scientist than ON scholar
> or linguist...
>
> The idea of a Morph. Parser is to mimic the process that a student uses
> in getting through a text of the language. The student encounters words
> in the text that he or she is unfamiliar with and needs to look them up
> in the dictionary. These words are known as SURFACE FORMS. The problem is
> that in Indo-European languages like ON the occurring SURFACE FORM will
> be some inflected form of the entry in the dictionary, which is called
> the HEAD FORM. (It will have case or verb endings, for instance, and may
> have stem mutation.) The student has to learn the inflection rules of the
> language in order to make a good guess about the HEAD FORM from the
> SURFACE FORM and so find it in the dictionary.
>
> SURFACE FORM ????----> HEAD FORM
>
> My program takes a SURFACE FORM and returns a set of possible HEAD FORMS.
> I used Gordon's as the source for the inflection rules for my work.
>
> In a two-step generative Morph. Parser there is an extra step. The
> collection of possible HEAD FORMS is filtered by checking it against a
> machine-readable dictionary called a LEXICON. HEAD FORMS that don't occur
> in the language are rejected, reducing the number of false returns.
>
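If I follow the description, the two steps could be sketched roughly like this in Python. This is only a toy under loud assumptions: the suffix rules are a hand-picked handful for one strong masculine pattern, the lexicon is three words, and the names SUFFIX_RULES, LEXICON, candidate_head_forms and parse are my own inventions, not anything from Tom's parser or PC-KIMMO. Stem mutation is ignored entirely.

```python
# Hypothetical, heavily simplified rules: (surface ending, head-form ending).
# Real ON inflection also needs stem mutation (umlaut), which is ignored here.
SUFFIX_RULES = [
    ("", ""),     # the surface form may already be the head form
    ("", "r"),    # accusative singular: "hest"   -> "hestr"
    ("s", "r"),   # genitive singular:   "hests"  -> "hestr"
    ("i", "r"),   # dative singular:     "hesti"  -> "hestr"
    ("ar", "r"),  # nominative plural:   "hestar" -> "hestr"
    ("a", "r"),   # acc./gen. plural:    "hesta"  -> "hestr"
    ("um", "r"),  # dative plural:       "hestum" -> "hestr"
]

# Toy LEXICON standing in for a machine-readable dictionary.
LEXICON = {"hestr", "armr", "konungr"}


def candidate_head_forms(surface):
    """Step 1: generate every head form the rules allow for a surface form."""
    candidates = set()
    for surface_ending, head_ending in SUFFIX_RULES:
        if surface.endswith(surface_ending):
            stem = surface[: len(surface) - len(surface_ending)]
            candidates.add(stem + head_ending)
    return candidates


def parse(surface, lexicon=LEXICON):
    """Step 2: reject candidates that do not occur in the lexicon."""
    return sorted(c for c in candidate_head_forms(surface) if c in lexicon)


if __name__ == "__main__":
    print(sorted(candidate_head_forms("hestum")))  # includes false guesses
    print(parse("hestum"))                         # only lexicon entries survive
```

Step 1 deliberately over-generates (for "hestum" it also proposes "hestum" and "hestumr"); the lexicon check in step 2 is what throws the false guesses away.
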
> The original work on this was done by a Finnish computer scientist,
> Kimmo Koskenniemi, and there was a DOS shareware implementation called
> PC-KIMMO that you could use to create Morph. Parsers for whatever
> language you wanted. You had to write the rules for the inflections and
> create the lexicon.
>
> With this method there is a trade-off between how accurate you want the
> parser to be and some other factors. Clearly, we could create an
> electronic corpus of every occurring ON text and generate a Lexicon that
> had every surface form that occurs in the literature. Then there would be
> no need to try to encode the inflection rules; one would simply look up
> the surface forms... (Given the fact that electronic storage is now very
> cheap, this is not necessarily a bad approach...)
>
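Taking the "just look up every surface form" idea literally, I imagine something along these lines. It is a sketch only: the two sample sentences are made up by me (not taken from any saga), and TEXTS, SURFACE_TO_HEAD and the function names are hypothetical placeholders for the real corpus and the hand-checked links to dictionary entries.

```python
import re
from collections import defaultdict

# Made-up sample "corpus"; real input would be the normalized e-texts.
TEXTS = {
    "sample-1": "Konungr á hest.",
    "sample-2": "Hest á konungr.",
}

# Hypothetical hand-checked links from surface forms to dictionary head forms.
# One surface form may point at several entries ("á" is a good example).
SURFACE_TO_HEAD = {
    "konungr": ["konungr"],
    "hest": ["hestr"],
    "á": ["á", "eiga"],
}


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_surface_index(texts):
    """Map every surface form to the ids of the texts it occurs in."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for token in tokenize(text):
            index[token].add(text_id)
    return index


def head_forms(surface, links=SURFACE_TO_HEAD):
    """Pure table lookup: no inflection rules involved."""
    return links.get(surface.lower(), [])


def unlinked_forms(index, links=SURFACE_TO_HEAD):
    """Surface forms seen in the corpus that still need a dictionary link."""
    return sorted(form for form in index if form not in links)


if __name__ == "__main__":
    index = build_surface_index(TEXTS)
    print(sorted(index))
    print(head_forms("Hest"))
    print(unlinked_forms(index))
```

The unlinked_forms list would double as a to-do list for volunteers: every form in it is a word somebody still has to tie to a dictionary entry by hand.
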
> As some on the list pointed out (and I am sure they are far better
> linguists than I am), Gordon's is not the definitive description of ON
> morphology. Any description of same, however, is simply an observation of
> how the language works and will require exceptions for irregular words
> etc.
>
> The focus of my interest in this was to create tools that would allow me
> to generate very rich hyperlinked Norse texts along the model of the one
> created by the Aussie professor that was noted in the previous msgs on
> this topic.
>
> (BTW, I only have one year of college ON, and that was taken after I had
> completed the MS Thesis, so the work, which is a proof of concept,
> definitely has its limitations... GRIN.)
>
> What I think would be useful for students of Norse would be to complete
> the project of building an electronic dictionary in text format for the
> language and creating a corpus of normalized e-texts.
>
> The current e-dictionaries are not text but consist of scanned images of
> the pages of Zoega. For my thesis, I OCR'd the entire Zoega, but the
> accuracy was only about 85%, so the pages have to be painstakingly
> corrected by manual editing. I got the first 160 pages done in MS Word
> format (about a year's work...) and then wrote a VB program that used
> Office Automation to open the Word files and parse the individual
> definition entries so they could be stored in an Access table to create
> the lexicon.
>
> I found that it simplified the writing of the parser rules to create a
> shortcut lookup table that simply looked up very short words without
> trying to parse them:
>
> at, á, in, inn, of, etc...
>
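As I understand the shortcut table, the idea would be something like the sketch below. The table holds only the five words Tom lists, and mapping each one to itself as its head form is my assumption; SHORT_WORDS, lookup_head_forms and the rule_parser argument are hypothetical names, with rule_parser standing in for whatever rule-based parser handles the longer words.

```python
# Very short words are looked up directly instead of being parsed.
# Assumption: each listed word is its own head form.
SHORT_WORDS = {
    "at": ["at"],
    "á": ["á"],
    "in": ["in"],
    "inn": ["inn"],
    "of": ["of"],
}


def lookup_head_forms(surface, rule_parser):
    """Try the shortcut table first; fall back to the rule-based parser."""
    word = surface.lower()
    if word in SHORT_WORDS:
        return list(SHORT_WORDS[word])
    return rule_parser(word)


if __name__ == "__main__":
    def echo_parser(word):
        # Stand-in parser for the demo: returns the word unchanged.
        return [word]

    print(lookup_head_forms("Á", echo_parser))       # answered by the table
    print(lookup_head_forms("hestum", echo_parser))  # handed to the parser
```

The point of the table is that the inflection rules never see these words, so the rules do not need special cases to stop them from being stripped down to nothing.
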
> I believe that Midnott Sol intends to create a text version of Zoega, but
> I am not clear as to how organized this is and whether or not there is
> any timeline for its completion... I certainly appreciate that they have
> made the dictionary available in the image format.
>
> There are some issues with standardization of the language. I chose to
> use the normalized ON version that I thought a student was most likely to
> encounter, and which I believe is to be found in the definitive scholarly
> collection of the Sagas, the Islensk Fornrit (I'm sure I spelt that
> wrong...). The texts that one finds on the net range from Modern
> Icelandic versions to completely un-normalized versions like the Bugge.
>
> There is also (at least) one orthographical issue. That is the hooked-o
> character. Some normalized texts use this character and others use the
> umlaut-o to represent it. I think Zoega uses the latter and Gordon the
> former...
>
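For the hooked-o question, one option would be to fold everything to a single spelling before looking words up. Here is a small sketch that folds hooked-o to umlaut-o; choosing that direction is my assumption (it is the safe one, since going the other way would also rewrite any ö that never stood for hooked-o), and to_umlaut_o is a hypothetical name.

```python
import unicodedata

# Hooked-o (o with ogonek) -> umlaut-o, in lower and upper case.
HOOKED_O_TO_UMLAUT_O = str.maketrans({
    "\u01eb": "\u00f6",  # ǫ -> ö
    "\u01ea": "\u00d6",  # Ǫ -> Ö
})


def to_umlaut_o(text):
    """Fold hooked-o to umlaut-o so both spellings hit the same entry."""
    # NFC first, so "o" + combining ogonek becomes the single character ǫ.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(HOOKED_O_TO_UMLAUT_O)


if __name__ == "__main__":
    print(to_umlaut_o("h\u01ebnd"))    # precomposed hooked-o
    print(to_umlaut_o("ho\u0328nd"))   # "o" followed by a combining ogonek
```

Both the dictionary headwords and the words from the texts would need to go through the same function before they are compared.
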
> Anyway, a useful goal for us would be the creation/assembling of the
> corpus and the text version of the dictionary. The latter could be used
> for many things besides the creation of rich ON hypertexts.
>
> Not to be disrespectful, but why do the Icelanders convert the ON texts
> into Modern Icelandic? I don't understand that...
>
> Regards,
>
> Tom Wulf