Re: Stryzhak scans - General IT Strategy

From: Richard Wordingham
Message: 37401
Date: 2005-04-26

--- In cybalist@yahoogroups.com, "Peter P" <roskis@...> wrote:
>
> --- In cybalist@yahoogroups.com, "tgpedersen" <tgpedersen@...> wrote:
> >
> > This is advice on the general upload situation: The thing to do is
> > to get some OCR software, eg ABBYY
> > http://www.abbyy.com/
> > or IRIS
> > http://www.irisusa.com/
> > use it to recognize the text, then produce a .html, .txt or whatever
> > file and upload that instead. It reduces Mb's to kb's. Also, if it's
> > properly edited, there will no longer be any trouble interpreting
> > the contents.

> It is possible to produce a .pdf of the whole thing that would fit in
> the 'files' area, but since Richard has already provided a solution I
> think enough has been done.

Would it be any smaller? JPEG files are pretty well compressed to
begin with, and zipping made only a tiny difference to the size. The
only helpful trick I can think of would be to discard some of the
data, but that's tricky as scanners don't wrap around bound books and
there was some blurring at the edges.
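
The point about zipping can be checked with a rough sketch. It assumes random bytes as a stand-in for JPEG data (both are close to incompressible) and uses zlib's deflate, the same family of compression that zip normally uses; the sizes and sample text are made up for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (deflate, level 9)."""
    return len(zlib.compress(data, 9)) / len(data)


# Assumption: high-entropy random bytes approximate already-compressed JPEG data.
image_like = os.urandom(200_000)

# Plain text, by contrast, is highly redundant.
text_like = b"The quick brown fox jumps over the lazy dog.\n" * 4000

ratio_image = compression_ratio(image_like)
ratio_text = compression_ratio(text_like)

print(f"image-like data: {ratio_image:.3f} of original size")
print(f"text-like data:  {ratio_text:.3f} of original size")
```

The image-like ratio should come out at about 1, i.e. no gain, while the text shrinks to a small fraction, which is also why OCR'd text is so much smaller than the scans.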

> Text files of course are preferable, but then there is the same
> problem with 'special characters' and possible 'character encodings'.
> Many text based programs have real problems with reading anything
> that is not ASCII or some related version of it. Do we use UTF-8
> here? No.

UTF-8 works if you e-mail it to the group, and is permitted *in
extremis*. It doesn't work if you post via the web interface - it
gets mangled by a partial Windows-1252 to ISO-8859-1 conversion to
eliminate smart quotes. (6% or more of non-ASCII character sets get
mangled, with some characters being merged.) UTF-7 works well if you
want to exclude those who use the Internet Explorer browser (UTF-7
seems to work with the Outlook e-mail programs). However, in this
case, one issue was the mixture of scripts - and I wouldn't be
surprised if Cyrillic letters not in modern Russian added to the
confusion.
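
As an illustration only, here is one way such mangling could arise. The actual conversion done by the web interface is not known to me in detail; the sketch assumes a hypothetical byte-level filter that rewrites the Windows-1252 smart-quote bytes (0x91 to 0x94) as ASCII quotes without noticing that the input is UTF-8. The names `SMART_QUOTES`, `naive_quote_filter` and `round_trip` are invented for the example.

```python
# Hypothetical model, not the real web-interface code: Windows-1252
# smart-quote bytes are replaced by their ASCII equivalents.
SMART_QUOTES = {0x91: 0x27, 0x92: 0x27, 0x93: 0x22, 0x94: 0x22}


def naive_quote_filter(data: bytes) -> bytes:
    """Replace smart-quote bytes one by one, ignoring the text encoding."""
    return bytes(SMART_QUOTES.get(b, b) for b in data)


def round_trip(text: str) -> str:
    """Encode as UTF-8, pass through the filter, decode again."""
    raw = text.encode("utf-8")
    return naive_quote_filter(raw).decode("utf-8", errors="replace")


# Pure ASCII is untouched; U+0451 (bytes D1 91) and U+0452 (bytes D1 92)
# each lose their second byte to the filter.
for sample in ["plain ASCII", "\u0451", "\u0452"]:
    print(repr(sample), "->", repr(round_trip(sample)))
```

Under this assumed model, ASCII survives, while any character whose UTF-8 form contains one of the four bytes is broken, and two different letters can come out identical, which matches the "merged" symptom.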

At least one useful member (e.g. Peter Gray) can only read posts in
ASCII.

However, if someone did manage to convert the files to the post, the
conversion would belong in the files section.

Incidentally, note that attachments are *not* archived nowadays -
those who treat Cybalist as a web-board rather than a mailing list do
not see attachments.

Richard.