--- In qalam@yahoogroups.com, "Nicholas Bodley" <nbodley@...> wrote:
> On Fri, 25 Mar 2005 11:02:56 -0500, Richard Wordingham
> <richard.wordingham@...> wrote:
>
> > I've been doing some experiments (results at
> > http://groups.yahoo.com/group/JRW_test/messages ). The conclusion is
> > that for general character sets, the only general purpose workable way
> > from a browser window is for both sender and receiver to manually select
> > UTF-7. Unfortunately, this is not available from Internet Explorer 6.0,
> > at least not on Windows XP. (It is from Firefox, but not everyone may
> > use the browser they prefer.)

> I'm willing to select UTF-7 for some Qalam messages, if they don't cause
> horrid messes with ASCII folk. As well, I'd be glad to send a few
> characters, using any U+ code points (not sure whether 5-digit hex
> works...)

> IE is considered by many experts to be a serious security risk; I
> recommend not using it if you can possibly avoid doing so. I like Firefox;
> it has a particularly "clean", humane, notably-"common-sense" design...
> Make that "uncommon sense". (Please upgrade FF to 1.0.1, btw!) Also has a
> big bunch of extensions.

Our firm's IT support doesn't seem to be in any hurry to move to
Firefox, though. It would probably break too many Intranet applications.

> Summary results *on this computer*, reading the test messages, by message
> no.:
>
> I shouldn't have expected Opera's automatic encoding selector to work on
> these, and it didn't. However, selecting a matching UTF encoding (View -->
> Encoding ... ) worked nicely.
>
> 1. Some chars. rendered fine; I saw some peculiar pairs, as if there were
> trouble with utf-8. (Screen shots to JRW on request)

The fault should be affecting the 2nd to 5th characters in the 2nd
line of every block of 4 rows. That's how I identified the hex byte
values of 91 to 94 as those with problems. Various types of quote are
being inserted in their place, resulting in invalid byte sequences
for UTF-8. I therefore see a question mark followed by some sort of
quote. The question marks aren't produced when I read the messages as
e-mail.
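
To make the failure concrete, here is a rough Python sketch of what I think is happening. Two things in it are my assumptions, not established facts: that the bytes 91 to 94 are being treated as Windows-1252 (where they are the four curly quotes), and that the replacement is a plain ASCII apostrophe. The function name corrupt is made up for the illustration.

```python
# Sketch only: how replacing a continuation byte breaks a UTF-8 sequence.

# U+0111 is the 2nd character of the row starting at U+0110;
# its UTF-8 form ends in the byte 0x91.
original = "\u0111".encode("utf-8")

# In Windows-1252 (assumed culprit), bytes 0x91..0x94 are curly quotes.
quotes = bytes([0x91, 0x92, 0x93, 0x94]).decode("cp1252")


def corrupt(data: bytes) -> bytes:
    """Hypothetical gateway: swap bytes 0x91..0x94 for an ASCII apostrophe."""
    return bytes(0x27 if 0x91 <= b <= 0x94 else b for b in data)


damaged = corrupt(original)

# The lead byte 0xC4 is no longer followed by a continuation byte, so a
# tolerant decoder substitutes U+FFFD for it and then shows the quote.
shown = damaged.decode("utf-8", errors="replace")

print(original.hex(), [hex(ord(c)) for c in quotes], repr(shown))
```

A browser that draws the replacement character as a question mark would then show "?" followed by a quote, which matches what I see.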

Quick summary of UTF-8:
1) ASCII characters are unchanged.
2) Otherwise, the codepoint value is split big-endian across the bytes.
Each continuation byte carries 6 bits and has its high bits set to 10.
The first byte of an n-byte encoding has the high n bits set to 1 and
the next bit set to 0; its remaining 7 - n bits are the high bits of
the codepoint value. Thus an n-byte sequence (n>1) could encode values
up to but excluding
2**(6*(n-1) + 7 - n) = 2**(5*n + 1).
3) Only the shortest possible sequence of bytes is a legal UTF-8
encoding for a codepoint. The members of 'surrogate pairs' are not
legal codepoints in UTF-8 (unlike the Standard Compression Scheme for
Unicode, where they do seem to be legal).
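
The three rules above can be sketched in a few lines of Python. This is only an illustration of the summary, not a reference implementation; the name utf8_encode is made up, and I have capped the range at U+10FFFF, which the summary does not state but which is the limit of Unicode.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint following the three rules in the summary."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not legal in UTF-8")  # rule 3

    if cp < 0x80:
        return bytes([cp])  # rule 1: ASCII is unchanged

    # Rule 3: shortest form, i.e. the smallest n with cp < 2**(5*n + 1).
    n = 2
    while cp >= 1 << (5 * n + 1):
        n += 1

    # Rule 2: peel off 6 bits per continuation byte, low bits first.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6

    # Lead byte: n one-bits, a zero bit, then the remaining high bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | cp
    return bytes([lead] + tail[::-1])


print(utf8_encode(0x111).hex())
```

For U+0111 this gives the bytes C4 91, so the second byte falls squarely in the troublesome 91 to 94 range.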

Thus UTF-8 continuation bytes, which run from 80 to BF, collide with
the C1 control codes (80 to 9F).

> 2. Probably identical with 1.; I didn't check carefully, char. by char.

That was what I saw, thus exonerating the browsers.

> 3. First 8 rows rendered fine. Second 8 were "no-such-glyph-avail."
> symbols; I don't know why. Perhaps ArialUni limitations? I doubt that.

It's possible - it's the boundary between Latin Extended-A and Latin
Extended-B. I got my indices wrong when I generated the test
characters. I've now done a full 16 by 16 array, and no new problems
have shown up. You can probably check with the 'Character Map' utility.

It's getting difficult to do mass checks now, as it's getting
difficult to make a glyph undisplayable. Notepad searches for an
accommodating font, and Word 2002 won't change a character's font to a
font where the character doesn't exist. I have to look very carefully
to check that a character is supported by a font. Notepad seems to be
a better font finder than Internet Explorer, just as it's a better
glyph renderer. Probably memory overflow somewhere - could even be in
the font!

> > Sending e-mails in UTF-7 is a good way of making Internet Explorer users
> > feel excluded. All they can get is mojibake!

On the other hand, Outlook [Express] users can read UTF-7!

Richard.