Re: UTF-8 vs. ISO-8859-1(5) ?

On Thu, 20 Feb 2003, Arlie Stephens wrote:

> On Thu, Feb 20, 2003 at 06:21:20PM -0500, Steven T. Hatton wrote:
> >
> > I've been informed that others may have difficulty reading my posts
> > because I've been using UTF-8 encoding. Is this true?
>
> Some of your posts arrive as illegible gobbledygook on my Linux system.

The same is true on my Linux system.

> > I don't do Windoze so I really don't understand why people would
> > not be using, or at least be able to use UTF-8, and switch
> > between these encodings as required.

Windows handles Unicode and UTF-8 quite nicely. (At least, Windows 2000
and XP do. 98/ME can be a little balky.) I suspect many of the problems
turning up here come not from Windows systems, where Outlook and Outlook
Express should (I haven't tested) be fully Unicode aware, but from
intermediate systems that are doing translations or, more likely,
mangling.

> I have no idea what I would need to do to understand UTF-8. Given the number
> of people who apparently can't read it, I wouldn't want to do anything that
> might result in my system sending out messages in that form.
>
> I'm running Red Hat Linux 7.1, using mutt as my mail client, with North
> American settings.

My system can't display it mostly because I haven't been willing to
upgrade everything on my production server to the latest and greatest,
so my tools don't support UTF-8 properly. Perl and many other commonly
used tools only got really good UTF-8 handling in the last few months,
for instance. I have no idea when or if the Linux console will, or when
or if my favorite Telnet applications will.

Also, I've heard of a number of problems from folks on "pure" UTF-8
systems, as the new Red Hat apparently is: converting back and forth is
a terrible pain, because the system doesn't understand that your
existing data isn't all UTF-8 already.
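
For what it's worth, here is a minimal Python sketch of the kind of check
that conversion needs. The function name to_utf8 and the ISO-8859-1
fallback are my own assumptions, not anything Red Hat ships: bytes that
already decode as UTF-8 are left alone, and everything else is assumed to
be legacy data.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes, guessing whether it is already UTF-8.

    Hypothetical helper: the fallback encoding is an assumption and must
    match whatever the old data was really written in.
    """
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as legacy single-byte text.
        return raw.decode(legacy_encoding).encode("utf-8")


if __name__ == "__main__":
    old_data = b"caf\xe9"      # "cafe" with e-acute, stored as ISO-8859-1
    new_data = b"caf\xc3\xa9"  # the same word, already stored as UTF-8
    print(to_utf8(old_data))
    print(to_utf8(new_data))
```

It is only a heuristic. A legacy file can happen to be valid UTF-8 by
accident, so real data still wants a human looking it over.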

> > I'm able to read everything on the list except for my own posts when
> > they are quoted back to me by a person not using UTF-8. For me, the
> > idea of not using UTF-8 is dangerous. If I stick to UTF-8, I don't
> > need to make special accommodations when I write code such as XML
> > processing programs. I have lost weeks of valuable time fighting
> > character encoding problems.

I've lost weeks of valuable time doing the same thing. Unlike you, I tend
to stay with the MOST restrictive encoding. Only upon reading this group
did I decide that anything besides plain-vanilla ASCII was worth
struggling with. Now that ISO-8859-1/15 is more commonplace, that's been
a manageable struggle.

Sadly, the world we live in is usually too complicated to simply say,
"Well, this is the only <x> I'll use." If that were the case, I wouldn't
still have to manhandle carriage returns back and forth from Windows to
Unix to Mac, would I? =)

> Why is UTF-8 better than whatever I'm getting by default? (One of the
> ISO standards; I'm having trouble remembering the number right now,
> but it's probably the ISO-8859-1 that you mention below.)

UTF-8 allows more languages, and requires fewer horrible translations of
stuff from one character set to another. In the long run it is
technically superior. At the moment, I suspect there are too few people
set up to handle it properly to make it quite ready for widespread
adoption. As more software picks up support, it will become easier to use
transparently.

UTF-8 is more complex from a software point of view. In the ISO-8859
character sets, each character takes one byte (8 bits) of space. This
means each character set can hold at most 256 characters. UTF-8 gets
around this by having ways to use more than one byte for a character. It
only does that when it needs to, so plain ASCII letters are still one
byte long.
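
To make the byte counts concrete, here is a small Python sketch. The
sample characters and the helper name utf8_length are my own picks for
illustration; it just prints how many bytes UTF-8 spends on each one.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


# Illustrative samples (assumed, not from the thread), written as escapes
# so this file itself stays plain ASCII.
samples = [
    "A",           # U+0041, plain ASCII letter
    "\u00e9",      # U+00E9, e with acute accent (in ISO-8859-1)
    "\u20ac",      # U+20AC, euro sign (in ISO-8859-15, not ISO-8859-1)
    "\U0001d11e",  # U+1D11E, musical G clef (outside any ISO-8859 set)
]

if __name__ == "__main__":
    for ch in samples:
        print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")
```

The ASCII letter comes out as one byte and the others as two, three, and
four bytes, which is the "only when it needs to" behaviour described
above.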

Unless your software understands how to read it, though, it'll probably
see each extended character as two (or three, or even four) useless
pieces of junk. A lot of Unicode-aware software (Windows and Java, for
example) keeps text in memory as 16-bit units, two bytes per character
for most text, and only uses UTF-8 for transporting data.
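
Here is a rough Python sketch of what that junk looks like. The function
name misread_as_latin1 and the sample words are assumptions of mine; it
encodes text as UTF-8 and then decodes the bytes the way an ISO-8859-1
reader would.

```python
def misread_as_latin1(text: str) -> str:
    """Show how UTF-8 bytes look to software that assumes ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    # Plain ASCII survives untouched, because both encodings agree on it.
    print(misread_as_latin1("plain text"))
    # An accented letter turns into two unrelated characters.
    print(misread_as_latin1("caf\u00e9"))
    # The same letter as a single 16-bit unit: two bytes in UTF-16.
    print(len("\u00e9".encode("utf-16-le")))
```

Every ISO-8859-1 byte maps to some character, so nothing fails outright.
The reader just gets wrong characters with no warning.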

> > Are there other people who feel as strongly about this issue as I
> > do? Is it the consensus of the participants on this list that we
> > should use ISO-8859-1(5)? This will be difficult for me to
> > accomplish, but if I must, so be it.

Personally, I'm not strongly bothered by it. It just means that I see
your messages, and go, "Ack, garbage." and skip them unread. I suspect a
lot of folks will do that. (At least I knew /why/ they were garbage,
which I suspect many didn't.)

My mail client noted the character set and got on with things, and that's
pretty much what I did too. Because most of the characters map exactly
(all the normal letters) it's not a big deal if I do want to read the bulk
of what you have to say. However, this group is particularly likely to
include a lot of special characters, and at that point things get
unreadable quickly.

If you'd rather be understood, then you'll try and find a way to post that
will match the apparent bulk of the other readers. Perhaps a web-mailer,
so you don't have to reconfigure your entire system?

--
Louis Erickson - wwonko@... - http://www.rdwarf.com/~wwonko/

When a Banker jumps out of a window, jump after him -- that's where the
money is.
-- Robespierre