Nicholas Bodley <nbodley at speakeasy dot net> wrote:

> These started as single "smart quotes", a dubious term. Looking at
> page source, I see Win-1252 encoding (implicitly); the single quotes
> were ‘ and ’ (wrong syntax?), proprietary Microsoft
> encodings. E-mail programs that don't gracefully handle proprietary
> Microsoft encodings expand them into short strings of nonsense, but
> each time the phrase is quoted in e-mail, the number of nonsense
> characters at least doubles! As I type, there are about 38 characters
> for the closing quote.

Two issues here, neither particularly related to "writing systems" (but
that's typical for Qalam nowadays).

First, the Windows code pages (first 1252, then others) have been around
for over a decade, since well before Unicode gained commercial popularity.
1252 is a proprietary encoding based on Latin-1, but up until the
mid-'90s, proprietary encodings were the only way to get the directional
curly quotes, which are required for good typography (not exhibited in
this e-mail).

By using the term "smart quotes" to refer to these curly quotation
marks -- as well as to the technique by which Microsoft Word and other
editors figure out which straight quotes are "left" and which are
"right," and replace them with directional quotes as appropriate --
Microsoft seems to have created a terminological rope with which its
detractors delight in hanging it. Nicholas's "a dubious term" is mild
compared to the things some people have said about "smart quotes"; John
Walker of Fourmilab is perhaps the most infamous in his caricature of
them.

But the real problem with "smart quotes" is neither the existence of the
proprietary encoding nor the software that automatically converts
straight quotes to curly quotes, but rather e-mail clients that label
this 1252-encoded text as "Latin-1" or "ISO-8859-1." Microsoft software
used to be particularly bad at this, but my understanding is that recent
versions (available for use on any 32-bit Windows system, not just XP)
now tag text correctly.
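
To make the mislabeling concrete, here is a minimal Python sketch, for
illustration only. The sample bytes are invented; "cp1252" and "latin-1"
are Python's standard codec names. The same two bytes decode to curly
quotes under a 1252 label, but to invisible C1 control characters under
a Latin-1 label:

```python
# Illustration only: the sample bytes are invented for this example.
# 0x91 and 0x92 are the single curly quotes in Windows code page 1252.
raw = b"\x91smart quotes\x92"

as_1252 = raw.decode("cp1252")     # correct label: U+2018 ... U+2019
as_latin1 = raw.decode("latin-1")  # wrong label: U+0091 ... U+0092 (C1 controls)

print(repr(as_1252))
print(repr(as_latin1))
```

A receiver that trusts the Latin-1 label has no printable character to
show for those two bytes, which is where the trouble starts.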

Second, Nicholas's example demonstrates what happens when *any* e-mail
software misinterprets the encoding of a string, over and over again.
The e-mail software in question is reading UTF-8 code sequences as if
they were 1252 characters, then writing each of those "characters" back
in UTF-8. Using SC UniPad to copy Nicholas's "nonsense" text as UTF-8
and paste it back as 1252, three or four times, I can unwind the
sequence and retrieve the original single quotes.
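
The same round trip can be sketched in a few lines of Python. This is
only a model of the behavior described above, not the code of any actual
mail client; the function names mangle() and unwind() are hypothetical.
One caveat: Python's "cp1252" codec rejects the five byte values that
1252 leaves undefined, so other inputs may need more forgiving handling.

```python
# Hypothetical helpers modeling the misinterpretation described above.

def mangle(text: str, rounds: int = 1) -> str:
    """Read the UTF-8 bytes of text as if they were 1252, once per round."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text

def unwind(text: str, rounds: int = 1) -> str:
    """Reverse mangle(): write the text as 1252 bytes, read them as UTF-8."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text

closing_quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK
for n in range(5):
    print(n, len(mangle(closing_quote, n)))

# Four rounds of mangling followed by four of unwinding restore the quote.
print(unwind(mangle(closing_quote, 4), 4) == closing_quote)
```

Under these assumptions the closing quote grows to 3, 8, 18, and then 38
characters, which agrees with the count Nicholas gave after what would
be four rounds of quoting.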

Importantly, this has NOTHING to do with the Microsoft CP1252
extensions. This would happen with any Latin-1 characters outside the
ASCII range. It only shows up in this case because the English text in
question doesn't use any "extended" characters except for the curly
quotes. A simple "e with acute," encoded in Latin-1 (or 1252), would
get mangled in exactly the same way, although the expansion would be
less because "e with acute" is encoded with only 2 bytes in UTF-8, while
the curly quotes use 3.
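
A short Python sketch of that comparison, again only as a model of the
misinterpretation and not of any particular mail program (the name
mangled_length() is hypothetical):

```python
# Hypothetical helper: how long does one character become after its UTF-8
# bytes are misread as 1252 and re-encoded, a given number of times?

def mangled_length(ch: str, rounds: int) -> int:
    for _ in range(rounds):
        ch = ch.encode("utf-8").decode("cp1252")
    return len(ch)

samples = {
    "ASCII apostrophe": "'",
    "e with acute": "\u00e9",
    "right curly quote": "\u2019",
}

for name, ch in samples.items():
    print(name, [mangled_length(ch, n) for n in range(1, 5)])
```

The ASCII apostrophe never grows, "e with acute" starts at 2 characters,
and the curly quote starts at 3, so the quote stays ahead at every round.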

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/