Nicholas Bodley <nbodley at speakeasy dot net> wrote:

> Also of interest:
>> The e-mail software in question is reading UTF-8 code sequences as if
>> they were 1252 characters, then writing each of those "characters"
>> back in UTF-8.
> So *that* is what was happening.
> That code space, from ~129 to 159 decimal, was originally to be used
> for more control characters; there's a relatively obscure ISO
> standard for them.

Yes, but that's not the point. *None* of the code points from 128
upward are represented the same way in UTF-8 as in Latin-1 or CP1252:
each takes at least two bytes in UTF-8 but only one byte in the other
two. Any such character will be mangled by the process I described.
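
To make the difference concrete, here is a minimal Python sketch that
encodes one sample character three ways. The choice of U+00E9
(e-with-acute) and the variable names are mine, purely for
illustration; any code point from 160 to 255 should show the same
pattern.

```python
# Compare how a single character is stored in each encoding.
# U+00E9 is only a sample; the variable names are illustrative.
ch = "\u00e9"

latin1_bytes = ch.encode("latin-1")  # one byte:  E9
cp1252_bytes = ch.encode("cp1252")   # one byte:  E9
utf8_bytes = ch.encode("utf-8")      # two bytes: C3 A9

print(latin1_bytes.hex(), cp1252_bytes.hex(), utf8_bytes.hex())
# prints: e9 e9 c3a9

# Only the ASCII range (below 128) is stored identically in all three.
ascii_same = (
    "A".encode("utf-8") == "A".encode("latin-1") == "A".encode("cp1252")
)
```

Software that reads the two UTF-8 bytes as two single-byte characters
has already lost the original character.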

For example -- desperately trying to get back onto the subject of
writing systems -- here is what happens when you take a well-known
Spanish exclamation associated with bullfighters, and mangle it by
reading its UTF-8 bytes as CP1252 characters and writing them back out
in UTF-8:

    ¡Olé!
    Â¡OlÃ©!
    Ã‚Â¡OlÃƒÂ©!

Notice that the "Ol" in the middle and the exclamation point at the end
remain intact, while the inverted exclamation point and the e-with-acute
turn into progressively worse bit hash. (If you are not reading this
message in UTF-8 to begin with, then even the first line will be
damaged!) Notice also that *no* ISO 6429 control characters were
present in the original line, or even in the second line, so the result
up to this point would be the same for either Latin-1 or CP1252.
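
The mangling described above can be reproduced with a few lines of
Python. This is only a sketch of the misreading, not the code of the
e-mail software in question; the function name `mangle` and its
`wrong_charset` parameter are hypothetical.

```python
def mangle(text: str, wrong_charset: str = "cp1252") -> str:
    """Misread the UTF-8 bytes of `text` as `wrong_charset` characters.

    Hypothetical helper: it imitates software that receives UTF-8 but
    decodes it with the wrong single-byte charset.
    """
    return text.encode("utf-8").decode(wrong_charset)


original = "\u00a1Ol\u00e9!"   # inverted "!", "Ol", e-with-acute, "!"
once = mangle(original)         # 7 characters
twice = mangle(once)            # 11 characters

# The first pass comes out the same under Latin-1 and CP1252, because
# no byte in the range 0x80-0x9F is involved yet.
same_first_pass = mangle(original, "latin-1") == once

# The second pass differs: bytes 0x82 and 0x83 now appear, which are
# control characters in Latin-1 but printable characters in CP1252.
differs_second_pass = mangle(once, "latin-1") != twice
```

Each pass replaces every non-ASCII character with two or more others,
which is why the string grows from 5 characters to 7 and then to 11
while "Ol" and the final "!" never change.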

-Doug Ewell
Fullerton, California