--- suzmccarth <suzmccarth@...> wrote:

> Thanks for that hint about Vietnamese. It was on my
> list for this
> fall. However, I took a second look at all those
> accents and decided
> it would just have to wait.

The diacritics aren't too bad, and for Vietnamese you
need to distinguish diacritics that indicate a unique
vowel from those that indicate tone.

And with Google, the problem isn't so much that there
are diacritics in Vietnamese, more that different
keyboard layouts may represent the same letters using
different sequences of Unicode characters.

The key problem with Vietnamese Unicode searching on
Google is that there is no normalization. Most
Vietnamese input software uses a single discrete Unicode
character for each possible combination of vowel
and tone. Relatively straightforward.

Microsoft's Vietnamese keyboard in Win2000/XP uses
precomposed characters for each vowel and combining
diacritics for each tone.

No real problem for our internal projects since we
normalize input.
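
As a rough sketch of what normalizing buys you (the
choice of NFC and the sample word are my assumptions
for illustration, not a description of any particular
project's code), the two input styles give different
code point sequences that compare equal once normalized:

```python
import unicodedata

# "Việt" entered two ways (illustrative sample strings):
# one code point carrying both the vowel quality and the tone
fully_precomposed = "Vi\u1ec7t"
# precomposed vowel (ê) followed by a combining tone mark (dot below)
vowel_plus_tone = "Vi\u00ea\u0323t"

def normalize(text: str) -> str:
    """Return the text in NFC (assumed here; NFD would work equally
    well, as long as stored text and queries use the same form)."""
    return unicodedata.normalize("NFC", text)

# The raw strings differ, so a naive byte-for-byte search misses one.
print(fully_precomposed == vowel_plus_tone)
# After normalization they are the same string.
print(normalize(fully_precomposed) == normalize(vowel_plus_tone))
```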

But it is a problem when using Unicode tools and services
where normalization doesn't occur. Since there are web
pages out there using both input scenarios ... it's
necessary to search Google using both Microsoft's
keyboard layout and a non-Microsoft keyboard layout (the
latter accounts for more web pages).
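
Rather than switching keyboards, the two query forms
could be generated from one search term. This is only a
sketch: the function names and the list of tone marks
are my own, and the "vowel plus combining tone" form is
an approximation of what the Microsoft layout produces,
not a verified reproduction of it.

```python
import unicodedata

# Combining marks used for Vietnamese tones (assumed list):
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def vowel_plus_tone_form(text: str) -> str:
    """Keep vowel-quality marks composed into the letter, but leave
    tone marks as separate combining characters (hypothetical helper)."""
    out = []
    base, tones = "", ""
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Route each mark to the tone pile or the vowel pile.
            if ch in TONE_MARKS:
                tones += ch
            else:
                base += ch
        else:
            # A new base letter starts: flush the previous cluster.
            out.append(unicodedata.normalize("NFC", base) + tones)
            base, tones = ch, ""
    out.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(out)

def query_variants(term: str) -> set[str]:
    """Return the distinct encodings worth submitting as queries."""
    return {unicodedata.normalize("NFC", term), vowel_plus_tone_form(term)}
```

Each string in query_variants("Việt") would then be sent
as its own search, and the result lists merged.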

Add into the mix a number of older character
encodings/character sets out there ... ummm ... lots
of fun ... maybe six or seven different searches of
Google using the same search term?

Well, maybe later this year, when I'm playing with
federated searching more, I might be able to


