05/09/2005: "Reflections on Vietnamese Unicode input"
Microsoft's Vietnamese keyboard layout takes a different approach from the Vietnamese keyboard layouts developed by third-party developers. This has an interesting, if problematic, effect on searching some sites such as Google and BBC Tiếng Việt.
Microsoft's Vietnamese layout, which they introduced in Windows 2000, uses a unique character for each discrete vowel in Vietnamese and combining diacritics for the five tone markers. For instance, with the Microsoft keyboard layout, the letter "ế" would be entered as two characters, "ê" followed by a combining acute accent [U+00EA U+0301], while alternative keyboard layouts would enter the same Vietnamese letter as a single Unicode character [U+1EBF]. It would also be possible to fully decompose the letter and represent it as a base vowel and two combining diacritics [U+0065 U+0302 U+0301], although this form is rarely used.
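To make the difference concrete, here is a minimal Python sketch that lists the code points behind each of the three spellings. The dictionary `SPELLINGS` and the helper `code_points` are names made up for this illustration; the code point values are the ones quoted above.

```python
# Three spellings of the Vietnamese letter "ế", written with escapes so the
# individual code points stay visible in the source.
SPELLINGS = {
    "Microsoft layout": "\u00ea\u0301",        # ê + combining acute
    "precomposed": "\u1ebf",                   # single character ế
    "fully decomposed": "\u0065\u0302\u0301",  # e + combining circumflex + combining acute
}


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for name, text in SPELLINGS.items():
    print(f"{name}: {text} = [{code_points(text)}] ({len(text)} code points)")

# A plain string comparison sees three different strings,
# even though all three should display as the same letter.
distinct = len(set(SPELLINGS.values()))
print("distinct strings:", distinct)
```

All three strings should look identical on screen, but software that compares them code point by code point treats them as unrelated.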
According to the Unicode standard, however, these three representations are canonically equivalent, and software should treat them as the same. To make this possible, the Unicode standard introduced the concept of normalization. Normalization is a process of converting text to a single consistent form: either a fully precomposed string or a fully decomposed string. In the case of the letter "ế", that means converting it to either [U+1EBF] or [U+0065 U+0302 U+0301].
This allows text to be processed independently of the form in which the string was entered, i.e. Vietnamese typed with the Microsoft keyboard or with a non-Microsoft keyboard would be normalised and processed in the same form. This is an oversimplification, but hopefully sufficient to get the gist of what should happen.
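As a rough illustration of what normalization does to the letter "ế", the following sketch uses the `normalize` function from Python's standard `unicodedata` module. The variable names are only for this example.

```python
import unicodedata

microsoft_layout = "\u00ea\u0301"        # ê + combining acute
precomposed = "\u1ebf"                   # single character ế
fully_decomposed = "\u0065\u0302\u0301"  # e + combining circumflex + combining acute
spellings = [microsoft_layout, precomposed, fully_decomposed]

# Normalization Form C: compose as far as possible.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}

# Normalization Form D: decompose as far as possible.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# After normalization all three spellings collapse to one string per form.
print(nfc_forms == {precomposed})       # True
print(nfd_forms == {fully_decomposed})  # True
```

Either form works for comparison, as long as both sides of the comparison are normalized to the same one.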
Websites like Google, or the BBC, do not use normalization. This means that your choice of keyboard layout or input software will affect the results you obtain when you search.
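The effect can be reproduced on a very small scale. The sketch below is not how Google or the BBC actually search; `PAGES`, `naive_search` and `normalized_search` are hypothetical names for a toy substring search. It also assumes the Microsoft layout enters the dot-below tone in "việt" as a combining character [U+0323], in the same way it enters the acute accent.

```python
import unicodedata

# Hypothetical "indexed pages", stored in precomposed form.
PAGES = [
    "B\u00e1o ti\u1ebfng Vi\u1ec7t",  # "Báo tiếng Việt"
    "English-language news",
]

# "tiếng việt" typed as vowel + combining tone mark (assumed Microsoft-style input).
QUERY_MICROSOFT = "ti\u00ea\u0301ng vi\u00ea\u0323t"


def naive_search(query, pages):
    """Substring match on the raw code points, with no normalization."""
    return [page for page in pages if query.lower() in page.lower()]


def normalized_search(query, pages):
    """Substring match after normalizing both sides to Form C."""
    def key(text):
        return unicodedata.normalize("NFC", text).lower()

    wanted = key(query)
    return [page for page in pages if wanted in key(page)]


print(naive_search(QUERY_MICROSOFT, PAGES))       # no match
print(normalized_search(QUERY_MICROSOFT, PAGES))  # finds the Vietnamese page
```

The only difference between the two functions is the call to `unicodedata.normalize`, which is enough to make the query independent of the keyboard layout that produced it.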
For instance, I carried out a phrase search on Google this morning for "tiếng việt". The results were:
Microsoft keyboard: about 14,000 pages
Normalization Form C (fully precomposed): about 1,510,000 pages
Normalization Form D (fully decomposed): 0
Alternative Vietnamese input software includes:
* UniKey
* WinVNKey
