Alphabet Soup: The Internationalization of Linux, Part 1

Mr. Turnbull takes a look at the problems posed by different character sets and the need for standardization.
Text Input

Text input is in some senses the inverse of text display. But, because computers are much better at displaying graphics than reading them, it presents problems of its own.

Probably the easiest method of text input would be voice. However, voice-input technology is still in its infancy, and there are times when a direct textual representation is preferable, such as for mathematics and computer programming. A highly adaptive system could be created, but typed keyboard input will be faster and more accurate for some time. Similarly, although optical character recognition (OCR) and handwriting recognition are improving rapidly, keyboard input will also remain more efficient for large bodies of text. The difference between OCR and handwriting recognition is that OCR treats static two-dimensional data, whereas handwriting recognition has the advantage of dynamics; this is particularly important in recognizing handwritten ideographic characters in Oriental languages. However, both of these would be quite close to an exact inverse of text display; the system would map from the physical inputs to the internal encoding directly, without user intervention. The fact that voice and optical technologies are not available for Linux systems makes the point moot for the moment.

For practical purposes, most Linux systems are limited to keyboard input. Several problems are related to internationalizing keyboard input. The first is that most computer keyboards are well-designed for at most one language. U.S. computer keyboards are not well-adapted to produce characters in languages that use accents. But the obvious solution, which is to add keys for the accented characters, is inefficient for those languages which have many accented characters (Scandinavian) or context-dependent forms (Arabic). It is impossible for languages which use ideographic character sets such as Chinese Hanzi or complex syllabaries like Korean Hangul. What is to be done for languages such as Greek and Russian with their own alphabetic scripts that can conveniently be mapped on a keyboard, but which cannot be used for programming computers?

The solution is the creation of input methods which translate keystrokes into encoded text. For example, in GNU Emacs with Mule, the character “Latin small letter u with umlaut” can be input on U.S. keyboards via the two keystrokes "u. However, this presents the problem that text such as "upper, in which a quotation mark is followed directly by the letter u, cannot be input one keystroke per character. Furthermore, umlauts are not appropriate accents for consonants; even for vowels, languages vary as to which vowels may be accented with umlauts. So this usage of the keystroke " must be context dependent within the input stream and must be conditioned by the language environment.
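As a rough illustration of what "conditioned by the language environment" could mean, here is a minimal Python sketch of a prefix-style translation. It is not how Mule is implemented; the table UMLAUT_VOWELS, the language tags and the function translate are all assumptions made up for this example.

```python
import unicodedata

# Hypothetical table: which base letters may follow the " prefix in each
# language environment. The entries are illustrative, not exhaustive.
UMLAUT_VOWELS = {
    "de": "aouAOU",  # German: a, o and u take the umlaut
    "fi": "aoAO",    # Finnish: only a and o
    "en": "",        # English: " is never treated as an accent prefix
}
COMBINING_DIAERESIS = "\u0308"


def translate(keys, lang):
    """Translate a string of keystrokes into text for the given language."""
    allowed = UMLAUT_VOWELS.get(lang, "")
    out = []
    i = 0
    while i < len(keys):
        ch = keys[i]
        nxt = keys[i + 1] if i + 1 < len(keys) else ""
        if ch == '"' and nxt and nxt in allowed:
            # Combine base letter and accent into one precomposed character.
            out.append(unicodedata.normalize("NFC", nxt + COMBINING_DIAERESIS))
            i += 2
        else:
            # Not a valid accent sequence here: the keystroke stands for itself.
            out.append(ch)
            i += 1
    return "".join(out)
```

With these tables, translate('"uber', "de") yields the word with a u-umlaut, while the same keystrokes pass through unchanged in the "en" environment, and '"x' passes through unchanged everywhere because x is not a vowel that takes the accent.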

In the case of ideographic Oriental languages, the process is even more complex. Of course, it is possible to simply memorize the encoding and directly input code points, e.g., in hexadecimal. For ISO-2022-compatible encodings, it is more efficient to memorize the two-octet representation as a pair of ASCII characters. Although this method of input is very efficient, it takes intense effort and a lot of time to memorize a useful set of characters. Educated Japanese adults know about 10,000 characters; Unicode has 20,902. Furthermore, if you need a rare character that you have not memorized, the dictionary lookup is very expensive, since it must be done by hand. For these languages, the most popular methods involve inputting a phonetic transcription which the input method then looks up in an internal dictionary. The function which accepts keystrokes, produces the encoded phonetic transcription and queries the dictionary is often called a front-end processor, while the dictionary lookup is often implemented as a separate server process called the back end, dictionary server or translation server.
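To make the "pair of ASCII characters" idea concrete, the following Python sketch decodes such pairs using the standard iso2022_jp codec. The function names from_ascii_pairs and from_hex are invented for this illustration; the escape sequences are the ISO-2022-JP designations for JIS X 0208 and ASCII.

```python
# Escape sequences used by ISO-2022-JP to switch character sets.
TO_JISX0208 = b"\x1b$B"  # following octets are two-octet JIS X 0208 codes
TO_ASCII = b"\x1b(B"     # switch back to ASCII


def from_ascii_pairs(pairs):
    """Interpret each pair of ASCII characters as one JIS X 0208 character."""
    data = pairs.encode("ascii")
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of ASCII characters")
    return (TO_JISX0208 + data + TO_ASCII).decode("iso2022_jp")


def from_hex(code):
    """Same lookup, with the code point typed in hexadecimal, e.g. '3441'."""
    return from_ascii_pairs(bytes.fromhex(code).decode("ascii"))
```

For example, the four keystrokes 4A;z select the two characters of the word kanji (code points 3441 and 3B7A in hexadecimal), so from_ascii_pairs("4A;z") and from_hex("34413B7A") return the same string.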

Dictionary servers often define a complex protocol for refining searches. In Japanese, some ideographic characters have dozens of pronunciations and some syllables correspond to over 100 different characters. The input method must weed out candidates using context, in terms of characters that are juxtaposed in dictionary words and by using syntactic clues. Even so, it is not uncommon that rather sophisticated input methods will produce dozens of candidates for a given string of syllables. Japanese has many homonyms, often with syntactically identical usage; occasionally, even with the help of context, the reader must trust that the author has selected the right characters. An amusing example occurred recently in a church bulletin, where the Japanese word “megumi”, meaning “(God's) grace”, was transcribed into a pair of characters that could easily be interpreted as a suffix meaning “gang of rascals”. The grammatical usage was different from the noun “megumi”, but as it happened, it would have been acceptable in the context of that issue. Only the broader context of the church bulletin made the typographical error obvious.

Obviously, substantial user interaction is necessary. Most input methods for Japanese involve presenting the user with a menu of choices; however, the interaction goes beyond this. The input methods will give the user a means to register new words in the dictionary and often a way to specify the priority in candidate lists. Furthermore, dictionaries are pre-sorted according to common usage, but sophisticated input methods will keep track of each user's own style, presenting the candidates used most often early in the menu.
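The interaction just described can be sketched as a toy back end in Python. This is only an assumption-laden illustration, not the protocol of Wnn or any other real dictionary server: the class ToyDictionary, its methods and the sample entries are all hypothetical, and the initial ordering of the candidates is arbitrary.

```python
from collections import defaultdict

# Three Japanese words that share the reading "kami".
PAPER, GOD, HAIR = "紙", "神", "髪"
UP = "上"  # another word read "kami", used below as a user-registered entry


class ToyDictionary:
    def __init__(self, entries):
        # reading -> candidates, assumed pre-sorted by common usage
        self.entries = {reading: list(words) for reading, words in entries.items()}
        # (reading, word) -> how often this user has chosen the word
        self.user_counts = defaultdict(int)

    def candidates(self, reading):
        """Return the menu of choices, the user's favorites first."""
        base = self.entries.get(reading, [])
        # sorted() is stable, so ties keep the dictionary's own order.
        return sorted(base, key=lambda w: -self.user_counts[(reading, w)])

    def select(self, reading, word):
        """Record that the user picked this candidate from the menu."""
        self.user_counts[(reading, word)] += 1

    def register(self, reading, word):
        """Let the user add a word the dictionary does not know."""
        words = self.entries.setdefault(reading, [])
        if word not in words:
            words.append(word)


d = ToyDictionary({"kami": [PAPER, GOD, HAIR]})
d.select("kami", HAIR)   # the user picks "hair" twice...
d.select("kami", HAIR)
d.register("kami", UP)   # ...and registers a new word
menu = d.candidates("kami")  # HAIR now comes first, UP last
```

A real input method would also weigh surrounding context and syntax when ordering candidates; this sketch models only the per-user frequency and word registration mentioned above.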

Users often have preferences among input methods even for the relatively simple case of accented characters in European languages, so each user will want to make the choice himself. Furthermore, no current input method is useful for more than two or three languages. Wnn, a dictionary server originally developed for Japanese, also handles Chinese and Korean with the same algorithms, although each language is served by a separate executable. The implication for internationalization is that protocols for communicating between applications and input methods will be very useful, so that users may select their own favorite and even change methods on the fly if the language environment changes. In X11R6, this protocol is provided by the X Input Method (XIM) standard; however, no such protocol is currently available for the console.

Although detailed discussion of the input methods themselves is beyond the scope of this article, I will describe the most common approaches to user interfaces for input methods. First of all, for non-Latin alphabetic scripts, the keyboard will simply be remapped to produce appropriate encoded characters. Both X and the Linux console provide straightforward methods for doing this. For novice users, the key-caps will need to be relabelled as well; touch-typists won't even need that.
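Conceptually, remapping is nothing more than a lookup table from key to character. The Python sketch below shows the idea for Greek; the table is modelled loosely on the common Greek layout and should be treated as illustrative, and the name remap is an assumption of this example. It does not use the actual X or console keymap mechanisms.

```python
# Illustrative key-to-character table for lowercase Greek.
GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "u": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "j": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "w": "ς", "t": "τ", "y": "υ", "f": "φ", "x": "χ", "c": "ψ",
    "v": "ω",
}
TABLE = str.maketrans(GREEK)


def remap(keys, active=True):
    """Translate keystrokes through the table when the remapping is active."""
    return keys.translate(TABLE) if active else keys


word = remap("logow")              # typed on a U.S. keyboard
octets = word.encode("iso8859_7")  # one octet per character in ISO 8859-7
```

Keys without an entry in the table, such as digits and punctuation, pass through unchanged, and with active=False the keyboard behaves as an ordinary U.S. keyboard, which is what is needed for programming.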

For accented scripts, unless the number of accented characters is very small, it will not be possible to assign each one to its own key. One method of handling accents is the compose key, a special key which does not produce an encoded character itself but introduces a sequence of keystrokes which are interpreted as an accented character. Compose key methods typically need not be invoked or turned off by the user; they are simply active all the time. Since a special key is used, they do not interfere with the native language of the keyboard. The accent may be given a key of its own, but commonly some mnemonic punctuation mark is used, e.g., the apostrophe is mapped to the acute accent.
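A compose-key method can be sketched in a few lines of Python. The token COMPOSE, the ACCENTS table and the function apply_compose are assumptions for this illustration; real systems define their own compose sequences. Keystrokes are modelled as a list so that the compose key can be a distinct token rather than a printable character.

```python
import unicodedata

COMPOSE = "<compose>"  # hypothetical token standing for the compose key

# Mnemonic punctuation mark -> Unicode combining accent
ACCENTS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    '"': "\u0308",  # diaeresis (umlaut)
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
}


def apply_compose(keys):
    """Turn a sequence of key tokens into text, honoring compose sequences."""
    out = []
    it = iter(keys)
    for key in it:
        if key != COMPOSE:
            out.append(key)  # ordinary keys are unaffected
            continue
        accent = next(it, None)
        base = next(it, None)
        if accent in ACCENTS and base is not None:
            out.append(unicodedata.normalize("NFC", base + ACCENTS[accent]))
        else:
            # Unknown or incomplete sequence: emit the keys literally.
            out.extend(k for k in (accent, base) if k is not None)
    return "".join(out)
```

Because the accent is recognized only after the compose key, apply_compose(list("don't")) returns the text unchanged, whereas the sequence c, a, f, compose, apostrophe, e produces the word with an acute accent on the e.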

An alternative to the compose key is the dead-key method. Certain keys are called dead keys because they do not produce encoded characters; instead, they modify a contiguous character by placing an accent on it. Dead-key methods can be either prefix methods or postfix methods, depending on whether the modifier is entered before or after the base character. Obviously, these methods do interfere with input in other languages; a means of toggling them on and off is necessary.
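The difference between prefix and postfix dead keys, and the need for a toggle, can be seen in a small Python sketch. The function dead_key_input, its mode and enabled parameters and the DEAD_KEYS table are hypothetical names chosen for this example only.

```python
import unicodedata

# Dead key -> Unicode combining accent (illustrative selection)
DEAD_KEYS = {"'": "\u0301", "`": "\u0300", '"': "\u0308"}


def dead_key_input(keys, mode="prefix", enabled=True):
    """Apply dead keys to a string of keystrokes.

    mode="prefix":  the accent key is typed before the base letter.
    mode="postfix": the accent key is typed after the base letter.
    enabled=False:  the method is toggled off and keys pass through.
    """
    if not enabled:
        return keys
    out = []
    pending = None  # dead key waiting for its base letter (prefix mode)
    for ch in keys:
        if mode == "prefix":
            if pending is not None:
                if ch.isalpha():
                    out.append(unicodedata.normalize("NFC", ch + DEAD_KEYS[pending]))
                else:
                    out.append(pending + ch)  # nothing to accent: emit both
                pending = None
            elif ch in DEAD_KEYS:
                pending = ch
            else:
                out.append(ch)
        else:  # postfix
            if ch in DEAD_KEYS and out and out[-1].isalpha():
                out[-1] = unicodedata.normalize("NFC", out[-1] + DEAD_KEYS[ch])
            else:
                out.append(ch)
    if pending is not None:
        out.append(pending)
    return "".join(out)
```

In prefix mode the keystrokes 'ete give an accented first letter; in postfix mode ete' accents the last one. The interference mentioned above shows up when English text such as don't is typed with the method enabled: the apostrophe is consumed as an accent, so the result differs from what was typed unless the method is switched off with enabled=False.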

Compose key methods are analogous to the use of a shift key to capitalize a single letter; dead-key methods are like the use of shift lock. Which is better depends on user preference and the task. Keyboard remapping, combined with either the compose key or the dead-key method, is sufficient to handle all of the ISO 8859 family of character sets.


