Alphabet Soup: The Internationalization of Linux, Part 1

Mr. Turnbull takes a look at the problems posed by different character sets and the need for standardization.

What is Linux? Since you are reading this in the Linux Journal, you probably already know. Still, it is worth emphasizing that Linux is an open-source software implementation of UNIX. It is created by a process of distributed development, and a primary application is interaction via networks with other, independently implemented and administered systems. In this environment, conformance to public standards is crucial. Unfortunately, internationalization is a field of information processing in which current standards and available methods are hardly satisfactory. The temptation to forfeit conformance with (international) standards in favor of accurate and efficient implementation of local standards and customs is often high.

What is internationalization? It is not simply a matter of the number of countries where Linux is installed, although that is certainly indicative of Linux's flexibility. Until recently, although their native languages varied widely, the bulk of Linux users have been fluent in certain common not-so-natural languages, such as C, sh and Perl. Their primary purpose in using Linux has been as an inexpensive, flexible and reliable platform for software development and provision of network services. Of course, most also used Linux for text processing and document dissemination in their native languages, but this was a relatively minor purpose. Strong computer skills and hacker orientation made working around the various problems acceptable.

Today, many new users are coming to Linux seeking a reliable, flexible platform for activities such as desktop publishing and content provision on the World Wide Web. Even hackers get tired of working around software deficiencies, so a strong demand now exists for software that makes text processing in languages other than English simple and reliable, and that permits text to be formatted according to each user's native language and customs.

This process of adapting a system to a new culture is called localization (abbreviated L10N). Obviously, this requires provision of character encodings, display fonts and input methods for the input and display of the user's native language, but it also involves more subtle adjustments to facilities such as the default time system (12 hour or 24 hour), the calendar (are numerical dates given as MM/DD/YY as in the U.S., as DD/MM/YY, or year first as in the international standard, ISO 8601, which writes YYYY-MM-DD?), currency representation and dictionary sorting order. APIs for automatic handling of these issues have been standardized by POSIX, but many other issues, such as line-wrapping and hyphenation conventions, remain. Thus, localization is more than just providing an appropriate script for display of the language and, in fact, more than just supporting a language. American and British people both use the same language as far as computers can tell, but their currency symbols are different.
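As a rough illustration of the standardized locale APIs mentioned above, the following Python sketch reaches the C library's setlocale, strftime, localeconv and collation routines through Python's standard locale and time modules. The helper name describe_locale is made up for this example, and the locale names en_US.UTF-8 and en_GB.UTF-8 are assumptions: they work only where that locale data is installed, so the loop skips any that are missing.

```python
import locale
import time


def describe_locale(name):
    """Report how the named locale writes a date, a currency sign and a sort order.

    `name` is handed straight to setlocale(). Which names are installed
    varies from system to system; only "C" is guaranteed to exist.
    """
    locale.setlocale(locale.LC_ALL, name)
    # 1 March 1999: a Monday (weekday 0) and day 60 of the year.
    sample_day = (1999, 3, 1, 0, 0, 0, 0, 60, 0)
    conventions = locale.localeconv()
    return {
        "date": time.strftime("%x", sample_day),     # the locale's date layout
        "currency": conventions["currency_symbol"],  # the local currency sign
        "sorted": sorted(["b", "B", "a"], key=locale.strxfrm),  # collation order
    }


if __name__ == "__main__":
    # The two named locales are assumptions; they may not be installed.
    for name in ("C", "en_US.UTF-8", "en_GB.UTF-8"):
        try:
            print(name, describe_locale(name))
        except locale.Error:
            print(name, "is not installed on this system")
    locale.setlocale(locale.LC_ALL, "C")  # leave the process in a known state
```

In the default "C" locale the date comes out as 03/01/99 and the currency symbol is empty. Where the American and British locales are installed, they would be expected to differ in date order and currency symbol even though the language is nominally the same.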

Localization is facilitated by true internationalization, but can also be accomplished by patching or porting any system ad hoc. To see the difference, consider that a Chinese person who wishes to deal with Japanese in the Microsoft Windows environment has two choices: dual booting a Japanized Windows and a Sinified Windows, or using the Unicode environment, which is rather unsatisfactory and generally unsupported by applications. Each of these Windows versions is a localization; it is non-trivial to port applications from Japanized Windows to Sinified Windows, as the same binaries cannot be used. In an internationalized setup, one would simply need to change the fonts and input methods and translate the messages; these would be implemented as loadable modules (or separate processes). With respect to applications, the situation in Linux is, at best, somewhat better (especially from the standpoint of Asian users). However, the future looks very promising, because many groups are actively promoting internationalization and developing internationalized systems for the GNU/Linux environment.

Internationalization (abbreviated I18N) is the process of adapting a system's data structures and algorithms so that localizing the system to a new culture is a matter of translating a database and does not require patching the source. Of course, we would prefer the binaries to be equally flexible, but for reasons of efficiency or backward compatibility, localized versions may implement different data structures and algorithms. Although internationalization is more difficult than localization, once it is complete, the process of localizing the internationalized software to a new environment becomes routine. Furthermore, localization by its nature is not a strong candidate for standardization, because each new system to be localized to a particular environment brings its own new problems. Internationalization, on the other hand, is by definition a standard independent of the different cultural environments. An obvious extension is to jointly standardize those facilities common to many systems.
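The idea that localization should reduce to "translating a database" can be sketched in a few lines of Python. Everything here is hypothetical and chosen only for illustration: the LOCALE_DB table, its field names and the two helper functions are not taken from any real system. The point is that the code contains no culture-specific knowledge; all of it lives in the data.

```python
from datetime import date

# Hypothetical locale database: every culture-specific choice is data.
LOCALE_DB = {
    "en_US": {"date_order": ("month", "day", "year"), "date_sep": "/",
              "greeting": "Today is {date}."},
    "en_GB": {"date_order": ("day", "month", "year"), "date_sep": "/",
              "greeting": "Today is {date}."},
    "de_DE": {"date_order": ("day", "month", "year"), "date_sep": ".",
              "greeting": "Heute ist der {date}."},
}


def format_date(day, locale_name):
    """Format a date using only what the database says about the locale."""
    entry = LOCALE_DB[locale_name]
    parts = {
        "day": f"{day.day:02d}",
        "month": f"{day.month:02d}",
        "year": f"{day.year % 100:02d}",  # two-digit year, as in the text
    }
    return entry["date_sep"].join(parts[field] for field in entry["date_order"])


def greet(day, locale_name):
    """Fill the locale's message template with the locale's date format."""
    template = LOCALE_DB[locale_name]["greeting"]
    return template.format(date=format_date(day, locale_name))


if __name__ == "__main__":
    for name in LOCALE_DB:
        print(name, greet(date(1999, 3, 1), name))
```

Supporting a further locale in this sketch means adding one more entry to LOCALE_DB; neither function needs to be patched.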

Internationalization can be contrasted with multilingualization. Multilingualization (abbreviated M17N) is the process of adapting a system to the simultaneous use of several languages. Obviously more difficult than localization or even internationalization, multilingualization requires that the system not only deal with different languages, but also maintain different contexts for specific parts of the current data set.
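One way to picture "different contexts for specific parts of the current data set" is to store text as a sequence of runs, each carrying its own language tag. The Python sketch below is only an assumption about how such a structure might look; the Run class, the tag values and the helper functions are invented for illustration and do not describe how any particular multilingual system is implemented.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text together with the language context it belongs to."""
    text: str
    language: str  # illustrative tags such as "en", "ja", "zh"


def languages_used(runs):
    """Return the language tags in order of first appearance."""
    seen = []
    for run in runs:
        if run.language not in seen:
            seen.append(run.language)
    return seen


def extract(runs, language):
    """Join the text of every run tagged with the given language."""
    return "".join(run.text for run in runs if run.language == language)


# A single document mixing three languages.
document = [
    Run("Japanese writes ", "en"),
    Run("漢字", "ja"),
    Run(" where simplified Chinese writes ", "en"),
    Run("汉字", "zh"),
    Run(".", "en"),
]

if __name__ == "__main__":
    print(languages_used(document))
    print(extract(document, "ja"))
```

Because each run keeps its own tag, a display or input layer could in principle choose fonts and input methods per run rather than once for the whole document.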

Note that the operating system can be localized, internationalized or multilingualized while some or all applications are not, and vice versa. In a certain sense, Linux is a multilingual operating system; the kernel presents few hindrances to the use of different languages. However, most utilities and applications are limited to English by the availability of fonts and input methods, as well as by their own internal structures and message databases. Even the kernel panics in English. On the other hand, Emacs 20, in both the FSF version and the XEmacs variant, incorporates the Mule (MULtilingual Enhancement to GNU Emacs) facilities (see “Polyglot Emacs” in this issue). With the availability of fonts and, where necessary, internationalized terminal emulators, Emacs can simultaneously handle most of the world's languages. Many GNU utilities use the GNU gettext function (see “Internationalizing Messages in Linux Programs” in this issue), which supports a different catalog of program messages for each language.
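The per-language catalog lookup can be tried out with Python's standard gettext module, which is modeled on GNU gettext but is not the C function the GNU utilities call. In this minimal sketch the domain name alphabet_soup and the helper load_translator are hypothetical, and no catalog is actually installed, so the lookup falls back to returning the original English message.

```python
import gettext
import tempfile


def load_translator(domain, locale_dir, language):
    """Return a message-lookup function for one language.

    With fallback=True, a missing catalog yields a NullTranslations
    object, which passes every message through unchanged.
    """
    translation = gettext.translation(
        domain, localedir=locale_dir, languages=[language], fallback=True
    )
    return translation.gettext


if __name__ == "__main__":
    # An empty directory stands in for a locale tree with no catalogs.
    with tempfile.TemporaryDirectory() as empty_dir:
        # "alphabet_soup" is a hypothetical message domain.
        _ = load_translator("alphabet_soup", empty_dir, "ja")
        print(_("File not found"))
```

If a compiled catalog for the requested language were present under the locale directory, the same call would return the translated string instead, with no change to the program text.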


