Alphabet Soup: The Internationalization of Linux, Part 1
Processing text is an extremely complex and diverse field of application. Currently, most aspects of localization and therefore internationalization have to be handled by each application according to its own needs. Programmers who want their applications to have the broadest possible utility should pay attention to internationalization issues, using standard techniques such as gettext wherever possible. They should avoid optimizations, such as using high bits in bytes or unused code points in an encoding for non-character information, that might conflict with extension to a new character set. Where standards have not yet evolved, internationalization demands that programmers design their own protocols for application-specific functionality that needs localization. Wherever possible, complex operations on text should be localized to a few functions that can be generalized to other languages or made conformable if a new standard is adopted.
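As a minimal sketch of the gettext technique mentioned above, the following shows how a C program might mark a string for translation. The text domain name "alphabet-demo" and the catalog directory /usr/share/locale are assumptions made for illustration only; neither comes from the article, and a real package would use its own values.

```c
#include <libintl.h>
#include <locale.h>

/* Conventional shorthand: wrap every user-visible string in _( ). */
#define _(msgid) gettext(msgid)

/* Hypothetical domain name and catalog directory, chosen for this sketch. */
#define DEMO_DOMAIN     "alphabet-demo"
#define DEMO_LOCALE_DIR "/usr/share/locale"

/* Select the user's locale and tell gettext where to find the catalogs. */
void init_messages(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain(DEMO_DOMAIN, DEMO_LOCALE_DIR);
    textdomain(DEMO_DOMAIN);
}

/* Returns the translation if a catalog supplies one for the current
 * locale; otherwise gettext hands back the original string unchanged. */
const char *greeting(void)
{
    return _("Hello, world");
}
```

Because an untranslated string falls through unchanged, a program written this way still works on a system with no message catalogs installed.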
Several areas have already been standardized: numeric formatting, date formatting, monetary formatting and sorting. Yes/no answers have also been standardized, but that facility is superseded by GNU gettext.
Each of these areas is served by one or more functions specified by the POSIX standard for the C standard library. Linux's libc has not historically provided them, but they are all more or less fully provided in version 2 of the GNU C library. These functions are controlled by locale, an environmental parameter encoding various cultural aspects of text processing.
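To make the standardized areas concrete, here is a small sketch using three standard C library calls whose results depend on the locale: strftime for dates, localeconv for numeric conventions and strcoll for sorting. The wrapper function names are invented for this example.

```c
#include <locale.h>
#include <string.h>
#include <time.h>

/* Format a calendar date in the style preferred by the current LC_TIME
 * locale. Returns the number of bytes written, or 0 if buf is too small. */
size_t format_local_date(char *buf, size_t size, int year, int month, int day)
{
    struct tm tm = {0};
    tm.tm_year = year - 1900;   /* struct tm counts years from 1900 */
    tm.tm_mon  = month - 1;     /* and months from 0 */
    tm.tm_mday = day;
    return strftime(buf, size, "%x", &tm);   /* %x: locale's date format */
}

/* The decimal separator of the current LC_NUMERIC locale. */
const char *decimal_separator(void)
{
    return localeconv()->decimal_point;
}

/* Compare two strings by the current LC_COLLATE rules rather than by
 * raw byte values. The result has the same sign convention as strcmp. */
int compare_for_sorting(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

In the portable C locale, format_local_date for 4 March 1999 yields 03/04/99 and the decimal separator is a period; other locales may give different results, depending on which locale definitions are installed.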
Locale is explicitly set in a program using the setlocale function. The current locale can be retrieved using the same setlocale function. Internally, each locale is divided into several parts which can be controlled separately. Users normally inform programs about their locale preferences using one or more environment variables (LANG, LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, LC_TIME, LC_MESSAGES).
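The two uses of setlocale described above can be sketched as follows. The helper name adopt_user_locale and the fallback to the C locale are choices made for this example, not requirements of the standard.

```c
#include <locale.h>
#include <string.h>

/* Adopt the locale requested by the user's environment, then copy the
 * name of the locale now in effect into buf.
 * Returns 0 on success, -1 if the name does not fit in buf. */
int adopt_user_locale(char *buf, size_t size)
{
    /* An empty string asks libc to consult LC_ALL, the individual
     * LC_* variables and LANG. */
    if (setlocale(LC_ALL, "") == NULL) {
        /* The environment names a locale that is not installed;
         * fall back to the portable locale. */
        setlocale(LC_ALL, "C");
    }

    /* A NULL second argument queries the locale without changing it. */
    const char *current = setlocale(LC_ALL, NULL);
    if (current == NULL || strlen(current) >= size)
        return -1;

    strcpy(buf, current);
    return 0;
}
```

The name is copied because the string returned by setlocale may be overwritten by a later call. Individual parts can be set the same way, for example setlocale(LC_NUMERIC, "C") to keep machine-readable number formatting while leaving the other categories alone.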
First of all, the convention for naming locales indicates the language, the regional subvariant and the encoding used. The “portable locale” is an exception; it has two names, C and POSIX. They have identical meaning. Two-letter abbreviations for language (ISO 639) and country (ISO 3166) have been standardized. So U.S. English, for example, may be specified as en_US.iso646-irv. (ISO 646-IRV is the version of the U.S. ASCII standard published by the ISO.) This has a slightly different meaning from en_US.iso8859-1, in that the latter specifies use of the ISO 8859-1 encoding, which would permit the direct inclusion of accented letters from German or French if the necessary fonts were available. The former does not. How these differences are handled will be implementation-dependent, and even within Linux the console driver handles this differently from X.
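The naming convention can be illustrated with a short sketch that splits a locale name into its language, region and encoding parts. The struct and function names are hypothetical. The sketch also assumes that anything after an "@" sign is an optional modifier to be ignored, a detail not covered in the text above.

```c
#include <string.h>

struct locale_parts {
    char language[16];    /* e.g. "en", or "C" / "POSIX" */
    char territory[16];   /* e.g. "US"; empty if absent */
    char encoding[32];    /* e.g. "iso8859-1"; empty if absent */
};

/* Copy the bytes in [start, end) into dst, truncating to fit. */
static void copy_range(char *dst, size_t size,
                       const char *start, const char *end)
{
    size_t len = (size_t)(end - start);
    if (len >= size)
        len = size - 1;
    memcpy(dst, start, len);
    dst[len] = '\0';
}

/* Split a name of the form language[_territory][.encoding].
 * Returns 0 on success, -1 for a NULL or empty name. */
int split_locale_name(const char *name, struct locale_parts *out)
{
    if (name == NULL || *name == '\0')
        return -1;

    const char *end = name + strlen(name);
    const char *at = strchr(name, '@');   /* assumed modifier: ignored */
    if (at != NULL)
        end = at;

    const char *dot = memchr(name, '.', (size_t)(end - name));
    const char *lang_end = dot ? dot : end;
    const char *under = memchr(name, '_', (size_t)(lang_end - name));

    copy_range(out->language, sizeof out->language,
               name, under ? under : lang_end);

    if (under)
        copy_range(out->territory, sizeof out->territory, under + 1, lang_end);
    else
        out->territory[0] = '\0';

    if (dot)
        copy_range(out->encoding, sizeof out->encoding, dot + 1, end);
    else
        out->encoding[0] = '\0';

    return 0;
}
```

Given en_US.iso8859-1, this fills in "en", "US" and "iso8859-1". The portable locale names C and POSIX come back as a language with empty region and encoding fields, so a caller would need to treat them as a special case.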
British English would be specified as en_GB.iso8859-1. (ISO 646-IRV would not be satisfactory here, as it does not include the pound currency sign.) The U.S. and British locales do not encode differences in things like spelling. Theoretically, ispell could take a hint from the LANG variable, but as far as I know it does not. The main differences are currency formatting and, of course, date formats: the U.S. uses MM/DD/YY, while Britain uses DD/MM/YY. Furthermore, Linux provides an English language locale for Denmark (reflecting the nationality of Keld Simonsen, who coordinates the locale library for WG15 of the ISO); this locale uses the Danish krone as currency unit and the ISO 8601 year-month-day order (YYYY-MM-DD) for dates.
Another example is the several Japanese locales. Because the Japanese language is widely used only in Japan, all of them start with ja_JP, but there are several commonly used encodings. The locale normally used on Japanese Linux systems is ja_JP.eucJP (using the EUC-JP encoding), but internationalized software running on Japanese MS Windows systems would presumably use the ja_JP.sjis locale (using Microsoft's Shift-JIS encoding for Japanese). The reason for the difference is that UNIX file systems are compatible with file names encoded in EUC-JP, while MS Windows file systems use file names in the somewhat different Shift-JIS encoding. This could cause problems with Japanese file names in MS-DOS and VFAT file systems mounted on a Linux system, as the POSIX locale system does not allow multiple locales to be active simultaneously. Some older Japanese systems which are not 8-bit clean might use the ja_JP.jis locale, with the basic seven-bit JIS encoding.
The following aspects of a system are affected by the locale. The first five are implemented in libc according to the POSIX standard. Others are implemented in X or ad hoc by application software. Those implemented by libc are controlled explicitly by the setlocale call, which normally will default to the contents of the environment variable LANG. The remainder are implemented as “advice” encoded in the LANG environment variable or other environment variables.
file system encoding
text file encoding
The display fonts are normally set by the application according to the encoding portion of the locale (after the period). Russian, Japanese and traditional Chinese all have multiple encodings. However, most fonts are provided in only one encoding, so applications must remap other encodings internally. If this remapping is not done properly, the display will be unintelligible “mojibake” (pronounced MOH-JEE-BAH-KAY), a Japanese word literally meaning “changed characters” but more fancifully translated “monster characters”. Compare the Japanese text encoded in EUC-JP in Figure 4 with the mishmash of punctuation and funny characters produced when the kterm is explicitly told to interpret the text as Shift-JIS in Figure 5. A similar effect is familiar to users of programs ported from DOS which draw text windows with the PC line-drawing characters. Since most Linux fonts are based on the ISO 8859-1 character set, you get “French windows” bordered in frilly accented characters rather than lines.
The file system encoding is also retrieved from the advice in the encoding portion of the LANG variable. Here, it is critical that the application be very defensively programmed. Carelessly accepting the advice may result in files with names which get corrupted when inserted in the file system or which cannot be accessed.
Text file encodings are similarly defaulted to the advice in the encoding portion of the LANG variable. However, in a networked environment, alien files will often be imported, e.g., using FTP, and there is no reason to suppose that these files will have the same encoding as the current locale. Applications should provide a means of specifying the encodings of files used; in locales where multiple encodings are available (today, Japanese, Russian and traditional Chinese, but soon with the popularization of Unicode and UTF-8, all locales), utilities for translating among compatible encodings should be provided.
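One way to build such a translation utility is the iconv interface specified by POSIX and provided by the GNU C library. The sketch below assumes that interface is available; the wrapper name convert_encoding is invented for this example, and the encoding names accepted depend on the converters installed on the system.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of text from one named encoding to another.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * conversion is unsupported, the input is invalid or incomplete in the
 * source encoding, or out is too small. */
size_t convert_encoding(const char *from, const char *to,
                        const char *in, size_t in_len,
                        char *out, size_t out_size)
{
    /* Note the argument order: target encoding first, then source. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv advances the pointer, not the data */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence; this matters for stateful
         * encodings such as seven-bit JIS. */
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);
    }
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - dst_left;
}
```

A call such as convert_encoding("EUC-JP", "SHIFT_JIS", ...) would translate between the two Japanese encodings discussed earlier, provided both converters are installed. An application should still let the user state the source encoding explicitly rather than trusting the locale.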
Finally, a default input method can often be guessed from the LANG variable. In current systems, it will often be bundled with the keyboard mapping or console driver. X provides a more flexible system in which the XMODIFIERS environment variable is consulted to learn the user's preferred input method for each locale.
Next month, I will look at the large body of internationalization standards which have evolved to handle these problems.