Alphabet Soup: The Internationalization of Linux, Part 1

Mr. Turnbull takes a look at the problems faced when different character sets and the need for standardization.
User Interfaces

The next layer is user interfaces, such as the Linux console and X. Here, the strong preference is for a primitive form of multilingualization, allowing arbitrary fonts to be displayed and text to be input in arbitrary encodings via configurable mappings of the keyboard. Both the Linux console and X provide these features, although the Linux console does not directly support languages with characters that cannot be encoded in one byte. They need not have more sophisticated mechanisms, because users rarely deal directly with them; application developers will build user-friendly interfaces on top of these toolkits. On the other hand, they should be as general as possible, so that the localizations can be as flexible as possible.


The next layer is applications, including system utilities. Here, things become much more complicated. Not only is it desirable that they issue messages and accept input in the user's native language, but must also handle non-trivial text manipulations like sorting. Of course, their entire purpose may be text manipulation, e.g., the text editor Emacs or the text formatter TeX.

One example of this complexity is that even where languages have characters in common, the sorting order is typically different. For example, Spanish and English share most of the Roman alphabet and both can be encoded in the ISO-8859-1 encoding. However, in English the names Canada, China and Czech Republic sort in that order, but in Spanish they sort as Canada, Czech Republic and China, because Spanish treats “ch” as a single entity, sorting after “c” but before “d”. Although Chinese, Japanese, Korean and to some extent Vietnamese share the ideographic characters that originated in China, they have very different ideas about how those characters are sorted.

Inter-system Communication

The outermost layer, beside the user-to-system interface, is inter-system communication. This layer has all the problems already mentioned, plus one more. Within a single system, specifying how to handle each language can be done implicitly; when a language is recognized, the appropriate version of some subsystem handles it. However, when communicating with another system, a mechanism for specifying formats must be present. Here, the MIME (Multipurpose Internet Multimedia Extensions) formats are crucial. Where possible, a means to negotiate the appropriate format for the communication should be provided as in HTTP, the hypertext transport protocol which is the foundation of the World Wide Web.

Components of Localization

Localizing an application means enabling it to display, receive input and modify text in the preferred language of the user. Since this is usually the user's native language, we will also write native language support (NLS) for localization.

Text Display

The most basic capability is text display. Merely discussing text display requires three concepts: character set, encoding and font. A language's character set is those characters used to form words, phrases and sentences. A character is a semantic unit of the language and the concept of “character” is quite abstract. Computers cannot deal directly with characters; they must be encoded as bit strings. These bit strings are usually 8 bits wide; strings of 8 bits are called octets in the standards. “Byte” is not used because it is a machine-oriented concept; octets may refer to objects transmitted over a serial line, and there is no need for the hosts at either end to have facilities for handling 8-bit bytes directly. Most Linux users are familiar with the hexadecimal numbering system and the ASCII table, so I will use a two-hex-digit representation of octets. For example, the “Latin capital letter A” will be encoded as 0x41.

Human readers do not normally have serial interfaces for electronic input of bit strings; instead, they prefer to read a visual representation. A font is an indexed set (not necessarily an array, because there may be gaps in sequences of the legal indices) of glyphs (character shapes or images) which can be displayed on a printed page or video monitor. The glyphs in a font need not be in a one-to-one correspondence with characters and do not necessarily have semantic meaning in the native language. For example, consider the word “fine”. As represented in memory, it will consist of the string of bytes “Ox66 Ox69 Ox6E Ox65”. Represented as a C array of characters, it would be “fine”, but as displayed after formatting by TeX in the PostScript Times-Roman font, it would consist of three glyphs, “fi”, “n” and “e” as shown in Figure 2.

Figure 2. Times-Roman Font Display of Word fine

Conversely, in some representations of the Spanish small letter enye (ñ), the base character and the tilde are encoded separately. This is unnecessary for Spanish if ISO-8859-1 or Unicode is used, but the facility is provided in Unicode. It is frequently useful in mathematics, where arbitrary meanings may be assigned to typographical accents. An example of a font which does not have semantic meaning in any human's native language is the standard X cursor font (see Figure 3).

Figure 3. X Cursor Font

An encoding is a mapping from each abstract character or glyph to one or more octets. For the common encodings of character sets and fonts for Western languages, only one octet is used. However, Asian languages have repertoires of thousands of ideographic characters; normally, two octets are used per character and two per glyph. Two formats are used for such large encodings. The first is the wide character format, in which each character is represented by the same number of octets. Examples are the pure Japanese JIS encoding and Unicode, which use two octets per character, and the ISO-10646 UCS-4 encoding (a planned superset of Unicode) which uses four octets per character. This encoding is the index mapping for a character set or font.

Another format is the multibyte character in which different characters may be represented by different numbers of octets. One example is the packed Extended UNIX Code for Japanese (8-bit EUC-JP), in which ASCII characters are represented in one octet which does not have the eighth bit set, and Japanese characters are represented by two octets. These octets are the same as in the plain JIS encoding, except that the eighth bits are set (in pseudo-C code, euc = jis | 0x8080). By using this encoding, any 8-bit-clean compiler designed for ASCII can be used to compile programs which use Japanese in comments and strings. This option would not be available for wide-character formats. If programs were written in pure JIS, the compiler would have to be rewritten to accept JIS ASCII characters. The ASCII character set is a subset of the JIS character set, but instead of being assigned the range 0x00 to 0x7F, the letters and digits are assigned values given by the ASCII value + 0x2300, and punctuation is scattered with no such simple translation. Another common place to encounter multibyte characters is in transformation formats, specifically the file-system-safe transformation of Unicode, UTF-8. Like EUC-JP, UTF-8 encodes the ASCII characters as single bytes in their standard positions.

Multibyte formats do not interfere with handling of text where the program does not care about the content (operations such as concatenation, file I/O and character-by-character display), but will work poorly or inefficiently where the content is important and addressing a specific character position is necessary (operations such as string comparison, the basis of sorting and searching). They may be especially useful for backward compatibility with systems designed for and implemented under the constraints of ASCII, e.g., compilers. They may be more space-efficient if the single-octet characters are relatively frequent in the text.

Wide-character formats are best where addressing specific character positions is important. They cannot be backward compatible with systems designed for single-octet encodings, although with appropriate choice of encoding, e.g., Unicode, little effort beyond recompiling with the type of characters extended to the size of wide characters may be necessary. Unfortunately, existing standards for languages like C do not specify the size of a wide character, only that it is at least one byte. However, the most recently designed languages often specify Unicode as the internal encoding of characters, and most system libraries specify a wide-character type of two bytes, which is equivalent to two octets.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Re: Alphabet Soup: The Internationalization of Linux, Part 1

Anonymous's picture