Alphabet Soup: The Internationalization of Linux, Part 2

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.

Character Sets

It would be nice if we could think of character sets as corresponding directly to the scripts we use to write by hand, but unfortunately things are not so simple. For example, there are nearly 200 countries in the world, each with its own currency. Of course, many share the same symbol, but it is clear that if your keyboard had a key for every currency symbol, it would be about twice as big as the one you use today. Other useful symbols are the paragraph and sectioning marks used by lawyers and the various operators and non-Latin symbols used by mathematicians. Since new characters are being created all the time (for example, the symbol for the new European monetary unit, the euro), it is impossible to include them all. So in fact, a character set is someone's idea of a useful set of characters.

Representation as bit strings inside a computer imposes further constraints. Since modern computers all work in terms of bytes as the smallest efficient unit of access, there is a big difference in the space and processing requirements for text based on a 256-character set, which can be encoded in a single byte, and a 257-character set, which cannot. One might think that the extension to two bytes, or 65,536 characters, would be enough to satisfy anyone, but it turns out that even this is not enough. The process of selecting about 20,000 ideographic characters of Chinese origin occasioned many arguments while the Unicode character set was being designed. Even those 20,000 may not be enough: only a few people in the world care about any given excluded ideograph, but to them it may be the most important character in the world, because it is the one they use to write their name.
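The jump from 256 to 257 characters can be checked with a little arithmetic. The following is only a sketch, assuming a fixed-width encoding in which every character takes the same whole number of bytes; the function name `bytes_per_character` is made up for this illustration.

```python
def bytes_per_character(set_size: int) -> int:
    """Smallest whole number of bytes giving each character a distinct fixed-width code."""
    if set_size < 1:
        raise ValueError("a character set needs at least one character")
    # Codes run from 0 to set_size - 1, so count the bits of the largest code.
    bits = max(1, (set_size - 1).bit_length())
    # Round up to whole bytes, since the byte is the unit of access.
    return (bits + 7) // 8


for size in (128, 256, 257, 65536, 65537):
    print(size, bytes_per_character(size))
```

Under that assumption, one extra character beyond 256 doubles the storage for every character in the text, and one beyond 65,536 pushes it to three bytes.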

The result is that many character sets have been designed and populated, and standards have been written to codify their use.


The most influential standard of all is the American Standard Code for Information Interchange, abbreviated ASCII. It assigns each of the 128 possible 7-bit strings to either a character commonly used in American English or a control function. Many of the control functions are not used today, but so much software has been written on the assumption that hex values 0x00 to 0x1F are not printing characters that no one seriously considers assigning printing characters to any of those code points.
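The assumption that software makes about the low code points is easy to state in code. This is a minimal sketch rather than anyone's actual implementation; the helper names are invented here, and treating DEL (0x7F) as a control code is an addition to the 0x00 to 0x1F range mentioned above.

```python
def is_ascii(code: int) -> bool:
    """True if the value fits in 7 bits, i.e. is one of the 128 ASCII codes."""
    return 0 <= code <= 0x7F


def is_control(code: int) -> bool:
    """True for the control range 0x00-0x1F, plus DEL (0x7F), which is also a control code."""
    return is_ascii(code) and (code <= 0x1F or code == 0x7F)


control_count = sum(1 for code in range(0x80) if is_control(code))
print(control_count)               # number of control codes among the 128
print(is_control(0x0A))            # line feed
print(is_control(ord("A")))        # an ordinary letter
```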

Because nearly all existing computer languages are compatible with the ASCII character set, ASCII in some form is a subset of most electronic character sets. However, there are many variants. For example, the JIS Roman character set used in most Japanese computers is almost identical to ASCII, except that a couple of the glyphs are changed and the Japanese yen symbol is substituted for the backslash. In order to codify this development, ASCII-like character sets are defined by the International Organization for Standardization (ISO) in standard ISO 646. U.S. ASCII is designated the international reference version for ISO 646 and is occasionally referred to as ISO 646-IRV (for example, in naming fonts for the X Window System).
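A national variant of this kind can be pictured as ASCII plus a short table of overrides. The sketch below is illustrative only: `decode_7bit` and `JIS_ROMAN_OVERRIDES` are hypothetical names, and the table lists just the two positions where JIS Roman is assumed to differ (yen sign at 0x5C, overline at 0x7E).

```python
# Hypothetical override table: code points where JIS Roman differs from ASCII.
JIS_ROMAN_OVERRIDES = {
    0x5C: "\u00a5",  # yen sign in place of the backslash
    0x7E: "\u203e",  # overline in place of the tilde
}


def decode_7bit(data: bytes, overrides=None) -> str:
    """Decode 7-bit data as ASCII, applying a variant's overrides if given."""
    overrides = overrides or {}
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"not a 7-bit code: {byte:#04x}")
        chars.append(overrides.get(byte, chr(byte)))
    return "".join(chars)


path = b"C:\\TEMP"
print(decode_7bit(path))                       # read as U.S. ASCII
print(decode_7bit(path, JIS_ROMAN_OVERRIDES))  # same bytes read as JIS Roman
```

The bytes are identical in both calls; only the interpretation of 0x5C changes, which is why a DOS-style path shows yen signs on a Japanese system.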

The ISO 8859 Family of Character Sets

ASCII is simply not sufficient for use in an internationalized environment. For example, most European languages use accented characters. Certainly, it is possible to represent “Latin small letter a with acute accent” (á) as a two-character sequence (e.g., 'a), but this is inconvenient for sorting and possibly ambiguous. Furthermore, it is not obvious how to represent the caron using only ASCII characters. In order to maintain compatibility with ASCII for the sake of existing software, and accommodate many of the countries most intensively using computers, the ISO 8859 standard was designed. ISO 8859 had three main goals: maintain ASCII compatibility, fit within the constraints of ISO 2022 and provide the broadest coverage of languages within a single-octet encoding. Unfortunately, these three goals are not compatible. Several important scripts (Greek, Russian, Hebrew and Arabic) can each be encoded in a single octet, but each requires several dozen code points of its own, because their characters overlap neither with ASCII nor with each other.
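Both halves of the problem can be seen with the codecs that ship with Python, used here purely as a convenient test bench. The helper `fits` is invented for this sketch; the codec names `ascii`, `iso8859_1` and `iso8859_2` are Python's names for ASCII and for ISO 8859 parts 1 and 2.

```python
def fits(char: str, codec: str) -> bool:
    """True if the character has a code point in the named character set."""
    try:
        char.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


a_acute = "\u00e1"  # Latin small letter a with acute accent
c_caron = "\u010d"  # Latin small letter c with caron

print(fits(a_acute, "ascii"))        # False: ASCII has no accented letters
print(a_acute.encode("iso8859_1"))   # a single octet in ISO 8859-1
print(fits(c_caron, "iso8859_1"))    # False: part 1 has no c with caron
print(c_caron.encode("iso8859_2"))   # a single octet in ISO 8859-2
```

No single 8-bit set holds both letters' neighbours along with Greek, Russian, Hebrew and Arabic, which is the incompatibility described above.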

The solution arrived at was not to define a single character set, but rather a family of character sets. Each ISO 8859 character set contains ASCII (ISO 646-IRV) as a subset, and the encoding is defined so that, interpreted as integers (C chars), the ASCII characters are encoded identically in ASCII and ISO 8859. Then a list of supplementary character sets, each including at most 96 characters to conform to ISO 2022 (described below), was defined. These supplementary characters are then assigned to the code points 0xA0 to 0xFF. Where the supplementary set is derived from an alphabet, the natural collating order is followed, but for the collections of accented characters the order is necessarily arbitrary. The current supplementary character sets are listed in Table 1.
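The layout of the family can be checked with a short experiment. This is a sketch using Python's bundled codecs for four parts of ISO 8859 (`iso8859_1`, `iso8859_2`, `iso8859_5`, `iso8859_7`); the names `PARTS`, `SUPPLEMENTARY_RANGE` and `decode_high_byte` are chosen here for illustration.

```python
PARTS = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")

# The supplementary characters occupy 0xA0 through 0xFF inclusive.
SUPPLEMENTARY_RANGE = range(0xA0, 0x100)
print(len(SUPPLEMENTARY_RANGE))  # number of supplementary code points

# Every part decodes the 128 ASCII codes exactly as ASCII does.
ascii_bytes = bytes(range(0x80))
ascii_text = ascii_bytes.decode("ascii")
ascii_shared = all(ascii_bytes.decode(codec) == ascii_text for codec in PARTS)
print(ascii_shared)


def decode_high_byte(value: int, codec: str) -> str:
    """Interpret one octet according to the named ISO 8859 part."""
    return bytes([value]).decode(codec)


# The same octet names a different character in each supplementary set.
for codec in PARTS:
    print(codec, decode_high_byte(0xE1, codec))
```

The octet 0xE1 comes out as a Latin, Cyrillic or Greek letter depending on the part chosen, so text in an ISO 8859 encoding is only meaningful when the reader knows which member of the family was used.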

Table 1.