Alphabet Soup: The Internationalization of Linux, Part 2

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.
National Standard and Private Character Sets

Besides the national standard character sets mentioned above, many others are still in common use. A few of the more important ones include Russian's KOI-8 (alternative to ISO 8859-5), ISCII for Indian languages written in the Devanagari script and VISCII for Vietnamese. Of course, U.S. ASCII is available.

Other important character sets are those defined by industry or individual firms. An important characteristic of these private character sets is that their encodings often do not conform to the ISO 2022 standard, making interchange among systems difficult. Microsoft is at the forefront, defining and registering myriads of Windows character sets. Most of these are ASCII derivatives and closely related to ISO 8859 encodings, so although the small differences are annoying to programmers, they are often insignificant to users. However, in the field of Asian languages the non-ISO-2022-compliant encodings called Shift JIS, an encoding for Japanese used by Microsoft and Apple operating systems, and Big 5, an encoding for traditional Chinese defined by a consortium of five large Taiwanese manufacturers, are important. In both cases, portions of the code space not used by international standard character sets were employed.

In the case of Shift JIS, the idea was to include the so-called half-width katakana, the 70 or so characters necessary to phonetically transcribe the Japanese language. Rarely used in normal text, they are somewhat convenient for file names in the DOS 8.3 format. This requires that they be encoded as a single octet. The Japanese standard JIS X 0201 encodes ASCII in its usual code points and places the katakana in the octets with values 0xA1 to 0xDF. Shift JIS is based on this standard and uses a simple algorithm to transform standard JIS kanji codes into two-octet codes with the first octet in the ranges 0x81 to 0x9F and 0xE0 to 0xEF, which are unused by JIS X 0201.

Character Set Extension: ISO 2022

Unicode and UCS-4 solve the problem of character representation permanently. However, as seen above, Unicode is not quite sufficient for all purposes and UCS-4 is much too wasteful for general use. Furthermore, an enormous amount of hardware and software is oriented toward one-octet character sets. Thus, the ISO 2022:1994 standard for code extension techniques (in particular, using several character sets in one data stream), the most recent edition of a standard first published as ECMA-35 by the European Computer Manufacturer's Association in 1971, remains relevant.

ISO 2022 is a rather abstract standard. A brief outline of its provisions follows.

  • Division of codes into 7-bit and 8-bit types; the 256 code points in the 8-bit table are divided into the left (L, 0x00 to 0x7F) and right (R, 0x80 to 0xFF) halves. 7-bit codes are considered to use only the left half.

  • Further division of the 128 points in each half into control (C, 0x00 to 0x1F) and graphic (G, 0x20 to 0x7F) codes.

  • Codes 0x1B (escape), 0x20 (space) and 0x7F (delete) in CL and GL are fixed. Codes 0xA0 and 0xFF in GR are often left unused.

  • Provisions are made for handling control characters similar to those for graphic characters described below, but these are uninteresting in a discussion of internationalization.

  • Graphic character sets must be encoded in a fixed number of bytes per character. Either all bytes of all characters are in the range 0x20 to 0x7F, or all bytes of all characters are in the range 0xA0 to 0xFF. A character set in which the bytes 0x20 and 0x7F or 0xA0 and 0xFF are never used is referred to as a 94n-character set, where n is the number of bytes. Otherwise, the character set is a 96n-character set.

  • An encoding may use up to four character sets simultaneously, denoted G0, G1, G2 and G3. G0 must be a 94n-character set; the other three may be 96n-character sets. The interpretation of a byte depends on the shift state. Any of G0, G1, G2 or G3 may be invoked into GL by the locking shift control codes LS0, LS1, LS2 and LS3 respectively. When the character set G0 is shifted into GL, then G0 is used to interpret bytes in the range 0x20 to 0x7F. Similarly, in 8-bit encodings, right locking shifts are used to invoke character sets G1, G2 or G3 into GR by the control codes LS1R, LS2R and LS3R. Then that character set is used to interpret bytes in the range 0xA0 to 0xFF.

  • A single character may be invoked from the G2 and G3 sets by use of the single shift control codes SS2 and SS3.

  • Escape sequences are provided for the purpose of designating new character sets into the G0, G1, G2 and G3 elements.

A given version of ISO 2022 need not provide all of the above shifting and designating facilities. ASCII, for example, provides none. To the extent that they are provided by a derivative standard, the control codes must take the values as shown in Table 2.

Table 2.

Three examples of codes which may be considered applications of the ISO 2022 standard are ASCII, ISO 8859-1 and EUC-JP. ASCII is the standard encoding for American English. It is a 7-bit code with the ASCII control codes designated to C0, the ASCII graphic characters designated to G0, and C1, G1, G2 and G3 not used. C0 is invoked in CL; G0 is invoked in GL. No shift functions are used.

ISO 8859-1 is an 8-bit code, with C0 left unspecified (but normally C0 has the ASCII control characters in it), the ASCII graphic characters are designated to G0 and the Latin-1 character set is designated to G1. C1, G2 and G3 are unused. C0 is invoked in CL and G0 is invoked in GL. No shift functions are used.

Packed-format EUC-JP is an 8-bit code, with C0 unspecified but normally using the ASCII control characters; the JIS X 0201 Roman version of ISO 646 designated to G0; the main Japanese character set JIS X 0208 containing several alphabets, punctuation, the Japanese kana syllabaries, some dingbats and about 6000 of the most common kanji (ideographs) designated to G1; the half-width katakana from JIS X 0201 designated to G2; and the JIS X 0212 set of about 8000 less common kanji designated to G3. C0 is invoked in CL, G0 is invoked in GL and G1 is invoked in GR. No locking shift functions are used. Half-width katakana and the JIS X 0212 kanji must be accessed using the single shifts SS2 and SS3 respectively, and they are shifted into GR.

Finally, ISO 2022 is commonly used in Internet mail and multilingual documents. The 7-bit version is used and every character set must be designated to G0 before use.

The single most important aspect of ISO 2022 is that code points in the range of ASCII control characters may not be used for graphic characters. This means that text files using encodings conforming to ISO 2022 will behave like text (with line breaks and not causing strange behaviour in your terminal or emulator) when displayed. If you do not have the fonts or your software does not understand the designation escape sequences, you will see gibberish, but at least your terminal will continue working.

A second useful fact is that in most cases ASCII or some version of ISO 646 is designated to G0. An encoding like EUC-JP with ASCII designated to G0 and invoked to GL and all of the other character sets invoked to GR is “file system safe” in 8-bit clean systems. This is more or less the definition of the EUC variant of ISO 2022.

Some encodings which do not conform and thus often cause problems in software not specifically prepared for them are Shift JIS, Big 5, VISCII and KOI-8. Shift JIS in particular annoys me, because I dual-boot Linux and Windows 95 for Japanese OCR, conversion of Microsoft Word documents to plaintext and FreeCell, and directory listings with Japanese in them are invariably messed up. Fortunately, I find yet to have a reason to try to access a file named in Japanese. Kernel patches are available which help deal with this, but they are unofficial and will stay that way because they are inherently dangerous. That is, they work with Japanese most but not all of the time, and they will not handle non-Japanese 8-bit encodings correctly.