Alphabet Soup: The Internationalization of Linux, Part 2

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.
Unicode and the ISO-10646 Universal Character Sets

The next step is to unify all of the various character sets. Of course, the national standards have two main advantages. They are space-efficient, encoding the characters needed for daily use and computer programming in one byte, and they are time-efficient, since they can be arranged in the natural collating order. The second advantage has already been conceded by the majority of European languages when using encodings in the ISO 8859 family. Indeed, ISO 8859-1 has been an enormous success since it effectively unifies all the major Western European and American languages in a single multilingual encoding. With system library support for sorting (the LC_COLLATE portion of POSIX locales), it is hard to justify using anything else where it will serve.
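The trade-off can be seen in a few lines of Python. This is only a sketch with made-up sample words: it sorts strings by their raw ISO 8859-1 byte values, which is what a program gets "for free" from the encoding, and shows that the result is not dictionary order. A locale-aware sort would use the system's LC_COLLATE data (Python exposes it through `locale.strxfrm`); that call is left out here because it depends on which locales are installed.

```python
# Illustrative sample words; "\u00e9cole" is the French word "école".
words = ["zebra", "\u00e9cole", "eagle"]


def latin1_bytes(word):
    """Return the ISO 8859-1 encoding of a word: one byte per character."""
    return word.encode("iso-8859-1")


# Sorting by raw byte value puts the accented word last, because
# e-acute is byte 0xE9, which is greater than the byte for "z" (0x7A).
byte_order = sorted(words, key=latin1_bytes)

for word in byte_order:
    print(word, latin1_bytes(word).hex())
```

The accented word lands after "zebra", which is why the collating order has to come from the library rather than from the code values themselves.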

In this context, it was natural to try to extend the success of ISO 8859 by abandoning the efficiency of one-byte encodings in favor of a single comprehensive encoding for all characters used by all the world's languages. Two complementary efforts, proceeding in parallel, were conducted by a commercial consortium and the ISO. Unsurprisingly, the ISO's working group gave its effort the ponderous name Universal Multiple-Octet Coded Character Set (abbreviated UCS), while the commercial consortium adopted the sprightly “Unicode”. Also unsurprisingly, the Unicode Consortium (driven by the commercial advantages of a uniform two-byte encoding) was able to formulate a standard unifying nearly all of the world's scripts in a single two-byte encoding by 1991. It also codified a dictionary of properties for each character, guiding such usages as ligatures and bidirectional text. The ISO, by contrast, ended up defining both a two-byte version (UCS-2) and a four-byte (31-bit) version (UCS-4) of the UCS in 1993, without the additional properties. Also in 1993, the Unicode and UCS-2 character sets and encodings were unified, although each standard retains unique features.

Why separate efforts? Surely 65536 different characters are enough for anyone. Who needs two billion characters?

The reason for separate efforts is easy enough to explain. The Unicode effort was driven by the commercial advantages of a single encoding. Much effort has been expended in the standardization of Internet protocols, first working around the problems caused by “8-bit-dirty” Internet software, then in adding support for Asian languages, and finally in creating protocols for negotiating character sets. It would be nice if all that effort and the necessary implementation inefficiencies could be avoided by having one standard encoding. As we will see, it is not that easy, but standardizing on Unicode could result in large cost savings in development time, processor time and protocol overhead.

On the other hand, the ISO group was primarily concerned that a truly universal framework be created so as to avoid the need for yet another “universal” standardization effort in the future. It worried more about generality and eschewed standardizing poorly-understood areas, such as treatment of bidirectional text. In fact, UCS-4 currently contains only those characters defined by the Unicode standard, adopted en masse as the Basic Multilingual Plane of UCS-4 and equivalent to UCS-2.

The reason for their concern is that it is already painfully obvious that 65536 characters are not enough for some purposes. Although over 18,000 unassigned code positions remain in Unicode, classical scholars of hieroglyphics or Chinese could rapidly fill these positions with ideographs. The current set of “unified Han” (Chinese ideographs used in Chinese, Japanese, Korean and Vietnamese) was reduced to 20,902 only through a highly contentious unification process, suggesting that some of the controversial characters might eventually be given code points of their own. Archaic Hangul (composed Korean syllables) would add thousands more. Unicode also explicitly excludes standardized graphic notations such as those used in music, dance and electronics. It is clear that a truly universal character set will easily exceed the limit of 65536 imposed by a two-octet encoding.
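The limit is simply 2 to the 16th power. As a rough sketch, the Python below packs code values into big-endian two-octet and four-octet cells using the standard `struct` module. The helper names (`pack_ucs2`, `pack_ucs4`, `fits_in_two_octets`) are invented for this illustration, and the value 0x10000 is used only as the first number that no longer fits in two octets, not as a claim about any assigned character.

```python
import struct

TWO_OCTET_LIMIT = 2 ** 16  # 65536 distinct values in two octets


def pack_ucs2(code):
    """Pack a code value into two octets (big-endian); hypothetical helper."""
    return struct.pack(">H", code)


def pack_ucs4(code):
    """Pack a code value into four octets (big-endian); hypothetical helper."""
    return struct.pack(">I", code)


def fits_in_two_octets(code):
    """Report whether a code value can be stored in a two-octet cell."""
    try:
        pack_ucs2(code)
        return True
    except struct.error:
        return False


for code in (0x4E00, 0xFFFF, 0x10000):
    print(hex(code), fits_in_two_octets(code), pack_ucs4(code).hex())
```

Any character numbered past 0xFFFF needs the wider cell, which is the case the ISO group wanted to allow for from the start.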

Why does ISO 10646 specify a 31-bit encoding? Current hardware is byte-oriented, but there is no particular reason to stop at 24 bits, since only certain video hardware can efficiently use three-byte words of memory. The word size most efficiently accessed by most current hardware is four bytes. With potentially billions of characters, it was considered wise to reserve a bit in each character for arbitrary internal processing purposes; however, this bit must be cleared before passing the character on to an entity expecting a UCS character.
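A minimal sketch of how an application might use the spare bit follows. The names (`INTERNAL_FLAG`, `mark`, `is_marked`, `to_ucs4`) and the idea of using the bit as a generic "marked" flag are assumptions made for illustration; the standard only requires that the bit be clear in anything handed on as a UCS character.

```python
UCS4_MASK = 0x7FFFFFFF      # the 31 bits that carry the character code
INTERNAL_FLAG = 0x80000000  # the high bit, free for internal processing


def mark(code):
    """Set the internal flag on a 32-bit cell holding a UCS-4 code."""
    return code | INTERNAL_FLAG


def is_marked(cell):
    """Check whether the internal flag is set on a cell."""
    return bool(cell & INTERNAL_FLAG)


def to_ucs4(cell):
    """Clear the internal flag before passing the character on."""
    return cell & UCS4_MASK


cell = mark(0x4E00)
print(hex(cell), is_marked(cell), hex(to_ucs4(cell)))
```

Clearing the flag is a single mask operation, so the cost of carrying the extra bit internally is negligible.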

Similarly, large contiguous private spaces have been reserved containing 1/8 of the three-octet codes, i.e., those with the high octet 0, and 1/4 of the four-octet codes. This means that an application can embed entire national standard character sets in this space in a natural way (in particular, preserving their orderings) if desired, without any possibility of conflict with the standard, current or any future extensions. ISO 10646 does not necessarily recommend such techniques, but certainly permits them. This still leaves over 1.5 billion code points reserved for future standardization; it seems certain most will remain reserved but unused for a good long time.
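The arithmetic behind these figures can be checked quickly. The sketch below takes the fractions quoted above at face value and makes two simplifying assumptions: the two private regions do not overlap, and the whole two-octet plane counts as already standardized.

```python
FOUR_OCTET_CODES = 2 ** 31   # 31-bit code space
THREE_OCTET_CODES = 2 ** 24  # codes whose high octet is 0
TWO_OCTET_CODES = 2 ** 16    # the two-octet plane (assumed fully used)

# Private-use regions, using the fractions given in the text.
private_four = FOUR_OCTET_CODES // 4
private_three = THREE_OCTET_CODES // 8

# What remains for future standardization under the assumptions above.
reserved = FOUR_OCTET_CODES - private_four - private_three - TWO_OCTET_CODES

print(private_four)   # 536870912
print(private_three)  # 2097152
print(reserved)       # 1608450048
```

Under these assumptions roughly 1.6 billion code points remain, consistent with the "over 1.5 billion" figure.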

However, it seems unlikely that Unicode, let alone UCS-4, will soon have the success enjoyed by ISO 8859-1. First, the Oriental languages' digital character set standards are not yet satisfactory, in part because the languages are not fully standardized. Standardization efforts for all the Han character languages remain active. If the Japanese, for example, have not yet settled on a national character set, how can they be satisfied with the unified Han characters of Unicode? A recent tract entitled “Japanese is in Danger!” claims that Unicode will be the death of the Japanese language, and many computer-literate Japanese show varying degrees of sympathy with its arguments.

Second, in multilingual texts it may be desirable to search for some specifically Chinese character (as opposed to its Korean or Japanese cognates). In Unicode, this requires maintaining substantial amounts of surrounding context, such as markup tags indicating language, and so is impossible by definition in Unicode-encoded plain text. Although you could point to similar difficulties with ISO 8859-1 text, it is not the same. A Chinese character is a semantic unit with specific meaning, unlike an alphabetic character. In fact, the Han unification process normally ignores semantics. Thus, it confounds a Japanese character with a Chinese character of the same shape but a different meaning. ISO 8859-1 characters, on the other hand, are rarely searched for in isolation; when they are, they have no semantic content.
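The search problem can be sketched in Python. The sample document, the language tags and the helper names (`find_plain`, `find_tagged`) are all invented for illustration; the only fact relied on is that the same unified code point (here U+76F4) appears in both the Chinese and the Japanese word.

```python
HAN = "\u76f4"  # one unified Han code point shared by Chinese and Japanese

# Hypothetical tagged document: (language, text) segments.
document = [
    ("zh", "\u6b63\u76f4"),  # a Chinese word containing U+76F4
    ("ja", "\u76f4\u7dda"),  # a Japanese word containing U+76F4
    ("en", "straight"),
]

# The same content as plain text: the language tags are gone.
plain_text = "".join(text for _, text in document)


def find_plain(text, char):
    """Return every offset of char; plain text cannot tell languages apart."""
    return [i for i, c in enumerate(text) if c == char]


def find_tagged(segments, char, lang):
    """Return (segment index, offset) pairs, restricted to one language."""
    return [
        (seg_index, offset)
        for seg_index, (seg_lang, text) in enumerate(segments)
        if seg_lang == lang
        for offset, c in enumerate(text)
        if c == char
    ]


print(find_plain(plain_text, HAN))       # both occurrences
print(find_tagged(document, HAN, "zh"))  # only the Chinese one
```

The plain-text search returns both occurrences and has no way to prefer one; only the tagged form can answer the "specifically Chinese" query.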

Third, Asians are simply not yet as multilingual across the Asian languages as Europeans are across European ones, although this is changing rapidly. Still, it is unlikely that we will ever see an “Asian Switzerland” with Chinese, Japanese and Vietnamese simultaneously in use as official languages. Thus, the advantage of Unicode over national standards is not so great.

Fourth, from the Western European point of view, most of the gains from a single character set supporting multilingual processing have already been achieved by ISO 8859-1. Western Europeans have little need for Unicode.

In the near future, Unicode will be most useful to computer and operating system vendors, including Linux. Supporting Unicode as the basic internal code set provides an unambiguous way to avoid linguistic confusion. Adding new languages will simply be a matter of providing fonts, a Unicode-to-font-encoding mapping table and translating the messages. No additional programming effort will be necessary, and backwards compatibility is guaranteed. This is not trivial. An example is given below of a kernel patch used to make directory listings of Japanese Windows file systems mounted with either the MS-DOS or VFAT file systems readable. This kernel patch is certainly never going to be integrated into the kernel source code, because it is impossible to ensure it won't mess up non-Japanese names.
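To make the idea of a Unicode-to-font-encoding mapping table concrete, here is a small sketch. It assumes a hypothetical Greek font whose glyph positions follow ISO 8859-7, and it borrows Python's `iso8859_7` codec to generate the table rather than loading a vendor-supplied one; the function names are invented for the example.

```python
def build_font_table(font_codec):
    """Map Unicode code points to positions in a 256-glyph font.

    Assumes the font is laid out according to the named one-byte encoding.
    """
    table = {}
    for position in range(256):
        try:
            char = bytes([position]).decode(font_codec)
        except UnicodeDecodeError:
            continue  # this position is undefined in the encoding
        table[ord(char)] = position
    return table


def to_font_positions(text, table):
    """Translate internally stored Unicode text into font glyph positions."""
    return [table[ord(char)] for char in text]


greek_font = build_font_table("iso8859_7")

# Greek small alpha and beta, stored internally as Unicode.
print([hex(p) for p in to_font_positions("\u03b1\u03b2", greek_font)])
```

Supporting another script in this scheme means supplying a different table and font; the code that walks the text does not change.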