Alphabet Soup: The Internationalization of Linux, Part 1
The foundation is the operating system kernel. The main condition imposed on the functionality of the kernel is that it must treat natural language text data as binary data. Data passed between processes should not be altered gratuitously in any way. For many kernel functions (e.g., scheduling and memory management), this is already a requirement to which most systems conform.
Historically, computer text communications used 7-bit bytes and the ASCII character set and encoding. Often, communications software would use the high bit as a “parity bit”, which facilitates error detection and correction. Some terminal drivers would use it as a “bucky-bit” or to indicate face changes, such as underlining or reverse video. Although none of these usages are bad per se, they should be restricted to well-defined and well-documented interfaces, with 8-bit-clean alternatives provided.
Unfortunately, the seven-significant-bits assumption leaked into a lot of software, in particular implementations of electronic mail, and has now been enshrined in Internet standards such as RFC-821 for the Simple Mail Transfer Protocol (SMTP) and RFC-822, which describes the format of Internet message headers for text messages. Software that transparently passes all data and does not perform unrequested transformations is called “8-bit clean”. Note that endianness is now an issue in text processing, since character sets such as Chinese, Japanese, Korean and the Unicode character set use at least two-byte integers to represent some or all characters. Henceforward, any mention of 8-bit cleanness should be taken to imply correct handling of endianness issues as well.
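To see why endianness matters for text, consider the same character serialized under the two byte orders of UTF-16. This is a minimal illustrative sketch, not part of any mail or kernel interface:

```python
# The single character U+0041 ('A') produces different byte sequences
# under little-endian and big-endian UTF-16. Software that merely
# passes text through must preserve the byte order it received,
# never "normalizing" it.
ch = "A"

le = ch.encode("utf-16-le")  # low-order byte first
be = ch.encode("utf-16-be")  # high-order byte first

print(le)  # b'A\x00'
print(be)  # b'\x00A'
```

A program that swapped or dropped bytes here would corrupt any two-byte-per-character text passing through it, which is exactly the transformation 8-bit cleanness forbids.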
One might also want kernel log and error messages to be localized. Error messages are rather stylized; the same quality that makes them easy to learn also makes them easy to translate. The main objection to doing this is efficiency. Compiling all known languages into the kernel would make it huge; localizing the kernel would mean maintaining separate builds for each language. A more subtle problem is the temptation to avoid new features that would require translating more messages, or to do a hasty, inaccurate translation. Finally, separate message catalogs would need to be loaded at some point, introducing more possibilities for bugs. Even worse, a system crash before the catalog is loaded would mean no error messages at all.
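The message-catalog trade-off can be sketched in a few lines. This is a hypothetical illustration (the names `BUILTIN`, `load_catalog` and `message` are inventions for this sketch, not a real kernel or gettext API): the built-in English strings serve as a fallback, so a crash before the catalog loads still produces a message.

```python
# Built-in English messages compiled into the program itself.
BUILTIN = {"E_NOMEM": "out of memory", "E_IO": "I/O error"}

# Per-language catalog, empty until loaded at run time.
catalog = {}

def load_catalog(lang_catalog):
    catalog.update(lang_catalog)

def message(msgid):
    # Fall back to the built-in English text when the catalog
    # has not been loaded or lacks this entry.
    return catalog.get(msgid, BUILTIN[msgid])

print(message("E_NOMEM"))   # before loading: English fallback
load_catalog({"E_NOMEM": "Speicher erschöpft"})
print(message("E_NOMEM"))   # after loading: localized text
```

The fallback sidesteps the worst failure mode described above, at the cost of keeping one full set of English strings resident.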
The next layer is file systems, which are normally considered part of the kernel; in cases like Linux's /proc file system, they are inseparable from it. File systems are a special case because the objects they contain inherently need to be presented to users and input by them. Hence, it is not enough for the programs implementing file systems to be 8-bit clean, since the directory separator, /, has special meaning to the operating system. This means that trying to use Japanese file names encoded according to the JIS-X-0208 standard may not work, because some of those two-byte characters contain the ASCII value 0x2F (the code for /) as a second byte. If the file system treats path names as simply a string of bytes, such characters will be corrupted when passed to the file system interface in function arguments.
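The collision is easy to demonstrate. As a sketch (using Python's standard iso2022_jp codec purely for illustration), we can take the JIS X 0208 character at code 0x302F, whose second byte in the 7-bit encoding is 0x2F, the ASCII /:

```python
# Build the ISO-2022-JP byte stream for the JIS X 0208 character at
# code 0x302F: ESC $ B switches to JIS X 0208, ESC ( B switches back.
raw = b"\x1b$B" + bytes([0x30, 0x2F]) + b"\x1b(B"
ch = raw.decode("iso2022_jp")   # a single kanji

encoded = ch.encode("iso2022_jp")
print(ch, encoded)
# A byte-oriented path parser scanning this sees 0x2F and treats it
# as a directory separator, splitting the character in two.
print(0x2F in encoded)  # True
```

Any interface that scans such a file name byte-by-byte for / will mis-parse it, which is precisely the corruption described above.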
One can imagine various solutions to this problem, such as redesigning file system calls to be aware of various encodings, using a universal encoding such as Unicode, or removing the operating system's dependence on such special characters by defining a path data type as a list of strings. However, these solutions would preclude backward compatibility and compatibility with alien file systems, and would be something of a burden to programmers. Fortunately, a fairly satisfactory solution is to use a transformed encoding that sets the eighth bit on non-ASCII characters. The major Asian languages have such encodings (EUC), as does Unicode (UTF-FSS, officially UTF-8). By definition, the ISO-8859 family of encodings, all of which contain U.S. ASCII as a subset, satisfies this constraint without need for transformation. Using a transformation format is rarely a burden on users, as they generally need not be aware of which encoding format their language is using.

However, this difficulty can still apply to mounting alien file systems, especially MS-DOS and VFAT file systems, where Microsoft has implemented idiosyncratic encodings such as Shift-JIS for their localized extensions of DOS and MS Windows. A partial directory listing of a Japanized Windows 95 file system mounted as an MS-DOS file system on my Linux system is shown in Figure 1, a Dired buffer in Mule. Note the error messages at the top of the buffer: ls is unable to find a file name it just received from the file system. When the directory listing is made, one pass generates a list of file names, which includes Shift-JIS-encoded Japanese names. When the list is passed back to the file system to get the file properties, some of the octets of the Japanese characters are interpreted as file system separators. Thus, the Japanese character is not found and the error message results. The VFAT file system does not exhibit this problem in this case; I am not sure why. Most characters are passed unscathed, as you can see.
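The transformation-format property is easy to check. In EUC-JP and UTF-8, every byte of a multibyte character has the eighth bit set, so no byte can collide with the ASCII separator 0x2F; a quick sketch:

```python
# Every byte of a non-ASCII character in EUC-JP or UTF-8 is >= 0x80,
# so none of them can be mistaken for '/' (0x2F) or any other
# syntactically significant ASCII character.
ch = "日"  # any non-ASCII character will do

for codec in ("euc_jp", "utf-8"):
    encoded = ch.encode(codec)
    assert all(b >= 0x80 for b in encoded)
    print(codec, encoded.hex())
```

This is exactly why a byte-oriented kernel path parser can handle EUC- or UTF-8-encoded file names untouched, while raw JIS or Shift-JIS names trip it up.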
This principle applies generally to other system processes (init, network daemons, loggers and so on). As long as the programs implementing them are 8-bit clean, use of non-ASCII characters in comments and strings in configuration files and the like should be fairly transparent, as long as file-system-safe transformation formats of standard encodings are used. This is true because keywords and syntactically significant characters have historically been drawn from the U.S. ASCII character set and, in particular, U.S. English. This is unlikely to change because the historic dominance of the U.S. in computer systems manufacturing and distribution means that most programming and scripting languages are English-based and use the ASCII character set. An interesting exception is APL; because of its IBM heritage, it is based on EBCDIC, which contains many symbols not present in ASCII.
The major standards for programming languages, operating systems (POSIX), user interfaces (X) and inter-system communication (RFC 1123, MIME and ISO 2022) specify portable character sets which are subsets of ASCII. Protocol keywords are defined to be strings from the portable character set, and where a default initial encoding is specified, it is U.S. ASCII or its superset, ISO-8859-1. Note that in most cases a character set is specified, for example, in C and X. IBM mainframes must support a portable character set which is a subset of ASCII, but those characters will be encoded in EBCDIC.
The only exception is ISO 2022, which specifies neither a portable character set nor ASCII as a default; even there, however, the influence of ASCII is extremely strong. The 256 possible bytes are divided into “left” (eighth bit zero) and “right” (eighth bit one) halves of 128 code points each, and within each half, the first 32 code points (0x00 to 0x1F in the left half, 0x80 to 0x9F in the right) are reserved for control characters while the remaining 96 may be printable characters. Further, positions 0x20 and 0x7F have reserved interpretations as the space and delete characters respectively and may not be used for graphic characters, while 0xA0 and 0xFF are often left unused.
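The layout just described can be summarized in a small classifier. This is a hypothetical helper written for this article (the function name and region labels CL/GL/CR/GR follow ISO 2022's conventional terminology):

```python
# Classify a byte into the four ISO 2022 regions: CL and CR are the
# control regions of the left and right halves; GL and GR are the
# graphic regions. 0x20 (space) and 0x7F (delete) are fixed code
# points within GL; 0xA0 and 0xFF are often left unused within GR.
def iso2022_region(byte):
    if byte <= 0x1F:
        return "CL"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "CR"
    return "GR"

print(iso2022_region(0x1B))  # CL  (ESC, a control character)
print(iso2022_region(0x41))  # GL  ('A', a graphic character)
print(iso2022_region(0x85))  # CR
print(iso2022_region(0xC0))  # GR
```

Note how the right half mirrors the ASCII-derived left half exactly, which is the "extremely strong influence" referred to above.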