Alphabet Soup: The Internationalization of Linux, Part 2

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.
Internet Messaging

One of the earliest and most important applications for the Internet is messaging, either direct to recipients (electronic mail) or broadcast (Usenet newsgroups). From the internationalization point of view, these are basically the same; internationalization doesn't care about the transmission mechanism, only how the content is handled.

Because messaging was an early application, it assumes a rather restricted environment. In particular, it assumes the data stream is limited to 7-bit bit-strings, and one cannot even be sure that all ASCII characters will be transmitted without error. In particular, if a message originates in the UNIX world, is passed through BITNET, i.e., EBCDIC encoding and back to UNIX, some characters are likely to be corrupted. Of course, these days such corruption is unlikely, but when the standards were designed, it was commonplace. Now these restrictions are defined in standards and widely implemented in software, so they are likely to continue for the foreseeable future, even though the hardware and software for Internet transmission of data is extremely reliable.

The Internet mail transmission protocol (SMTP) is defined in RFC-821. The main provision of interest is that the transmission channel must transmit all 128 ASCII characters properly. 8-bit-clean channels are encouraged, but implicitly 7-bit characters are the norm. Internet messages are standardized in RFC-822 for electronic mail and RFC-1036 for Usenet. RFC-1036 adopts RFC-822 nearly in full, so I will refer to these three standards together as RFC-822.

RFC-822 is intended first of all to be compatible with RFC-821. The content of a message is divided into the part that is relevant to the mail transport system, the headers, and the part that is irrelevant to transporting the message, the body. RFC-822 allows users to send 8-bit content in the body at their own risk, but the headers must be in a 7-bit code, in particular, ASCII. This is rather annoying to non-English-speaking users. To permit non-English text in subject headers and in comments (particularly full names associated with addresses) and to provide reliable transport for non-ASCII body content, both non-English text and binary data of various kinds, the Multipurpose Internet Mail Extension suite of protocols was defined. Today, this standard occupies no less than five RFCs (2045-2049). We will be interested only in those parts related to internationalization.

MIME Transfer Encodings

The MIME transfer encodings are like the UCS transformation formats discussed above. They allow arbitrary content to be expressed in a way that will not choke the transmission channel or be damaged by it. MIME defines two transfer encodings, quoted printable and BASE64.

The quoted-printable encoding is very simple. Any octet may be represented by its hexadecimal code, preceded by an equals sign. So a space character is represented as =20 and the Spanish small enye (ñ) is =F1. The Latin capital letter A is =41. However, in general these are used only in three circumstances. First, since the equal sign is an escape character, it must be represented by =3D. Second, some software strips trailing whitespace, in particular on systems with record-oriented storage that do not use control characters to represent line breaks. A space or tab that ends a line will be encoded =20 or =09, respectively. This is important to the signature convention used on Usenet newsgroups. Finally, non-ASCII octets including most control characters will be encoded. Thus, the quoted-printable encoding is intended for applications, such as Western European languages, where most characters come from the basic Latin (i.e., ASCII) set. In fact, one quickly learns to accurately read quoted printable text without decoding it.

Note that this is a transfer encoding. It is a purely mechanical transformation and provides no information about the intended meaning of the character. Although ñ is one interpretation of =F1, there are many others including a different one for each of the ten ISO 8859 character sets. Quoted printable encoding provides no indication of which is intended.

The BASE64 encoding is intended to be a robust encoding for arbitrary binary data, including images and audio. However, it is also commonly used for languages like Japanese where interpreting each octet separately as an ASCII character is illegible without decoding. It is more efficient than quoted printable, using only 33% more space than the original text, where each quoted character uses three times as much space as the unencoded octet. BASE64 is similar to the famous uuencode format long used in UNIX for the same purpose, but the characters used for the encoding are limited to the 52 Latin letters, the 10 decimal digits, the plus sign and slash.

The equals sign is also used, as padding. The reason for this choice is that base 64 is a convenient radix for byte-oriented encoding, since four base-64 digits can encode 24 bits or 3 octets. The characters chosen are passed intact by all known systems, which is not true of some of the punctuation marks used in the uuencode algorithm. The encoding algorithm is obvious:

  1. Break up the data stream into groups of three octets. The last group may have one or two octets and will be treated specially.

  2. For each group of three, concatenate the octets into a 24-bit string, then break it into four 6-bit groups. Interpret each as a 6-bit binary integer and index into the table above. This results in a group of four base-64 digits. Add them to the output.

  3. If there is a remaining group, it has either one or two octets in it. Add one or two null octets to complete the group of three. Now treat it as in Step 2, except that if there was one octet in the group, add the first two base-64 digits to the output and pad the end with two equals signs to make a group of four. If there were two octets in the final group, add the first three base-64 digits to the output and pad with a final equals sign to make a group of four.

Notice that by using the equals sign it is always possible to exactly decode the original text; there will not even be a spurious null character at the end. Furthermore, the algorithm is very fast and space-efficient, given the restrictions.