Alphabet Soup: The Internationalization of Linux, Part 2

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.
MIME-specific Headers

A message conforming to the MIME standard must have a version header of the form

MIME-Version: 1.0

Some mailers are sufficiently picky as to refuse to do MIME processing on mail lacking a valid MIME-Version header. This would be amusing, except for the fact that many mailers either do not implement the MIME functions correctly, produce an illegal MIME-Version header, or fail to insert the MIME-Version header at all.

The only version of the MIME header formats is 1.0. The MIME standard has undergone several revisions and expansions, but the basic format has remained unchanged at version 1.0. These revisions and standards have added new values for some of the parameters and specified interpretations for some ambiguous areas, but the syntax is unchanged. Case is irrelevant, in both the header tags and the values. The style of capitalization used below is more or less conventional, but not required.

One way to protect the content, or at least check that it has not been truncated, is to provide a Content-Length header. This is allowed by the MIME standard. The general type of encoding of the body is stated in the content-transfer-encoding header. The default is

Content-Transfer-Encoding: 7-bit

Other allowed values are “quoted-printable” and “base64” (both implicitly 7-bit) and “8-bit”.

Next, the content type of the body is specified. In most messages it will be plain text, specified as

Content-Type: text/plain

Other text types commonly found in mail these days are text/rich and text/html. A forwarded message (with no prefatory comments) may have content type specified as message/rfc-822. Messages can also be multipart. This is commonly used to add multimedia attachments, but can also be used to break up the body into components in different languages.

The MIME standard specifies that the character set is ASCII unless otherwise noted. RFC-822 requires that all headers be ASCII, so the MIME character set specification applies only to the body of the message. This specification is done using the charset parameter of the content type header. The default could be explicitly specified as

Content-Type: text/plain;charset=us-ascii

Note that the optional parameters are specified in keyword=value form. The correct way to specify ASCII is “us-ascii”, because that is the preferred form as registered with the IANA. A list of valid character sets for MIME is at Europeans will commonly use

Content-Transfer-Encoding: 8-bit
Content-Type: text/plain;charset=iso-8859-1

The Japanese standard for electronic messages is a version of ISO 2022 called ISO-2022-JP. In fact, this encoding needs to be extended only slightly. It can be used for Chinese and Korean as well and even as a multilingual encoding. The extended version is known as ISO-2022-JP-2 or ISO-2022-INT.

MIME-encoded Words: Non-ASCII Text in Headers

The MIME standard also provides a mechanism for putting non-ASCII text in headers. RFC-822 makes this illegal, so use of this mechanism will result in gibberish being displayed by mail programs that do not implement MIME. However, most mail programs today are MIME-aware, so this should not present any problems. If your correspondents complain, tell them to get a MIME-aware mailer.

The mechanism is simple. Non-ASCII text is encoded using either quoted-printable encoding or BASE64 encoding according to convenience, and bundled up into an encoded word. The reason it must be bundled into an encoded word is that the Content-Type header applies to the body, and if the body is multipart, there will be no charset parameter. Using a special header to control the format of headers seems silly, so the encoded word itself will contain the necessary character set information.

The format of an encoded word begins with the characters =?, continues with the name of the encoding used, the character ?, either the letter Q (for “quoted printable”) or the letter B (for BASE64), the character ? again, the encoded text, and finally the characters ?=. For example, the French word “voil<\#226>” is encoded =?ISO-8859-1?Q?V=F3il=E0?=. Incredibly inefficient, of course, but these will be used only a few times per message. Note that one extra restriction is put on quoted printable encoding, not present in the basic encoding: any question marks in the encoded text must be encoded. Otherwise, the sequence <question mark><encoded octet> would be interpreted as the end of the encoded word.