Alphabet Soup: The Internationalization of Linux, Part 2

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.
MIME-specific Headers

A message conforming to the MIME standard must have a version header of the form

MIME-Version: 1.0

Some mailers are sufficiently picky as to refuse to do MIME processing on mail lacking a valid MIME-Version header. This would be amusing, except for the fact that many mailers either do not implement the MIME functions correctly, produce an illegal MIME-Version header, or fail to insert the MIME-Version header at all.

The only version of the MIME header formats is 1.0. The MIME standard has undergone several revisions and expansions, but the basic format has remained unchanged at version 1.0. These revisions and standards have added new values for some of the parameters and specified interpretations for some ambiguous areas, but the syntax is unchanged. Case is irrelevant, in both the header tags and the values. The style of capitalization used below is more or less conventional, but not required.
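As a rough illustration of the case-insensitivity described above, the following sketch parses a deliberately odd-looking message with Python's standard email package. The sample message text is made up for this example; only the header names and values come from the discussion above.

```python
from email import message_from_string

# A made-up message whose header tags and values use unconventional case.
raw = (
    "mime-version: 1.0\n"
    "CONTENT-TYPE: TEXT/PLAIN; CHARSET=US-ASCII\n"
    "\n"
    "Hello\n"
)

msg = message_from_string(raw)

# Header lookup ignores the case of the tag.
version = msg["MIME-Version"]

# The helper methods normalize the values to lower case.
ctype = msg.get_content_type()
charset = msg.get_content_charset()

print(version, ctype, charset)
```

A MIME-aware mailer would be expected to treat this message exactly like one written with the conventional capitalization.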

One way to protect the content, or at least check that it has not been truncated, is to provide a Content-Length header. This is allowed by the MIME standard. The general type of encoding of the body is stated in the Content-Transfer-Encoding header. The default is

Content-Transfer-Encoding: 7bit

Other allowed values are “quoted-printable” and “base64” (both of which produce 7-bit output), “8bit” and “binary”. Note that the tokens “7bit” and “8bit” are written without a hyphen.

Next, the content type of the body is specified. In most messages it will be plain text, specified as

Content-Type: text/plain

Other text types commonly found in mail these days are text/richtext and text/html. A forwarded message (with no prefatory comments) may have its content type specified as message/rfc822. Messages can also be multipart. This is commonly used to add multimedia attachments, but can also be used to break up the body into components in different languages.

The MIME standard specifies that the character set is ASCII unless otherwise noted. RFC-822 requires that all headers be ASCII, so the MIME character set specification applies only to the body of the message. The character set is specified using the charset parameter of the Content-Type header. The default could be explicitly specified as

Content-Type: text/plain;charset=us-ascii

Note that the optional parameters are specified in keyword=value form. The correct way to specify ASCII is “us-ascii”, because that is the preferred form as registered with the IANA, which also maintains the list of valid character sets for MIME. Europeans will commonly use

Content-Transfer-Encoding: 8bit
Content-Type: text/plain;charset=iso-8859-1
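To make the interplay of these headers concrete, here is a minimal sketch using Python's standard email package. It reads the charset parameter and uses it to decode an 8-bit body. The sample messages and body text are invented for illustration, and the transfer encoding token is spelled “8bit”, as it appears on the wire.

```python
from email import message_from_bytes

# A made-up message using the headers shown above; the body holds raw
# ISO-8859-1 octets, which is why the transfer encoding is 8bit.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"Content-Type: text/plain;charset=iso-8859-1\r\n"
    b"\r\n"
    b"Gr\xfc\xdfe aus K\xf6ln\r\n"
)
msg = message_from_bytes(raw)

# Fall back to the MIME default when no charset parameter is present.
charset = msg.get_content_charset("us-ascii")
body = msg.get_payload(decode=True).decode(charset)
print(charset, body.strip())

# A message with no Content-Type header at all gets the defaults.
plain = message_from_bytes(b"Subject: hi\r\n\r\nhello\r\n")
default_type = plain.get_content_type()
default_charset = plain.get_content_charset("us-ascii")
print(default_type, default_charset)
```

Without the charset parameter, a receiving program would have no reliable way to know that the octet 0xFC means “ü” rather than a character from some other 8-bit set.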

The Japanese standard for electronic messages is a version of ISO 2022 called ISO-2022-JP. In fact, this encoding needs to be extended only slightly. It can be used for Chinese and Korean as well and even as a multilingual encoding. The extended version is known as ISO-2022-JP-2 or ISO-2022-INT.
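The ISO 2022 family switches between character sets with escape sequences, so the encoded text contains only 7-bit octets. The sketch below shows this with the codecs that ship with Python, which names them iso2022_jp and iso2022_jp_2. The sample strings are assumptions chosen for this example.

```python
# Japanese text in plain ISO-2022-JP: escape sequences switch into the
# Japanese character set and back to ASCII at the end.
japanese = "日本語"
encoded_ja = japanese.encode("iso2022_jp")
print(encoded_ja)

# Every octet fits in 7 bits, so the text can travel as 7bit content.
seven_bit = all(octet < 0x80 for octet in encoded_ja)

# Korean is outside plain ISO-2022-JP...
korean = "한국어"
try:
    korean.encode("iso2022_jp")
    plain_jp_handles_korean = True
except UnicodeEncodeError:
    plain_jp_handles_korean = False

# ...but the extended ISO-2022-JP-2 can carry both in one string.
mixed = japanese + " " + korean
encoded_mixed = mixed.encode("iso2022_jp_2")
round_trip = encoded_mixed.decode("iso2022_jp_2")
print(plain_jp_handles_korean, round_trip == mixed)
```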

MIME-encoded Words: Non-ASCII Text in Headers

The MIME standard also provides a mechanism for putting non-ASCII text in headers. RFC-822 makes this illegal, so use of this mechanism will result in gibberish being displayed by mail programs that do not implement MIME. However, most mail programs today are MIME-aware, so this should not present any problems. If your correspondents complain, tell them to get a MIME-aware mailer.

The mechanism is simple. Non-ASCII text is encoded using either quoted-printable encoding or BASE64 encoding according to convenience, and bundled up into an encoded word. The reason it must be bundled into an encoded word is that the Content-Type header applies to the body, and if the body is multipart, there will be no charset parameter. Using a special header to control the format of headers seems silly, so the encoded word itself will contain the necessary character set information.

The format of an encoded word begins with the characters =?, continues with the name of the character set used, the character ?, either the letter Q (for “quoted-printable”) or the letter B (for BASE64), the character ? again, the encoded text, and finally the characters ?=. For example, the French word “voilà” is encoded =?ISO-8859-1?Q?voil=E0?=. This is incredibly inefficient, of course, but encoded words will be used only a few times per message. Note that one extra restriction, not present in the basic encoding, is put on quoted-printable encoding here: any question marks in the text must themselves be encoded. Otherwise, a literal question mark followed by an encoded octet (which begins with =) would produce the sequence ?= and be interpreted as the end of the encoded word.
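The layout of an encoded word can be sketched in a few lines of Python. The helper q_encode_word below is hypothetical, written only for this illustration: it handles a single short word and ignores the length limits a real mailer must respect. It also encodes a space as an underscore, which is a rule of the encoded-word “Q” encoding not discussed above. The standard library's decode_header function is used to check the result.

```python
import string
from email.header import decode_header

# Octets that may appear literally; everything else is written as =XX.
# Note that "?", "=" and "_" are deliberately absent from this set.
SAFE = (string.ascii_letters + string.digits + "!*+-/").encode("ascii")


def q_encode_word(text: str, charset: str) -> str:
    """Build a "Q" encoded word (hypothetical helper, single word only)."""
    pieces = []
    for octet in text.encode(charset):
        if octet in SAFE:
            pieces.append(chr(octet))
        elif octet == 0x20:
            pieces.append("_")          # space is written as underscore
        else:
            pieces.append(f"={octet:02X}")
    return f"=?{charset}?Q?{''.join(pieces)}?="


word = q_encode_word("voilà", "ISO-8859-1")
print(word)

# The question mark is encoded so that it cannot end the word early.
question = q_encode_word("quoi?", "ISO-8859-1")
print(question)

# Decode with the standard library to confirm the round trip.
data, found_charset = decode_header(word)[0]
decoded = data.decode(found_charset)
print(decoded)
```

Because the character set name travels inside the word itself, a mail program can decode each header fragment without consulting the Content-Type header of the body.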

