Alphabet Soup: The Internationalization of Linux, Part 2
One of the earliest and most important applications for the Internet is messaging, either direct to recipients (electronic mail) or broadcast (Usenet newsgroups). From the internationalization point of view, these are basically the same; internationalization doesn't care about the transmission mechanism, only how the content is handled.
Because messaging was an early application, it assumes a rather restricted environment. In particular, it assumes the data stream is limited to 7-bit values, and one cannot even be sure that all ASCII characters will be transmitted without error. For example, if a message originates in the UNIX world, passes through BITNET (that is, through EBCDIC encoding) and returns to UNIX, some characters are likely to be corrupted. These days such corruption is unlikely, but it was commonplace when the standards were designed. These restrictions are now defined in standards and widely implemented in software, so they are likely to persist for the foreseeable future, even though the hardware and software for Internet data transmission are extremely reliable.
The Internet mail transmission protocol (SMTP) is defined in RFC-821. The main provision of interest is that the transmission channel must transmit all 128 ASCII characters properly. 8-bit-clean channels are encouraged, but implicitly 7-bit characters are the norm. Internet messages are standardized in RFC-822 for electronic mail and RFC-1036 for Usenet. RFC-1036 adopts RFC-822 nearly in full, so I will refer to these two message standards together as RFC-822.
RFC-822 is intended first of all to be compatible with RFC-821. The content of a message is divided into the part that is relevant to the mail transport system, the headers, and the part that is irrelevant to transporting the message, the body. RFC-822 allows users to send 8-bit content in the body at their own risk, but the headers must be in a 7-bit code, specifically ASCII. This is rather annoying to non-English-speaking users. To permit non-English text in subject headers and in comments (particularly full names associated with addresses), and to provide reliable transport for non-ASCII body content, both non-English text and binary data of various kinds, the Multipurpose Internet Mail Extensions (MIME) suite of protocols was defined. Today, this standard occupies no fewer than five RFCs (2045-2049). We will be interested only in those parts related to internationalization.
The MIME transfer encodings are like the UCS transformation formats discussed above. They allow arbitrary content to be expressed in a way that will not choke the transmission channel or be damaged by it. MIME defines two transfer encodings that actually transform the data, quoted-printable and BASE64.
The quoted-printable encoding is very simple. Any octet may be represented by its hexadecimal code, preceded by an equals sign. So a space character is represented as =20 and the Spanish small enye (ñ) is =F1. The Latin capital letter A is =41. However, in practice these escapes are used only in three circumstances. First, since the equals sign is the escape character, it must itself be represented by =3D. Second, some software strips trailing whitespace, in particular on systems with record-oriented storage that do not use control characters to represent line breaks, so a space or tab that ends a line will be encoded as =20 or =09, respectively. This is important to the signature convention used on Usenet newsgroups. Finally, non-ASCII octets and most control characters will be encoded. Thus, the quoted-printable encoding is intended for applications, such as Western European languages, where most characters come from the basic Latin (i.e., ASCII) set. In fact, one quickly learns to read quoted-printable text accurately without decoding it.
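The three circumstances above can be sketched in a few lines of Python. This is a simplified illustration, not a complete MIME implementation: the function name `qp_encode_line` is made up for this example, it handles a single line without its line terminator, and it omits the soft line breaks that MIME requires for long lines.

```python
import quopri


def qp_encode_line(line: bytes) -> bytes:
    """Quoted-printable encode one line (without its line terminator).

    Hypothetical helper for illustration only; it does not insert the
    soft line breaks that MIME requires for long lines.
    """
    out = bytearray()
    last = len(line) - 1
    for i, octet in enumerate(line):
        is_whitespace = octet in (0x20, 0x09)           # space or tab
        is_printable = 0x21 <= octet <= 0x7E and octet != 0x3D
        if is_printable or (is_whitespace and i != last):
            out.append(octet)                           # passed through as-is
        else:
            # Equals sign, trailing whitespace, control character,
            # or non-ASCII octet: escape as "=" plus two hex digits.
            out += b"=%02X" % octet
    return bytes(out)


if __name__ == "__main__":
    samples = [
        "ma\u00f1ana".encode("iso-8859-1"),   # ñ becomes =F1
        b"a=b",                                # the escape character itself
        b"-- ",                                # Usenet signature separator
    ]
    for raw in samples:
        encoded = qp_encode_line(raw)
        # The standard library decoder should recover the original octets.
        print(raw, encoded, quopri.decodestring(encoded) == raw)
```

The standard library's `quopri` module is used here only to check that the sketch decodes back to the original octets.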
Note that this is a transfer encoding. It is a purely mechanical transformation and provides no information about the intended meaning of the character. Although ñ is one interpretation of =F1, there are many others, since the ISO 8859 character sets do not all assign the same character to that octet (in ISO 8859-2, for example, it is ń). Quoted-printable encoding provides no indication of which is intended.
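To see the ambiguity concretely, the short sketch below decodes the single octet F1 under a few ISO 8859 character sets using Python's built-in codecs. The helper name `interpretations` and the choice of character sets are illustrative assumptions, not part of any standard.

```python
import unicodedata


def interpretations(octet: bytes, charsets):
    """Map each character set name to the character it assigns to `octet`."""
    return {name: octet.decode(name) for name in charsets}


if __name__ == "__main__":
    charsets = ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7")
    for name, char in interpretations(b"\xf1", charsets).items():
        # Print the Unicode name so the difference is visible even if
        # the terminal cannot display every glyph.
        print(f"{name}: {char} ({unicodedata.name(char)})")
```

The same octet comes out as a different letter in each case, which is why a message must also say which character set was intended.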
The BASE64 encoding is intended to be a robust encoding for arbitrary binary data, including images and audio. However, it is also commonly used for languages like Japanese, where text read octet by octet as ASCII characters is illegible without decoding anyway. It is more efficient than quoted-printable for such data, using only about 33% more space than the original text, whereas each quoted character uses three times as much space as the unencoded octet. BASE64 is similar to the famous uuencode format long used in UNIX for the same purpose, but the characters used for the encoding are limited to the 52 upper- and lowercase Latin letters, the 10 decimal digits, the plus sign and the slash.
The equals sign is also used, as padding. The reason for this choice is that base 64 is a convenient radix for byte-oriented encoding, since four base-64 digits can encode 24 bits or 3 octets. The characters chosen are passed intact by all known systems, which is not true of some of the punctuation marks used in the uuencode algorithm. The encoding algorithm is obvious:
1. Break up the data stream into groups of three octets. The last group may have one or two octets and will be treated specially.
2. For each group of three, concatenate the octets into a 24-bit string, then break it into four 6-bit groups. Interpret each as a 6-bit binary integer and use it as an index into the BASE64 alphabet (values 0-25 are A-Z, 26-51 are a-z, 52-61 are 0-9, 62 is the plus sign and 63 is the slash). This results in a group of four base-64 digits. Add them to the output.
3. If there is a remaining group, it has either one or two octets in it. Add one or two null octets to complete the group of three. Now treat it as in Step 2, except that if there was one octet in the group, add the first two base-64 digits to the output and pad the end with two equals signs to make a group of four. If there were two octets in the final group, add the first three base-64 digits to the output and pad with a final equals sign to make a group of four.
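The steps can be written out almost literally in Python. This is a minimal sketch for illustration: the names `ALPHABET` and `base64_encode` are invented here, and the code omits the line wrapping that MIME requires for encoded output. Real programs should use the standard `base64` module, which appears below only to cross-check the result.

```python
import base64

# The 64 digits: 52 Latin letters, 10 decimal digits, plus sign and slash.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def base64_encode(data: bytes) -> str:
    """Encode `data` following the three steps in the text (no line wrapping)."""
    out = []
    # Step 1: walk through the data in groups of three octets.
    for start in range(0, len(data), 3):
        group = data[start:start + 3]
        missing = 3 - len(group)              # 0, or 1 or 2 for a short last group
        # Step 3 (first half): complete a short group with null octets.
        padded = group + b"\x00" * missing
        # Step 2: one 24-bit string, split into four 6-bit integers.
        bits = int.from_bytes(padded, "big")
        digits = [ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Step 3 (second half): replace digits that came only from the
        # added null octets with equals signs.
        digits = digits[:4 - missing] + ["="] * missing
        out.extend(digits)
    return "".join(out)


if __name__ == "__main__":
    for sample in (b"Man", b"Ma", b"M"):
        mine = base64_encode(sample)
        reference = base64.b64encode(sample).decode("ascii")
        print(sample, mine, mine == reference)
```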
Notice that by using the equals sign it is always possible to exactly decode the original text; there will not even be a spurious null character at the end. Furthermore, the algorithm is very fast and space-efficient, given the restrictions.
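Both claims can be checked with Python's standard `base64` module. The sketch below uses two illustrative helpers, `roundtrip` and `encoded_length`, whose names are assumptions made for this example. It shows that the padding distinguishes a two-octet message from the same message followed by a genuine null octet, and that the encoded size is four characters for every three octets, rounded up to a whole group.

```python
import base64


def roundtrip(data: bytes):
    """Return the BASE64 form of `data` and the result of decoding it again."""
    encoded = base64.b64encode(data)
    return encoded, base64.b64decode(encoded)


def encoded_length(n: int) -> int:
    """Number of base-64 characters (without line breaks) for n octets."""
    return 4 * ((n + 2) // 3)


if __name__ == "__main__":
    # b"ab" and b"ab\x00" differ only by a trailing null octet; the equals
    # sign in the first encoding records that the null was merely padding.
    for data in (b"a", b"ab", b"ab\x00", b"abc"):
        encoded, decoded = roundtrip(data)
        print(data, encoded, decoded == data)

    # Size overhead: 300 octets become 400 characters, one third more.
    print(encoded_length(300), len(base64.b64encode(bytes(300))))
```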