Unicode
Growing up in the northeastern United States, I never had to use a language other than English. I read in English, spoke in English, wrote in English and conducted business in English. This was also true of the engineers who created ASCII back in 1968, who made sure the 128 ASCII characters would suffice for English-language documents. So long as you stuck with the standard set of ASCII characters, you were guaranteed the ability to move files from one computer to another without having to worry about them getting garbled.
ASCII was fine in its day, but people who spoke French, Spanish and other Western European languages quickly discovered that it was insufficient for their needs. After all, people who write in these languages on a computer want to use the correct accent marks. So over the course of time, the 7-bit ASCII code became the 8-bit extended ASCII code, including a number of special letters and symbols necessary for displaying Western European text.
But because extended ASCII was never declared a standard, a number of different, incompatible extensions to the base ASCII code became widespread. Windows had its own extensions, as did the Macintosh and NeXTSTEP operating systems. So although you could write a document in French using Windows, you would need to translate it when moving it to the Macintosh. Otherwise, the bytes would be interpreted on the receiving machine incorrectly, turning your otherwise superb French screenplay into something more akin to French toast.
International standards finally prevailed, at least somewhat, with a standard known formally as ISO-8859-1 and informally as Latin-1. Computer manufacturers could then exchange Western European documents without having to worry about things becoming garbled. Of course, this meant we were using all eight bits of each character's byte, doubling the number of available characters from 128 to 256.
However, this didn't solve all of the problems. For example, Hebrew speakers have their own standard, ISO-8859-8, which is identical to Latin-1 for characters 0-127 and quite different from 128-256. A document written in Hebrew but displayed on a computer using Latin-1 will look like a letter substitution puzzle using letters from the wrong alphabet.
Practically speaking, this means you cannot write a document that contains English, Hebrew and French using the ISO-8859 series of standards. And indeed, this makes sense given the fact that we have only 256 characters to play with in a single 8-bit byte. But it raises some serious questions and issues for those of us who work with more than two languages.
Things get especially hairy if you want to display a page in English, French, Hebrew and Chinese. After all, there are tens of thousands of ideographs in Chinese, not to mention Japanese and other languages.
Enter Unicode, the ASCII table for the next century. Like ASCII, Unicode assigns a number to each letter, number and symbol. Unlike ASCII, Unicode contains enough space for every written symbol ever created by humans. This means that a Unicode document can contain any number of characters from any number of languages, without having to worry about clashes between them. Unicode also handles a number of issues that ASCII never dreamed about, including combining characters (for accents and other diacritical marks) and directional issues (for languages that do not read from left to right).
Unicode has been around for about a decade, but it is only now becoming popular and supported for web applications. This month, we take a look at Unicode as it affects web developers. What should you consider? What do you need to worry about? And, how can you get around the problems associated with Unicode?
Unicode, like ASCII, assigns a unique number to each letter, number, symbol and control character. As indicated above, though, Unicode extends through each of the symbols and character sets ever created. So using Unicode, you can create a document that uses English, Russian, Japanese and Arabic, in which each character is clearly distinct from the others.
How do we turn these unique numbers—known as code points in the Unicode universe—into bits and bytes? The encoding for ASCII is very straightforward; with only 127 characters (or 256, if you include the various extensions), each ASCII character will fit into a single byte. And indeed, C programmers know that the char data type is an 8-bit integer.
The most obvious solution is to assign a fixed multibyte encoding for our Unicode characters. And indeed, UCS-2 is such an encoding, using two bytes to describe all of the basic 65,536 Unicode characters. (There are some extended characters that require additional bytes, but we won't go into that.) UCS-2 assigns a single 2-byte code to each of these characters. Documents are thus equally long regardless of the language in which they are written, and programs can easily calculate the number of bytes they need by doubling the number of characters. Microsoft's modern operating systems use UCS-2, as you might have noticed if you exchange any documents with users of those systems.
But there is a basic problem with UCS-2, namely its incompatibility with ASCII. If you have 100,000 documents written in ASCII, you will have to translate them into UCS-2 in order to read them accurately. Given that most modern programs work with ASCII, this lack of backward compatibility is quite a problem.
Enter UTF-8, which is a variable-length Unicode encoding. Just as Roman and Arabic numerals represent the same numbers differently, UTF-8 and UCS-2 are simply different encodings for the same underlying Unicode character set. But whereas every UCS-2 character requires two bytes, a UTF-8 character might require anywhere from one to four bytes. One-byte UTF-8 characters are the same as in ASCII, which means that a legal ASCII document is also a legal UTF-8 document. However, Latin-1 and other 8-bit character sets are incompatible with UTF-8; existing Latin-1 documents will not only need to be transformed but could potentially double in size.
UTF-8 is the preferred encoding on UNIX and Linux systems, as well as in most of the standards and open-source software that I tend to use. Perl, Python, Tcl and Java all encode strings in UTF-8. PostgreSQL has supported UTF-8 for years, and Unicode support has apparently been added to MySQL 4.1, which will be released in alpha in the coming months.
Adding Unicode support to an existing system is a Herculean task for which the various developers should be given great praise. Not only do developers need to add support for multibyte characters, but databases and languages also need to support regular expressions and sorting operators, neither of which is easy to do.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- Reply to comment | Linux Journal
36 min 48 sec ago - Nice article, thanks for the
11 hours 17 min ago - I once had a better way I
17 hours 3 min ago - Not only you I too assumed
17 hours 20 min ago - another very interesting
19 hours 13 min ago - Reply to comment | Linux Journal
21 hours 7 min ago - Reply to comment | Linux Journal
1 day 4 hours ago - Reply to comment | Linux Journal
1 day 4 hours ago - Favorite (and easily brute-forced) pw's
1 day 6 hours ago - Have you tried Boxen? It's a
1 day 12 hours ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




Comments
International ASCII codes
Where do I find the Linux ASCII codes for Denmark and Germany? I know what the individual foreign characters are but I don't know how to use them on letters or my kmail So I can write to my family.
Can you help?
Thank You
Bill Hansen
thank you
hi reuven, thanks a lot for a comprehensive article. i've always stabbed in the dark regarding charsets and encoding, but now am on the right path (regarding this at least).