Unicode
Now that we have gotten the basics out of the way, let's consider how Unicode documents are transferred across the Web. The basic problem is this: when your browser receives a document, how does it know if it should interpret the bytes as Latin-1, Big-5 Chinese or UTF-8?
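To make the problem concrete, here is a minimal sketch in Python (a language chosen purely for illustration; the sample character is an assumption, not something from a real page). It decodes the same two bytes under each of the three encodings mentioned above and prints the results as ASCII-safe escapes:

```python
# The same two bytes, interpreted under three different encodings.
raw = "\u00e9".encode("utf-8")      # e-acute, stored as the bytes C3 A9

as_utf8 = raw.decode("utf-8")       # the single character that was intended
as_latin1 = raw.decode("latin-1")   # two unrelated Latin-1 characters
# errors="replace" keeps the sketch from raising if the bytes are not valid Big5.
as_big5 = raw.decode("big5", errors="replace")

for label, text in [("UTF-8", as_utf8), ("Latin-1", as_latin1), ("Big5", as_big5)]:
    # The !a conversion prints non-ASCII characters as escapes.
    print(f"{label:8} {text!a}")
```

Nothing in the bytes themselves says which reading is correct, which is why the receiver has to be told.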
The answer lies in the Content-type HTTP header. Every time an HTTP server sends a document to a browser, it identifies the type of content it is sending using a MIME-style designation, such as text/html, image/png or application/msword. If you receive a JPEG image (image/jpeg), there is only one way to interpret the bytes. But if you receive an HTML document (text/html), the Content-type header must also indicate the character set and/or encoding that is being used. We do this by adding a charset= designation to the end of the header, with a semicolon separating the type from the charset. For example:
Content-type: text/html; charset=utf-8
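As a rough illustration of how a client might pull the two parts of this header apart, the following Python sketch leans on the standard library's email.message module, which understands MIME-style headers. The function name parse_content_type is hypothetical, invented for this example:

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type value into (media_type, charset or None).

    parse_content_type is a hypothetical helper name, not a library call.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("image/png"))
```

The second call returns None for the charset, because an image type carries no charset parameter.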
Purists rightly say that UTF-8 is an encoding and not a character set. Unfortunately, it's too late to do anything about this. This is similar to the fact that the word “referrer” is misspelled in the HTTP specification as “referer”; everyone knows that it's wrong but is afraid to break existing software.
If the Content-type header does not specify a charset, the document is assumed to be Latin-1. In that case, individual documents can set (or override) this default within a meta tag. A meta tag cannot override a character set that the header states explicitly, however.
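The precedence just described can be sketched as a small Python function. The name effective_charset and its parameters are hypothetical, and the sketch assumes the charset values have already been extracted from the header and from the document's meta tag:

```python
def effective_charset(header_charset=None, meta_charset=None, default="iso-8859-1"):
    """Pick the charset using the precedence described in the text.

    1. A charset stated in the Content-type header always wins.
    2. Otherwise, a charset from the document's meta tag is used.
    3. Otherwise, fall back to Latin-1 (ISO-8859-1).
    """
    if header_charset:
        return header_charset.lower()
    if meta_charset:
        return meta_charset.lower()
    return default


print(effective_charset("utf-8", "big5"))   # header wins over the meta tag
print(effective_charset(None, "big5"))      # meta tag fills the gap
print(effective_charset())                  # nothing declared: Latin-1
```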
As you begin to work with different encodings, you will undoubtedly discover an HTTP server that has not been configured correctly and that is announcing the wrong character set in the Content-type header. An easy way to check this is to use Perl's LWP (libwww-perl, the WWW library for Perl), which includes a number of useful command-line programs for web developers. For example:
$ HEAD http://yad2yad.huji.ac.il/
Typing the above on my Linux box returns the HTTP response headers from the named site:
200 OK
Cache-Control: max-age=0
Connection: close
Date: Tue, 10 Dec 2002 08:38:37 GMT
Server: AOLserver/3.3.1+ad13
Content-Type: text/html; charset=utf-8
As you can see, the Content-type header is declaring the document to be in UTF-8.
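If Perl and LWP are not at hand, a similar check can be sketched with Python's standard urllib. This is only an approximation of what the HEAD program does; the helper names build_head_request and declared_charset are hypothetical, and the call needs network access, so it is shown as a comment instead of being run:

```python
from urllib.request import Request, urlopen


def build_head_request(url):
    """Build a request that asks for the headers only (HTTP HEAD)."""
    return Request(url, method="HEAD")


def declared_charset(url, timeout=10):
    """Return the charset announced in the server's Content-type header, or None."""
    with urlopen(build_head_request(url), timeout=timeout) as response:
        # response.headers is a MIME-style message object from the standard library.
        return response.headers.get_content_charset()


# Example use (requires network access):
#   print(declared_charset("http://yad2yad.huji.ac.il/"))
```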
Mozilla and other modern browsers allow the user to override the explicitly stated encoding. Although this should not normally be necessary for end users, I often find this functionality to be useful when developing a site.
Although it's nice to know we can transfer UTF-8 documents via HTTP, we first need some UTF-8 documents to send. Given that ASCII documents are all UTF-8 documents as well, it's easy to create valid UTF-8 documents, so long as they contain only ASCII characters. But what happens if you want to create HTML pages that contain Hebrew or Greek? Then things start to get interesting and difficult.
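The claim that ASCII documents are already UTF-8 documents is easy to check. The following Python sketch, using made-up sample strings, compares the bytes produced by the two encodings and then shows that Hebrew letters need more than one byte each:

```python
ascii_text = "Hello, world"
# Five Hebrew letters, written as escapes so this source file stays pure ASCII.
hebrew_text = "\u05e8\u05d0\u05d5\u05d1\u05df"

# For ASCII-only text, the ASCII and UTF-8 encodings produce identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Each Hebrew letter occupies two bytes in UTF-8.
hebrew_bytes = hebrew_text.encode("utf-8")

print(same_bytes, len(hebrew_text), len(hebrew_bytes))
```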
There are basically two ways to include Unicode characters in an HTML document. The first is to type the characters themselves using an editor that can work with UTF-8. For example, GNU Emacs allows me to enter text using a variety of keyboard options and then save my document in the encoding of my choice, including UTF-8. If I try to save a Chinese document in the Latin-1 encoding, Emacs will refuse to comply, warning me that the document contains characters that do not exist in Latin-1. Unfortunately, for people like me who want to use Hebrew, Emacs doesn't yet handle right-to-left input.
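The kind of check an editor makes before saving in a given encoding can be approximated in a few lines of Python. This is a sketch of the idea only, not a description of how Emacs implements it, and can_encode is a hypothetical helper name:

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


chinese = "\u4e2d\u6587"   # two Chinese characters, written as escapes

print(can_encode(chinese, "latin-1"))       # Latin-1 has no Chinese characters
print(can_encode(chinese, "utf-8"))         # UTF-8 can represent any Unicode text
print(can_encode("caf\u00e9", "latin-1"))   # accented Latin letters are fine
```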
A better option, and one that grows more impressive all the time, is Yudit, an open-source UTF-8-compliant editor that handles many different languages and directions. It can take a while to learn to use Yudit, but it does work. Yudit, like Emacs, allows you to enter any character you want, even if your operating system or keyboard does not directly support all of the desired languages.
Both Emacs and Yudit are good options if you are working on Linux, if you are willing to tinker a bit, and if you don't mind writing your HTML by hand. But nearly all of the graphic designers I know work on other platforms, and getting them to work with HTML editors that use UTF-8 has been rather difficult.
Luckily, Mozilla comes with not only a web browser but a full-fledged HTML editor as well. As you might expect, Mozilla's composer module is a bit rough around the edges but handles most tasks just fine.
Another option is to use HTML entities. The best-known entities are &lt;, &gt; and &amp;, which make it possible to insert the <, > and & symbols into an HTML document without having to worry that they will be interpreted as tags.
Modern browsers understand not only named entities such as &copy; (the copyright symbol) but also numeric references to any Unicode character. Thus, you can refer to a Unicode character by inserting &#XXXX; in your document, replacing XXXX with the character's decimal code. For example, the following HTML document displays my name in Hebrew, using Unicode entities:
<html>
<head><title>Reuven's name</title></head>
<body><p>&#1512;&#1488;&#1493;&#1489;&#1503;</p>
</body>
</html>
Creating the above document does not require a Unicode-compliant editor, and it will render fine in any modern browser, regardless of the Content-type that was declared in the HTTP response headers. However, editing a file that uses entities in this way is tedious and difficult at best. Unfortunately, the save-as-HTML feature in the international editions of Microsoft Word uses this technique extensively, which makes it easy for Word users to create Unicode-compliant documents but difficult for people to edit them later.
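Because typing decimal codes by hand is tedious, a short script can do the conversion. The Python sketch below relies on the standard xmlcharrefreplace error handler, which replaces every non-ASCII character with its decimal numeric reference; to_numeric_entities is a hypothetical helper name. Note that it does not escape <, > or &, which would need separate handling:

```python
import html


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Five Hebrew letters, written as escapes so this source file stays pure ASCII.
name = "\u05e8\u05d0\u05d5\u05d1\u05df"
encoded = to_numeric_entities(name)
print(encoded)   # &#1512;&#1488;&#1493;&#1489;&#1503;

# html.unescape reverses the conversion, recovering the original characters.
assert html.unescape(encoded) == name
```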
Comments
International ASCII codes
Where do I find the Linux ASCII codes for Denmark and Germany? I know what the individual foreign characters are but I don't know how to use them on letters or my kmail So I can write to my family.
Can you help?
Thank You
Bill Hansen
thank you
hi reuven, thanks a lot for a comprehensive article. i've always stabbed in the dark regarding charsets and encoding, but now am on the right path (regarding this at least).