Unicode is necessary for international web development but poses a few pitfalls.
Unicode and HTTP

Now that we have gotten the basics out of the way, let's consider how Unicode documents are transferred across the Web. The basic problem is this: when your browser receives a document, how does it know if it should interpret the bytes as Latin-1, Big-5 Chinese or UTF-8?
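To make the problem concrete, here is a minimal sketch in Python (my choice of language for illustration only) that reads the same two bytes under three different assumptions. The sample character and the variable names are assumptions made for this example.

```python
# The same two bytes, interpreted under three different encodings.
raw = "\u00e9".encode("utf-8")   # e with acute accent; two bytes in UTF-8

as_utf8 = raw.decode("utf-8")      # the character we started with
as_latin1 = raw.decode("latin-1")  # two unrelated Latin-1 characters
# errors="replace" keeps this from raising if the bytes are not valid Big5.
as_big5 = raw.decode("big5", errors="replace")

print(repr(raw), repr(as_utf8), repr(as_latin1), repr(as_big5))
```

Nothing in the bytes themselves says which reading is the right one, which is why the sender has to announce the encoding.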

The answer lies in the Content-type HTTP header. Every time an HTTP server sends a document to a browser, it identifies the type of content it is sending using a MIME-style designation, such as text/html, image/png or application/msword. If you receive a JPEG image (image/jpeg), there is only one way to represent the image. But if you receive an HTML document (text/html), the Content-type header must indicate the character set and/or encoding that is being used. We do this by adding a charset= designation to the end of the header, separated from the type by a semicolon. For example:

Content-type: text/html; charset=utf-8
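If you want to pull the type and the charset apart in a program, one possible approach in Python is to borrow the MIME header parser from the standard email package. This is a sketch rather than the way any particular browser does it, and parse_content_type is a name I made up for the example.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("image/png"))
```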

Purists rightly say that UTF-8 is an encoding and not a character set. Unfortunately, it's too late to do anything about this. This is similar to the fact that the word “referrer” is misspelled in the HTTP specification as “referer”; everyone knows that it's wrong but is afraid to break existing software.

If no charset is specified in the Content-type header, the document is assumed to be in Latin-1. In that case, an individual document can set the character set itself within a metatag. Metatags cannot override an explicit setting of the character set in the HTTP header, however.
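The order of precedence described here can be sketched as a small Python function. This is a simplification under stated assumptions: the regular expression is a rough stand-in for the way a real browser scans for a metatag, and effective_charset and META_CHARSET are hypothetical names.

```python
import re

# Simplified: finds charset=... inside a <meta> tag near the top of the document.
META_CHARSET = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)''', re.IGNORECASE
)


def effective_charset(header_charset, body):
    """Pick a charset: HTTP header first, then a metatag, then Latin-1."""
    if header_charset:
        # An explicit charset in the HTTP header always wins.
        return header_charset.lower()
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # Nothing was declared anywhere, so fall back to Latin-1.
    return "iso-8859-1"


page = b'<meta http-equiv="Content-Type" content="text/html; charset=big5">'
print(effective_charset("utf-8", page))
print(effective_charset(None, page))
print(effective_charset(None, b"<p>no declaration</p>"))
```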

As you begin to work with different encodings, you will undoubtedly discover an HTTP server that has not been configured correctly and that is announcing the wrong character set in the Content-type header. An easy way to check this is to use Perl's LWP (libwww-perl, the World-Wide Web library for Perl), which includes a number of useful command-line programs for web developers. For example:

$ HEAD http://yad2yad.huji.ac.il/

Typing the above on my Linux box returns the HTTP response headers from the named site:

200 OK
Cache-Control: max-age=0
Connection: close
Date: Tue, 10 Dec 2002 08:38:37 GMT
Server: AOLserver/3.3.1+ad13
Content-Type: text/html; charset=utf-8

As you can see, the Content-type header is declaring the document to be in UTF-8.
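If LWP is not installed, a similar check can be sketched with Python's standard urllib module. This is an illustration rather than a replacement for the HEAD program; build_head_request and fetch_headers are names invented for the example, and the ten-second timeout is an arbitrary choice.

```python
import urllib.request


def build_head_request(url):
    """Create a request that asks only for the response headers."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return (status, Content-Type value, charset)."""
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
        headers = response.headers
        # get_content_charset() returns None if the server declared no charset.
        return response.status, headers.get("Content-Type"), headers.get_content_charset()
```

Calling fetch_headers() with the URL of the site you are checking should return the status code along with the declared type and charset, assuming the server is reachable and answers HEAD requests.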

Mozilla and other modern browsers allow the user to override the explicitly stated encoding. Although this should not normally be necessary for end users, I often find this functionality to be useful when developing a site.

Unicode and HTML

Although it's nice to know we can transfer UTF-8 documents via HTTP, we first need some UTF-8 documents to send. Given that ASCII documents are all UTF-8 documents as well, it's easy to create valid UTF-8 documents, so long as they contain only ASCII characters. But what happens if you want to create HTML pages that contain Hebrew or Greek? Then things start to get interesting and difficult.
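The claim that every ASCII document is also a UTF-8 document is easy to check for yourself. The short Python sketch below compares the bytes produced by the two encodings and then shows what changes once Hebrew or Greek letters appear. The sample strings are my own.

```python
ascii_text = "Hello, world"

# For pure ASCII text, the two encodings produce identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

hebrew = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters
greek = "\u03b1\u03b2\u03b3"         # alpha, beta, gamma

# Each of these letters needs two bytes in UTF-8.
hebrew_bytes = len(hebrew.encode("utf-8"))
greek_bytes = len(greek.encode("utf-8"))

print(same_bytes, len(hebrew), hebrew_bytes, len(greek), greek_bytes)
```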

There are basically two ways to include Unicode characters in an HTML document. The first is to type the characters themselves using an editor that can work with UTF-8. For example, GNU Emacs allows me to enter text using a variety of keyboard options and then save my document in the encoding of my choice, including UTF-8. If I try to save a Chinese document in the Latin-1 encoding, Emacs will refuse to comply, warning me that the document contains characters that do not exist in Latin-1. Unfortunately, for people like me who want to use Hebrew, Emacs doesn't yet handle right-to-left input.
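The refusal to save Chinese text as Latin-1 reflects a real limit of the encoding, not a quirk of the editor. A minimal Python sketch of the same check follows; can_encode is a hypothetical helper, and it says nothing about how Emacs implements its own test.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


chinese = "\u4e2d\u6587"  # two Chinese characters
french = "caf\u00e9"      # the accented e exists in Latin-1

print(can_encode(chinese, "latin-1"))
print(can_encode(chinese, "utf-8"))
print(can_encode(french, "latin-1"))
```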

A better option, and one that grows more impressive all the time, is Yudit, an open-source UTF-8-compliant editor that handles many different languages and directions. It can take a while to learn to use Yudit, but it does work. Yudit, like Emacs, allows you to enter any character you want, even if your operating system or keyboard does not directly support all of the desired languages.

Both Emacs and Yudit are good options if you are working on Linux, if you are willing to tinker a bit, and if you don't mind writing your HTML by hand. But nearly all of the graphic designers I know work on other platforms, and getting them to work with HTML editors that use UTF-8 has been rather difficult.

Luckily, Mozilla comes with not only a web browser but a full-fledged HTML editor as well. As you might expect, Mozilla's composer module is a bit rough around the edges but handles most tasks just fine.

Another option is to use HTML entities. The best-known entities are &lt;, &gt; and &amp; which make it possible to insert the <, > and & symbols into an HTML document without having to worry that they will be interpreted as tags.
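When a program generates the HTML, these three entities do not need to be typed by hand. As one illustration, Python's standard html module can do the substitution; the sample string is an assumption made for the example.

```python
import html

snippet = "if a < b && b > c"

# html.escape() replaces &, < and > with the corresponding entities.
escaped = html.escape(snippet)
print(escaped)

# html.unescape() reverses the substitution.
restored = html.unescape(escaped)
print(restored == snippet)
```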

Modern browsers understand not only named entities such as &copy; (the copyright symbol) but also numeric references to any Unicode character. Thus, you can refer to a Unicode character by inserting &#XXXX; in your document, replacing XXXX with the character's decimal code. For example, the following HTML document displays my name in Hebrew, using Unicode entities:

    <head><title>Reuven's name</title></head>

Creating the above document does not require a Unicode-compliant editor, and it will render fine in any modern browser, regardless of the Content-type that was declared in the HTTP response headers. However, editing a file that uses entities in this way is tedious and difficult at best. Unfortunately, the save-as-HTML feature in the international editions of Microsoft Word uses this extensively, which makes it easy for Word users to create Unicode-compliant documents but difficult for people to edit them later.
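Since writing numeric entities by hand is tedious, it can help to let a program produce them. The Python sketch below relies on the built-in xmlcharrefreplace error handler. The Hebrew string is my assumption of the usual spelling of "Reuven" and is not copied from the document above.

```python
import html

# Assumed Hebrew spelling of "Reuven": resh, alef, vav, bet, final nun.
name = "\u05e8\u05d0\u05d5\u05d1\u05df"

# Any character outside ASCII is replaced by a decimal &#NNNN; reference.
entities = name.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(entities)

# Decoding the references gives back the original characters.
print(html.unescape(entities) == name)
```

The resulting string contains only ASCII characters, so it can be pasted into a document with any editor.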



International ASCII codes

Where do I find the Linux ASCII codes for Denmark and Germany? I know what the individual foreign characters are, but I don't know how to use them in letters or in KMail so I can write to my family.
Can you help?

Thank You

Bill Hansen

thank you

Justin Lawrence

hi reuven, thanks a lot for a comprehensive article. i've always stabbed in the dark regarding charsets and encoding, but now am on the right path (regarding this at least).