As I indicated earlier, Unicode is a complex standard, and it has taken some time for different languages and technologies to support it. For example, Perl 5.6.x used Unicode internally, but input and output operations couldn't easily use it, which left that support of little practical use. Perl 5.8, by contrast, has excellent Unicode support, allowing developers to write regular expressions that depend on Unicode properties.
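To give a feel for what Unicode-aware regular expressions buy you, here is a minimal sketch in Python rather than Perl. It is only an analogue: Python's standard `re` module does not offer Perl's `\p{...}` property syntax, but its `\w` class follows Unicode character properties by default for text patterns. The sample string is made up for illustration.

```python
import re

# Sample text (made up for this illustration) with two non-ASCII letters.
text = "naïve café"

# By default, \w in a str pattern matches any Unicode word character.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so accented letters
# split the words apart.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

The ASCII-only version chops each word at its accented letter, which is the kind of breakage that property-aware matching avoids.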
There are still some problems, however. A major problem that developers have to deal with is the mismatch between input encoding and storage encoding, such as when your terminal uses Latin-1 but the back end uses UTF-8. The upside of this sort of arrangement is that you can continue to use your old (non-Unicode) terminal program and fonts while connecting to your Unicode-compliant back-end program. The cost is that something has to convert between the two encodings at the boundary.
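The Latin-1 terminal and UTF-8 back end scenario can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular terminal or database: the byte string stands in for what a Latin-1 terminal would send, and the variable names are hypothetical.

```python
# Bytes as a Latin-1 terminal would send them for the word "café"
# (0xE9 is "é" in Latin-1). This input is an assumed example.
terminal_bytes = b"caf\xe9"

# Decode using the *input* encoding to get abstract characters...
text = terminal_bytes.decode("latin-1")

# ...then encode using the *storage* encoding before writing to the back end.
stored_bytes = text.encode("utf-8")

# Going the other way: read UTF-8 from storage, emit Latin-1 to the terminal.
echoed_bytes = stored_bytes.decode("utf-8").encode("latin-1")

print(stored_bytes)   # the "é" now occupies two bytes
print(echoed_bytes == terminal_bytes)
```

The key point is that the conversion always passes through decoded text. Treating the Latin-1 bytes as if they were already UTF-8 would either raise an error or silently store the wrong characters.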
Various implementations also have some holes, which might not be obvious when you first start to work on a project. For example, I recently worked on a J2EE project that used PostgreSQL on its back end and stored all of the characters in Unicode. Everything was fine until we decided to compare the user's input string with text in the database in a case-insensitive fashion. Unfortunately, the PostgreSQL function we used doesn't handle case insensitivity correctly for Unicode strings. We found a workaround in the end, but it was both embarrassing and frustrating to encounter this.
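To show why case-insensitive comparison is harder with Unicode than with ASCII, here is a small Python sketch. It is not the workaround used on the project described above (the details of that are not given here), and `caseless_equal` is a hypothetical helper name. It combines Unicode case folding with normalization, so that strings differing only in case or in composed versus decomposed accents compare equal.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings ignoring case.

    Normalizes to NFD, applies full Unicode case folding, and normalizes
    again, so composed and decomposed forms are treated alike.
    """
    def fold(s: str) -> str:
        return unicodedata.normalize(
            "NFD", unicodedata.normalize("NFD", s).casefold()
        )

    return fold(a) == fold(b)


# Simple lowercasing is not enough: "ß" lowercases to itself.
naive_match = "Straße".lower() == "STRASSE".lower()

# Case folding maps "ß" to "ss", so the two spellings match.
folded_match = caseless_equal("Straße", "STRASSE")

# A precomposed "É" matches "e" followed by a combining acute accent.
accent_match = caseless_equal("\u00c9", "e\u0301")

print(naive_match, folded_match, accent_match)
```

Whether a given database or library performs this kind of folding for you is exactly the sort of thing worth testing early in a project.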
Collating, or sorting, is also a difficult issue, and one that has bitten me on a number of occasions. Unicode defines a character set, but it does not indicate in which order the characters in that set should be sorted. Until recently, for example, “ch” was sorted as its own separate letter in Spanish-speaking countries; this was not true for speakers of English, German and French. The sort order thus depends not only on the character set, but also on the locale in which the character set is being applied. You may need to experiment with the LANG and LC_ALL environment variables (among others) to get things to work the way you expect.
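The “ch” example can be made concrete with a short Python sketch. The key function below is a hand-written, hypothetical illustration of the traditional Spanish rule, not a real locale implementation; in practice you would rely on the system's locale support (which depends on which locales are installed) or a dedicated collation library. The word list is made up.

```python
def traditional_spanish_key(word):
    """Hypothetical sort key: treat "ch" as one letter sorting after "c"."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            # Rank "ch" between "c" and "d" with a fractional value.
            key.append(ord("c") + 0.5)
            i += 2
        else:
            key.append(ord(lowered[i]))
            i += 1
    return key


words = ["cosa", "chico", "dato", "cama"]

# Plain sorting compares code points, so "chico" falls between
# "cama" and "cosa".
by_code_point = sorted(words)

# Under the traditional rule, every "c" word precedes every "ch" word.
by_traditional_spanish = sorted(words, key=traditional_spanish_key)

print(by_code_point)
print(by_traditional_spanish)
```

The same four strings come out in two different orders, and both are "correct" for some audience, which is why the locale has to be part of the sorting decision.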
Unicode is clearly the way of the future; most operating systems now support it to a certain degree, and it is becoming an entrenched standard in the computer world. Unfortunately, Unicode requires unlearning the old practice of equating characters with bytes, and it brings a great many new complexities and problems.
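The gap between characters and bytes is easy to see in a few lines of Python. This is just a sketch with made-up sample strings; the encodings shown are standard ones.

```python
# Sample strings, chosen for illustration.
word = "naïve"
japanese = "日本語"

# Length in characters (code points).
word_chars = len(word)
japanese_chars = len(japanese)

# Length in bytes depends on the encoding used.
word_utf8_bytes = len(word.encode("utf-8"))
japanese_utf8_bytes = len(japanese.encode("utf-8"))
word_latin1_bytes = len(word.encode("latin-1"))

print(word_chars, word_utf8_bytes, word_latin1_bytes)
print(japanese_chars, japanese_utf8_bytes)
```

Any code that sizes buffers, truncates strings, or counts column widths by bytes needs to be revisited once multibyte encodings enter the picture.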
If you only need to use a single language on your web site, then consider yourself lucky. But if you want to use even a single non-ASCII character, you will soon find yourself swimming in the world of Unicode. It's worth learning about this technology sooner rather than later, given that it is slowly but surely making its way into nearly every open-source system and standard.
Reuven M. Lerner (email@example.com) is a consultant specializing in web/database technologies. His first book, Core Perl, was published by Prentice Hall in January 2002. His next book, about open-source web/development environments, will be published by Apress in late 2003. Reuven lives with his wife and daughters in Modi'in, Israel.