As I indicated earlier, Unicode is a complex standard, and it has taken some time for different languages and technologies to support it. For example, Perl 5.6.x used Unicode internally, but input and output operations couldn't easily use it, which made such support basically useless. Perl 5.8 by contrast has excellent Unicode support, allowing developers to write regular expressions that depend on Unicode properties.
There are still some problems, however. A major problem that developers have to deal with is the issue of input encoding vs. storage encoding, such as when your terminal might use Latin-1 but the back end might use UTF-8. This sort of arrangement means you can continue to use your old (non-Unicode) terminal program and fonts but connect to and use your Unicode-compliant back-end program.
Various implementations also have some holes, which might not be obvious when you first start to work on a project. For example, I recently worked on a J2EE project that used PostgreSQL on its back end and stored all of the characters in Unicode. Everything was fine until we decided to compare the user's input string with text in the database in a case-insensitive fashion. Unfortunately, the PostgreSQL function we used doesn't handle case insensitivity correctly for Unicode strings. We found a workaround in the end, but it was both embarrassing and frustrating to encounter this.
Collating, or sorting, is also a difficult issue—one that has bitten me on a number of occasions. Unicode defines a character set, but it does not indicate in which order the characters in that set should be sorted. Until recently, for example, “ch” was sorted as its own separate letter in Spanish-speaking countries; this was not true for speakers of English, German and French. The sort order thus depends not only on the character set, but on the locale in which the character set is being applied. You may need to experiment with the LANG and LC_ALL environment variables (among others) to get things to work the way you expect.
Unicode is clearly the way of the future; most operating systems now support it to a certain degree, and it is becoming an entrenched standard in the computer world. Unfortunately, Unicode requires unlearning the old practice of equating characters and bytes and handling a great deal of new complexities and problems.
If you only need to use a single language on your web site, then consider yourself lucky. But if you want to use even a single non-ASCII character, you will soon find yourself swimming in the world of Unicode. It's worth learning about this technology sooner rather than later, given that it is slowly but surely making its way into nearly every open-source system and standard.
Reuven M. Lerner (email@example.com) is a consultant specializing in web/database technologies. His first book, Core Perl, was published by Prentice Hall in January 2002. His next book, about open-source web/development environments, will be published by Apress in late 2003. Reuven lives with his wife and daughters in Modi'in, Israel.
Win an iPhone 6
Enter to Win
|Geek Hide-away in Guatemala - Stay for Free!||Nov 26, 2015|
|Microsoft and Linux: True Romance or Toxic Love?||Nov 25, 2015|
|Non-Linux FOSS: Install Windows? Yeah, Open Source Can Do That.||Nov 24, 2015|
|Cipher Security: How to harden TLS and SSH||Nov 23, 2015|
|Web Stores Held Hostage||Nov 19, 2015|
|diff -u: What's New in Kernel Development||Nov 17, 2015|
- Cipher Security: How to harden TLS and SSH
- Non-Linux FOSS: Install Windows? Yeah, Open Source Can Do That.
- Microsoft and Linux: True Romance or Toxic Love?
- Geek Hide-away in Guatemala - Stay for Free!
- Web Stores Held Hostage
- Firefox's New Feature for Tighter Security
- It's a Bird. It's Another Bird!
- PuppetLabs Introduces Application Orchestration
- diff -u: What's New in Kernel Development
- IBM LinuxONE Provides New Options for Linux Deployment