As I indicated earlier, Unicode is a complex standard, and it has taken some time for different languages and technologies to support it. For example, Perl 5.6.x used Unicode internally, but input and output operations couldn't easily use it, which made that support basically useless. Perl 5.8, by contrast, has excellent Unicode support, allowing developers to write regular expressions that depend on Unicode properties.
There are still some problems, however. A major one is the mismatch between input encoding and storage encoding, such as when your terminal uses Latin-1 but the back end uses UTF-8. This sort of arrangement lets you keep using your old (non-Unicode) terminal program and fonts while connecting to a Unicode-compliant back-end program, but only if something converts the text between the two encodings along the way.
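As a minimal sketch of that conversion step, assuming input that arrives as Latin-1 bytes and a back end that stores UTF-8, the following Python decodes with the input encoding and re-encodes with the storage encoding. The variable names are illustrative only.

```python
# The word "cafe" with an accented e, as a Latin-1 terminal would send it:
# the accented letter is the single byte 0xE9.
terminal_bytes = b"caf\xe9"

# Decode with the encoding the input actually arrived in.
text = terminal_bytes.decode("latin-1")

# Re-encode for the back end, assumed here to store UTF-8.
stored_bytes = text.encode("utf-8")

print(len(terminal_bytes), len(stored_bytes))  # 4 5
print(stored_bytes)                            # b'caf\xc3\xa9'
```

The same four characters occupy four bytes in Latin-1 and five in UTF-8, which is why the bytes cannot simply be passed through unchanged.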
Various implementations also have some holes, which might not be obvious when you first start to work on a project. For example, I recently worked on a J2EE project that used PostgreSQL on its back end and stored all of the characters in Unicode. Everything was fine until we decided to compare the user's input string with text in the database in a case-insensitive fashion. Unfortunately, the PostgreSQL function we used doesn't handle case insensitivity correctly for Unicode strings. We found a workaround in the end, but it was both embarrassing and frustrating to encounter this.
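To illustrate why case-insensitive comparison is harder for Unicode than for ASCII, here is a small Python sketch. It is not the workaround we used on that project, and it does not involve PostgreSQL at all; the helper names `caseless_key` and `caseless_equal` are hypothetical.

```python
import unicodedata

def caseless_key(s):
    # Normalize to NFC first so composed and decomposed accents compare equal,
    # then apply Unicode case folding (which maps, e.g., U+00DF to "ss").
    return unicodedata.normalize("NFC", s).casefold()

def caseless_equal(a, b):
    return caseless_key(a) == caseless_key(b)

# Naive lowercasing misses the German sharp s (U+00DF).
print("STRASSE".lower() == "stra\u00dfe".lower())   # False
print(caseless_equal("STRASSE", "stra\u00dfe"))     # True

# A precomposed capital E-acute versus "e" plus a combining acute accent.
print(caseless_equal("\u00c9COLE", "e\u0301cole"))  # True
```

This is a simplification: it covers common cases, but fully conformant caseless matching as defined by the Unicode standard involves additional normalization steps.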
Collating, or sorting, is also a difficult issue, and one that has bitten me on a number of occasions. Unicode defines a character set, but it does not indicate in which order the characters in that set should be sorted. Until recently, for example, “ch” was sorted as a separate letter in Spanish-speaking countries; this was not true for speakers of English, German and French. The sort order thus depends not only on the character set, but also on the locale in which the character set is being used. You may need to experiment with the LANG and LC_ALL environment variables (among others) to get things to work the way you expect.
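The “ch” example can be sketched in a few lines of Python. This is a hand-rolled illustration, not how a real system collates: production code would rely on the locale settings mentioned above or on a collation library. The alphabet is deliberately simplified, and `ALPHABET`, `RANK` and `traditional_key` are names invented for this sketch.

```python
# Simplified traditional Spanish alphabet in which "ch" and "ll" are letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def traditional_key(word):
    """Turn a word into a list of letter ranks, treating digraphs as one letter."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the simplified alphabet sort last.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key

words = ["cuna", "chico", "dama", "cola"]
print(sorted(words))                       # ['chico', 'cola', 'cuna', 'dama']
print(sorted(words, key=traditional_key))  # ['cola', 'cuna', 'chico', 'dama']
```

The same four strings come out in two different orders, depending on whose rules are applied, even though the characters themselves never change.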
Unicode is clearly the way of the future; most operating systems now support it to a certain degree, and it is becoming an entrenched standard in the computer world. Unfortunately, Unicode requires unlearning the old practice of equating characters with bytes, and it brings with it a great deal of new complexity.
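The character-versus-byte distinction is easy to see in a short Python sketch, assuming UTF-8 as the encoding. The variable names are illustrative only.

```python
word = "na\u00efve"          # five characters; the third is i with a diaeresis
utf8 = word.encode("utf-8")  # the same text as UTF-8 bytes

print(len(word))  # 5
print(len(utf8))  # 6

# Cutting at a byte offset can split a character in half.
truncated = utf8[:3]  # b'na\xc3', the first byte of a two-byte sequence
try:
    truncated.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(split_ok)  # False
```

Code that truncates, indexes or measures text by bytes will therefore misbehave as soon as a non-ASCII character appears.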
If you only need to use a single language on your web site, then consider yourself lucky. But if you want to use even a single non-ASCII character, you will soon find yourself swimming in the world of Unicode. It's worth learning about this technology sooner rather than later, given that it is slowly but surely making its way into nearly every open-source system and standard.
Reuven M. Lerner (email@example.com) is a consultant specializing in web/database technologies. His first book, Core Perl, was published by Prentice Hall in January 2002. His next book, about open-source web/development environments, will be published by Apress in late 2003. Reuven lives with his wife and daughters in Modi'in, Israel.