Introduction to Internationalization Programming
In the old days when only a few people used computers, they were, for the most part, English speakers. Today, computers are widely available, and differences in languages, traditions and cultures need to be reflected in the world of programming. This article introduces the GNU gettext system.
The idea of using the same program but changing its properties according to the cultural traditions of different peoples is called internationalization. However, because programmers like to make words shorter, instead of typing 20 characters they type only four: i18n. I18n means programming designed to handle many languages.
Once you've written an i18n program, you may want to add a new language. This is not an i18n problem. In general, you need a person who will translate messages from the program for a specific nation. This problem is called localization, or i10n. I10n refers to the implementation of a specific language for an internationalized software, or in other words, the creation of localized objects according to the specific region's rules.
Although each organization and company that designs and distributes software tries to implement this in its own way, in general, the i18n idea is simple. Software should be created with two parts in mind: common and nation-dependent. This second part is known as localized objects.
Hopefully, standards will make life more comfortable. The basic concept of locale was introduced by the ISO (International Standard Organization) with the C standard in 1990, which was expanded in 1995. POSIX also has rules for i18n, so the term POSIX locale is used together with National Language Support (NLS). Formally, NLS is not a part of POSIX but has some functions that help when using the POSIX locale. X11 has its own i18n implementation, but the common way for programmers is to move the X11 i18n/i10n “up a level” into the POSIX/NLS locale. [Other software has its own i18n and i10n. See “Bridging the Digital Divide in South Africa” for one way to handle it].
What should we take into account when speaking of locale? Of course, the name of the language, but that is not enough. Everybody knows there are differences between American and British English, so we also have to know where the particular language is used, or in other words, the territory, taking into account individual traditions and cultural rules.
Every language has its own system of writing, and sometimes even several. Languages have alphabets, or character repertoires, but computers deal with digits. So, a character should be associated with a digit. This kind of association is called a coded character set (CCS). There are plenty of them, and each has its own name, such as ASCII, ISO-8859-1, KOI8-U. Instead of CCS, the term charset is often used. There is no special standard for the name of a charset, so ISO-8859-1, ISO8859-1 and iso8859-1 all refer to the same thing. There are some definitions from IANA, the organization that also is responsible for the registration of charsets (see Resources). As you probably know, the X11 system has its own system for charset naming, and their document “Logical Font Description Conversion” (described in Jim Flower's article, see Resources) provides a good name and alias charset creating system.
Charsets are important. Some countries have several different charsets for the same language! In the Ukraine, for instance, the same text may be displayed nicely in koi8-u but may be absolutely unreadable if a terminal uses the Ukrainian charsets iso8859-5, cp1251 or Unicode. In those cases, we would have to convert the text from one charset into another.
In order to take all of this into account, the POSIX locale defines some things that all together are called locale categories. They are shown in Table 1. Knowing them is important; C functions work differently with different locales! Categories are reflected in shell as environment variables with the same name. An example of using LC_ALL is shown in Listing 1.
The syntax to build a locale name looks like this:
where language is represented by two lowercase letters, such as en for English and fr for French; territory is represented by two uppercase letters, such as GB for United Kingdom and FR for France, and in these two cases, euro would be the modifier. So, you can change your locale by setting the corresponding environment variables. See Listing 1, where we use the programs date, cat and our example program, counter. Note, we use only language and territory; we cannot change the charset for the terminal with this command. Now imagine that the program messages are written in one charset but are output in another. POSIX does not have functions to determine the current charset, but XPG has nl_langinfo(). In some distributions, the man page for this function may be missing (Debian does not have it, but SuSE and Red Hat do). In any case, you can obtain additional information from /usr/include/langinfo.h. To determine the current charset, use the following code:
#include <locale.h> #include <langinfo.h> ... setlocale(LC_ALL,""); printf ("Current charset = %s\n", nl_langinfo(CODESET));To convert from one coding into another for the correct output, you can use the conv() function. For more details, consult “Introduction to i18n” (see Resources).
In order to provide output information, a message catalog for that locale was created. This means that all software messages are kept separately from a program that may have (and must have) its own catalog. NLS provides a set of utilities for creating and supporting such catalogs, as well as functions for extracting information according to three keys: 1) program name, 2) current categories of locale and 3) pointer to a particular message to be output.
There are two general realisations of the NLS mechanism:
X/Open Portability Guide XPG3/XPG4/XPG5 with the functions catopen(), catgets() and catclose() and the gencat utility.
SUN XView with functions gettext() and textdomain(). The GNU Project has its own fully compatible release called GNU gettext.
Usually programs, as well as system libraries, use one (or even two) NLS systems.
Although XPG5 is included in the UNIX specification version 2, and all versions of UNIX systems support it, with Linux, GNU gettext is the most popular solution.
The POSIX locale has the following components:
Locale API, i.e., subroutines like setlocale(), isalpha(), etc.
Shell environment variables to manage locale categories.
The locale utility to get information about the current locale; see man locale for more details.
Objects of localization. The default directory for their location is /usr/share/locale/.
Editorial Advisory Panel
Thank you to our 2014 Editorial Advisors!
- Jeff Parent
- Brad Baillio
- Nick Baronian
- Steve Case
- Chadalavada Kalyana
- Caleb Cullen
- Keir Davis
- Michael Eager
- Nick Faltys
- Dennis Frey
- Philip Jacob
- Jay Kruizenga
- Steve Marquez
- Dave McAllister
- Craig Oda
- Mike Roberts
- Chris Stark
- Patrick Swartz
- David Lynch
- Alicia Gibb
- Thomas Quinlan
- Carson McDonald
- Kristen Shoemaker
- Charnell Luchich
- James Walker
- Victor Gregorio
- Hari Boukis
- Brian Conner
- David Lane