Alphabet Soup: The Internationalization of Linux, Part 2
It would be nice if we could think of character sets as corresponding directly to the scripts we use to write by hand, but unfortunately things are not so simple. For example, there are nearly 200 countries in the world, each with its own currency. Of course, many share the same symbol, but it is clear that if your keyboard had a key for every currency symbol, it would be about twice as big as the one you use today. Other useful symbols are the paragraph and sectioning marks used by lawyers and the various operators and non-Latin symbols used by mathematicians. Since new characters are being created all the time (for example, the symbol for the new European monetary unit, the euro), it is impossible to include them all. So in fact, a character set is someone's idea of a useful set of characters.
Representation as bit strings inside a computer imposes further constraints. Since modern computers all work in terms of bytes as the smallest efficient unit of access, there is a big difference in the space and processing requirements for text based on a 256-character set, which can be encoded in a single byte, and a 257-character set, which cannot. One might think that the extension to two bytes, or 65,536 characters, would satisfy anyone, but it turns out that even this is not enough. The process of selecting about 20,000 ideographic characters of Chinese origin occasioned many arguments while the Unicode character set was being designed. Even those 20,000 may not be enough; only a few people in the world care about any given excluded ideographic character, but to them it may be the most important character in the world, as it is the one they use to write their name.
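The jump from 256 to 257 characters can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a fixed-width encoding in which every character gets a distinct integer code; the function name bytes_per_character is purely illustrative.

```python
def bytes_per_character(charset_size: int) -> int:
    """Smallest whole number of bytes giving every character a distinct code."""
    if charset_size < 1:
        raise ValueError("a character set needs at least one character")
    # Bits needed to number the characters 0 .. charset_size - 1.
    bits = (charset_size - 1).bit_length()
    # Round up to whole bytes; even a tiny set still occupies one byte.
    return max(1, -(-bits // 8))


for size in (128, 256, 257, 65536, 65537):
    print(size, "characters ->", bytes_per_character(size), "byte(s) each")
```

Under this assumption, one extra character beyond 256 doubles the storage for every character in the text, and one beyond 65,536 pushes the width to three bytes.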
The result is that many character sets have been designed and populated, and standards have been written to codify their use.
The most influential standard of all is the American Standard Code for Information Interchange, abbreviated ASCII. This is a list of the 128 7-bit bit strings, with an assignment of each one to either a character commonly used in American English or a control function. Many of the control functions are not used today, but so much software has been written on the assumption that hex values 0x00 to 0x1F are not printing characters that no one considers assigning a few more characters to some of those code points.
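The split between control functions and printing characters is easy to inspect. This is a minimal sketch that walks the 128 ASCII code points; besides the 0x00 to 0x1F range mentioned above, it also treats 0x7F (DEL) as a control function, which is an addition not discussed in the text. The helper name is_control is illustrative.

```python
def is_control(code_point: int) -> bool:
    """True for ASCII control functions: 0x00-0x1F, plus DEL at 0x7F."""
    return code_point <= 0x1F or code_point == 0x7F


controls = [cp for cp in range(128) if is_control(cp)]
printing = [chr(cp) for cp in range(128) if not is_control(cp)]

print(len(controls), "control functions,", len(printing), "printing characters")
print("".join(printing))  # the printing characters, beginning with the space
```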
Because nearly all existing computer languages are compatible with the ASCII character set, ASCII in some form is a subset of most electronic character sets. However, there are many variants. For example, the JIS Roman character set used in most Japanese computers is almost identical to ASCII, except that a couple of the glyphs are changed and the Japanese yen symbol is substituted for the backslash. In order to codify this development, ASCII-like character sets are defined by the International Organization for Standardization (ISO) in standard ISO 646. U.S. ASCII is designated the international reference version for ISO 646 and is occasionally referred to as ISO 646-IRV (for example, in naming fonts for the X Window System).
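The practical effect of an ISO 646 variant is that the same 7-bit value can display differently depending on which national version is assumed. The sketch below is a hand-built illustration, not a real codec: the table JIS_ROMAN_OVERRIDES and the function decode_646 are hypothetical, and the table records only the backslash-to-yen substitution described above.

```python
# Hypothetical override table: only the substitution mentioned in the text.
JIS_ROMAN_OVERRIDES = {0x5C: "\u00a5"}  # YEN SIGN where ASCII has the backslash


def decode_646(data: bytes, overrides=None) -> str:
    """Decode 7-bit data as ASCII, applying a variant's substitutions."""
    overrides = overrides or {}
    chars = []
    for value in data:
        if value > 0x7F:
            raise ValueError(f"0x{value:02X} is not a 7-bit code")
        chars.append(overrides.get(value, chr(value)))
    return "".join(chars)


path = b"C:\\TEMP"  # the bytes 43 3A 5C 54 45 4D 50
print(decode_646(path))                       # interpreted as U.S. ASCII
print(decode_646(path, JIS_ROMAN_OVERRIDES))  # interpreted as JIS Roman
```

The stored bytes are identical in both cases; only the interpretation of code point 0x5C changes.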
ASCII is simply not sufficient for use in an internationalized environment. For example, most European languages use accented characters. Certainly, it is possible to represent “Latin small letter a with acute accent” (á) as a two-character sequence (e.g., 'a), but this is inconvenient for sorting and possibly ambiguous. Furthermore, it is not obvious how to represent the caron using only ASCII characters. In order to maintain compatibility with ASCII for the sake of existing software, and accommodate many of the countries most intensively using computers, the ISO 8859 standard was designed. ISO 8859 had three main goals: maintain ASCII compatibility, implement within the constraints of ISO 2022 and provide the broadest coverage of languages within a single-octet encoding. Unfortunately, these three goals are not compatible. Several important scripts which can be encoded in a single octet require several dozen code points each for their characters because they do not overlap with ASCII or each other: Greek, Cyrillic (as used for Russian), Hebrew and Arabic.
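The sorting problem with writing á as the two characters 'a shows up as soon as strings are compared by code point. This is a minimal sketch using made-up sample words, where "'agil" stands for the Spanish word ágil.

```python
# "'agil" is the ASCII-only spelling of "ágil" using the 'a convention.
words = ["zorro", "'agil", "abeja"]

# Plain code-point ordering: the apostrophe (0x27) sorts before every letter.
print(sorted(words))
print(hex(ord("'")), "<", hex(ord("a")))
```

A dictionary would place ágil after abeja, but the apostrophe's low code value drags the accented word to the front of the list. The ambiguity is similar in kind: nothing in the data distinguishes an apostrophe used as an accent from one used as a quotation mark.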
The solution arrived at was not to define a single character set, but rather a family of character sets. Each ISO 8859 character set contains ASCII (ISO 646-IRV) as a subset, and the encoding is defined so that, interpreted as integers (C chars), the ASCII characters are encoded identically in ASCII and ISO 8859. Then a list of supplementary character sets, each including at most 96 characters to conform to ISO 2022 (described below), was defined. These supplementary characters are then assigned to the code points 0xA0 to 0xFF. Where the supplementary set is derived from an alphabet, the natural collating order is followed, but for the collections of accented characters the order is necessarily arbitrary. The current supplementary character sets are listed in Table 1.
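The family structure can be observed by decoding the same two byte values under several members of ISO 8859. The sketch below relies on the ISO 8859 codecs that ship with Python's standard library, a tool not mentioned in the article; the helper name describe is illustrative.

```python
import unicodedata

# Four members of the family: Latin-1, Latin-2, Cyrillic and Greek.
PARTS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]


def describe(byte_value: int, codec: str) -> str:
    """Decode a single byte with the given codec and return its Unicode name."""
    char = bytes([byte_value]).decode(codec)
    return unicodedata.name(char)


for codec in PARTS:
    # 0x41 lies in the shared ASCII half; 0xE1 lies in the supplementary half.
    print(codec, "| 0x41:", describe(0x41, codec), "| 0xE1:", describe(0xE1, codec))
```

The byte 0x41 names the same letter under every codec, while 0xE1 names a Latin, Cyrillic or Greek letter depending on which supplementary set is in effect. This is why text in an ISO 8859 encoding cannot be read reliably without knowing which member of the family was used.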