UpFRONT | Linux Journal

by Doc Searls

on June 1, 2001

Linux Makes Machine Translation Development Possible

Systran Internet Translation Technologies was born during the Cold War, when the US government wanted to translate a large quantity of Russian texts quickly. At the end of the sixties, it became a private company called Systran, located in La Jolla, California.

In the nineties, Systran decided to dump the OS/390 running under MVS and port the whole system to UNIX. By then, PCs had become powerful enough to host the translation engines. An automatic translator was used to migrate most of the assembly code into C code.

The original port was made to Solaris, but the company quickly switched to cheaper hardware: PCs running Slackware (it has since moved to Red Hat). The reasons for choosing Linux were: it runs on a variety of hardware; it provides all the tools a developer may need; natural language processing uses large texts, which require powerful tools; the translation engine uses a large set of rules, so the migration produced large C/C++ programs that needed powerful tools such as Make and gcc/g++; clients like AltaVista have a large audience and need a robust application on a stable system; and cost.

To these can be added the fact that drivers for newer hardware appear more quickly on Linux than on other platforms, and it uses marginally fewer resources. Linux provides a homogeneous configuration that is easy to replicate, and it is very scalable. Linux also comes with a firewall, sendmail, Apache, mod_perl and PostgreSQL, all of which are needed for Systran's on-line services (http://www.systranlinks.com/, http://www.systranet.com/). Moreover, environments like GNOME or KDE make it possible to put Linux in the hands of non-programmers as well. This is important because a large number of the Systran staff are linguists rather than programmers. Finally, POSIX compliance ensures that Systran can port easily to other forms of UNIX.

Systran software is behind most of the automatic translation done in the world. Clients include not only US government agencies and European institutions, but also AltaVista, Microsoft, Apple, Lycos and AOL.

Machine translation is at the confluence of linguistics and computer science. Developing a product is simply translating into computer language all the rules of human language. The main problem is a linguistic one, since you need to start with an accurate description of the languages concerned. There is a description of the source language (the analysis phase) and one of the target language (the synthesis phase).

The code is divided into four parts: 1) the analysis of the source language; 2) the synthesis of the target language; 3) the transfer rules; and 4) the procedures common to all translation engines, such as memory management, command-line management, dictionary lookup procedures, filters, pre-processing and post-processing.
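To make the division concrete, here is a minimal Python sketch of an analysis, transfer and synthesis chain. It is a toy under stated assumptions, not Systran's code: the two-word lexicon, the transfer table and every function name are hypothetical, and a real engine would apply many grammar rules at each stage.

```python
from dataclasses import dataclass

@dataclass
class Token:
    surface: str  # the word as it appeared in the source text
    lemma: str    # dictionary form found during analysis
    pos: str      # part of speech, e.g. "verb" or "pronoun"

# Hypothetical toy data for illustration only.
FRENCH_LEXICON = {
    "je": ("je", "pronoun"),
    "mange": ("manger", "verb"),
}
TRANSFER_RULES = {"je": "I", "manger": "eat"}

def analyze(text):
    """Analysis: describe each source word (lemma and part of speech)."""
    tokens = []
    for word in text.lower().split():
        lemma, pos = FRENCH_LEXICON.get(word, (word, "unknown"))
        tokens.append(Token(word, lemma, pos))
    return tokens

def transfer(tokens):
    """Transfer: map source lemmas to target lemmas, keeping the grammar tags."""
    return [Token(t.surface, TRANSFER_RULES.get(t.lemma, t.lemma), t.pos)
            for t in tokens]

def synthesize(tokens):
    """Synthesis: produce target-language text from the transferred tokens."""
    return " ".join(t.lemma for t in tokens)

def translate(text):
    """Run the three stages in order; unknown words pass through unchanged."""
    return synthesize(transfer(analyze(text)))
```

Calling `translate("Je mange")` walks the sentence through all three stages. Keeping the stages separate is what lets one analysis module serve several target languages.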

The dictionaries used are very specific; they include not only the translation of the words (e.g., manger = to eat) but also syntactic and lexical information, such as “this verb is transitive, it can be used in this specific context in which case it means this.” There are three kinds of dictionaries. The first two are internal, one with simple word stems and the other with complex or idiomatic expressions. The third is external: these dictionaries are created on demand for a specific customer on a specific theme. Systran also has resource files that contain the inflections for the verbs or the declensions for languages that have them, as well as specific priority rules for the external (customer) dictionary and stylistic indications. All this is coded in C, although the newer extensions generally are coded in C++.

In order to produce the rules, linguists use a graphical interface built with GTK. The data is stored in an ASCII file that goes through a Perl program to generate the macroinstructions from the data in the code. The dictionaries are built semiautomatically using the rules discovered during the analysis of the language. A unilingual master dictionary is created for each language; terminology is entered, and Systran's tools automatically add the relevant linguistic information on the basis of tables. For example, “automatically” would be recognized as an adverb because it ends in “ally”. Bilingual dictionaries are then built by creating a simple double-entry list, which will then retrieve the relevant syntactic information from the master unilingual dictionary.
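The dictionary-building steps described above can be sketched in a few lines of Python. This is an illustration only: the suffix table, the dictionary layout and the function names are assumptions, and only the “ally” rule comes from the description itself.

```python
# Hypothetical suffix table: word ending -> part of speech.
# Only the "ally" rule is taken from the article; the others are made up.
SUFFIX_TABLE = [
    ("ally", "adverb"),
    ("ness", "noun"),
    ("ize", "verb"),
]

def guess_pos(word):
    """Return the part of speech suggested by the first matching suffix."""
    for suffix, pos in SUFFIX_TABLE:
        if word.endswith(suffix):
            return pos
    return "unknown"

def build_master(terms):
    """Unilingual master dictionary: term -> linguistic information."""
    return {term: {"pos": guess_pos(term)} for term in terms}

def build_bilingual(pairs, master):
    """Turn a double-entry list of (source, target) pairs into a bilingual
    dictionary, pulling the syntactic information from the master dictionary."""
    return {
        source: {"target": target, **master.get(source, {"pos": "unknown"})}
        for source, target in pairs
    }

master = build_master(["automatically", "kindness", "table"])
bilingual = build_bilingual([("automatically", "automatiquement")], master)
```

Here `bilingual["automatically"]` carries both the French target and the adverb tag inherited from the master dictionary. A suffix table alone is crude (the noun “ally” also ends in “ally”), which is presumably why the process is described as semiautomatic.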

It is only at the last stage that the dictionaries are compiled into binary format in order to increase processing speed at runtime. When you clicked on the Translate button on AltaVista, you probably never thought the process behind it was so complex!
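Why a binary format helps at runtime can be shown with a small sketch. The record layout below (two 16-byte fields, sorted by key) is purely an assumption for illustration, since the actual Systran format is not described; the point is that a lookup becomes a binary search over fixed-width records instead of a parse of text files.

```python
import struct

# Hypothetical fixed-width record: a 16-byte key followed by a 16-byte value.
# Keys and values are assumed to fit in 16 bytes of UTF-8 each.
RECORD = struct.Struct("16s16s")

def compile_dictionary(entries):
    """Pack (key, value) pairs, sorted by key, into one block of bytes."""
    blob = bytearray()
    for key, value in sorted(entries.items()):
        blob += RECORD.pack(key.encode("utf-8"), value.encode("utf-8"))
    return bytes(blob)

def lookup(blob, key):
    """Binary search over the packed records; returns None if the key is absent."""
    target = key.encode("utf-8")
    lo, hi = 0, len(blob) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw_key, raw_value = RECORD.unpack_from(blob, mid * RECORD.size)
        found = raw_key.rstrip(b"\0")  # drop the padding added by struct
        if found == target:
            return raw_value.rstrip(b"\0").decode("utf-8")
        if found < target:
            lo = mid + 1
        else:
            hi = mid
    return None
```

For example, `lookup(compile_dictionary({"manger": "to eat"}), "manger")` finds the entry without reading any text at lookup time.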

Systran is preparing a free Linux release with all the features of the Systran Personal Windows edition.

—Thunus F., Director of Systran Luxembourg

NASA's JPL Builds War Game Simulator On Linux

The Jet Propulsion Laboratory (JPL) of Pasadena, California, is one of the space program's major players. Managed for NASA by the California Institute of Technology, JPL is the lead US center for robotic exploration of the solar system, and its spacecraft have visited all known planets except Pluto. In addition to its work for NASA, JPL conducts research and development projects for a variety of federal agencies. One such project, the Corps Battle Simulation (CBS), recently made the transition from VAX to Red Hat Linux Version 7.0, resulting in a substantial increase in performance at considerably reduced cost.

CBS has been used to train army officers in battle tactics for over 15 years. Previously, it ran on the most powerful VAX computer, a $100,000-plus 7800-series machine. However, as the simulation grew steadily more intelligent and gained new features, CBS reached the limits of the VAX. This made further innovation a struggle and threatened to render the battle simulator obsolete within a few years. As a result, the US Army's Simulation, Training, and Instrumentation Command (STRICOM) in Orlando, Florida, asked JPL to port the software to Linux in order to increase functionality while cutting cost.

After spending a man-year reconfiguring CBS source code, then recompiling, testing and debugging, the team benchmarked the system running on Linux with rewarding results. “By porting CBS from VAX to Linux, we have achieved far better performance at a much reduced cost and have lots of extra capacity,” says Jay Braun, a simulation software technologist at JPL.

The additional capacity of Linux gives the CBS system more room to expand. Terrain elevation, for instance, can now be modeled at a very detailed level. Previously, attempting complex line of sight calculations severely taxed VAX capabilities. Now, high-fidelity maps are available on Linux that make simulations more realistic, increasing the accuracy of the battle scenarios.

CBS is running on a $4,000 PC with a 1.2 gigahertz AMD Athlon processor. This Linux machine runs the largest CBS exercise almost four times faster than the most powerful VAX without sacrificing anything in model fidelity. Using the VAX, fidelity had to be reduced in order to allow a simulation to progress at a one-to-one game ratio, i.e., a virtual minute in the simulation requires a real minute of execution time. Under Linux, however, one-to-one scenarios can be achieved at the highest quality levels available.
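The game ratio is simple arithmetic, and a short Python sketch shows why a speedup of this size matters. The workload numbers below are hypothetical, chosen only to illustrate the idea, and the factor of four is rounded up from the “almost four times” reported above.

```python
def game_ratio(simulated_minutes, wall_clock_minutes):
    """Simulated time advanced per unit of real time; 1.0 means one-to-one."""
    return simulated_minutes / wall_clock_minutes

def keeps_real_time(ratio):
    """A run holds a one-to-one game ratio only if it is at least this fast."""
    return ratio >= 1.0

# Hypothetical workload: at full fidelity the old machine advances only
# 30 simulated minutes in 60 real minutes, so fidelity must be reduced.
old_ratio = game_ratio(30, 60)

# Assumed speedup of 4 applied to the same full-fidelity workload.
new_ratio = old_ratio * 4
```

Under these assumed numbers the old machine falls short of real time at full fidelity, while the faster one holds one-to-one with capacity to spare.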

JPL has also made adjustments so that CBS has a 20-second save time for the largest exercises and three seconds for small exercises. This is an order of magnitude faster than on the old VAX system. Under Linux the application can now represent almost 3GB of virtual address space for each simulation. “That's a big image!” says Braun. “Our model has plenty of features that are pushing the limits of Linux.”

JPL will deliver the ported software in June of 2001. Braun predicts that in the near future, the system will further advance to a two-processor machine that can support additional simulations. JPL is now shifting over to Red Hat Linux 7.1 with the new 2.4 kernel.

Fine Print

No text selections can be copied from this book to the clipboard....No printing is permitted on this book.... This book cannot be lent or given to someone else....This book cannot be given to someone else....This book cannot be read aloud.

—From the “permissions” that accompany Alice in Wonderland, as published by Adobe in its downloadable .pdf format. Alice in Wonderland, written by Lewis Carroll in 1865, has long since passed into the public domain.

Obviously some protection of copyrighted material will, and should be, built into code. But the power to control perfectly the use of copyrighted material should not. The key will be to find a balance. And when companies like Adobe clearly signal that their effort is to find a balance, they deserve the benefit of the doubt.

—Lawrence Lessig, The Industry Standard, March 27, 2001

LJ Index—June 2001

1. Development cost, in billions, of the Iridium satellite-based mobile phone system: 5
2. Sale price, in millions, of Iridium, after bankruptcy: 25
3. Number of Iridium satellites that would have been force-burned back to Earth if the system hadn't been sold: 60
4. Rejected sum, in billions, offered to record companies by Napster for allowing copyrighted works to be exchanged on the service: 1
5. Estimated cost of streaming 90 minutes of music to one listener: $81 US per day
6. Estimated delivery costs for the same on a peer-to-peer subscriber basis: $15 US per day
7. Percentage of time spent on-line “accounted for” by AOL-Time Warner: 32.7
8. Percentage of “at-home penetration” by AOL-Time Warner: 74.8
9. Page-view percentage devoted to the 1,000 most popular web sites in June 2000: 53
10. Page-view percentage devoted to the 1,000 most popular web sites in January 2001: 48
11. PDA sales, in millions, in 2000: 9.39
12. Projected PDA sales, in millions, in 2004: 33.7
13. Approximate percentage of global PDA sales Sharp hopes to capture with its new Linux-based PDAs: 50
14. Sharp's global sales goal in millions for the year ending 2002: 1
15. Number of Java-based programs Sharp hopes to see running on its Linux-based PDA by October 2002: 10,000
16. Sharp's estimate of the number of active programmers for the Linux PDA platform: 100,000
17. Sharp's estimate of the number of Microsoft PDA programmers: 50,000

Sources:

1-3: Hoovers
4-6: ZDNet
7-8: Media Metrix
9-10: Industry Standard from Alexa Internet, March 2001
11-12: Gartner Group
13-17: CNET

Seeing Red

At the NBA's first three-point shooting contest, Larry Bird looked at his opponents in the locker room and said, “Who's playing for second?” That's the main question when it comes to Linux distros. Red Hat has had the Larry Bird position for years now, and about all that's changed is who comes after #1.

Recently Evans Data Corporation of Santa Cruz, California, asked 300 Linux developers which distributions they would select for a web server or a web application server. The obvious answer was Red Hat. Coming in second were SuSE and Mandrake, each with 21.8%. As the chart shows, though, the question for developers really is “Who would you select in addition to Red Hat?” The average number of additional choices was 1.3 (for a total of 2.3 choices). Caldera, Debian and FreeBSD weren't far behind SuSE and Mandrake.

The survey's table of contents is on the Web at the Evans Data site (www.evansdata.com/Linux01TOC.htm).

They Said It

Between truth and the search for truth, I opt for the second.

—Bernard Berenson

If you tell the truth, you don't have to remember anything.

—Mark Twain

Honesty is the best policy. If you can fake that, you've got it made.

—George Burns

A closed mouth gathers no foot.

—Simon Murcott

Nearly all men can stand adversity, but if you want to test a man's character, give him power.

—Abraham Lincoln

The strategic goal here is getting Windows CE standards into every device we can. We don't have to make money over the next few years. We didn't make money on MS-DOS in its first release. If you can get into this market at $10, take it.

—Bill Gates

Do we have a way for people who host web sites on Linux to build on [.NET]? Yes, we do. That's not to say our overall strategy is not to get those web sites over to Windows, but we will provide a way for those Linux servers to use .NET.

—Steve Ballmer

I'm not one of those who think Bill Gates is the devil. I simply suspect that if Microsoft ever met up with the devil, it wouldn't need an interpreter.

—Nick Petreley

Storage is like eating. You can eat cheaper, but you can't not eat.

—Colin Ferenbach, on prospects for the storage firm EMC

It's a great year for entrepreneurs. The problem is that VCs haven't been investing in entrepreneurs, they've been investing in figureheads with no technology.

—Dave Winer

The only thing you can't do with open-source software is make monopoly profits.

—Jeremy Allison

The first thing that happened after we open sourced InterBase was customers wanted to know how much it cost. The most important new feature, after open sourcing, was the price tag.

—Ted Shelton

At least, thanks to open source, the technology doesn't die with the company.

—Deirdre Saoirse

Hey, for the price of a distribution, you can have a year of Linux Journal.

—Evil Bastard, on OpenSourceRadio

People who whine that Linux user groups exist to “help” people invariably use proprietary mailers.

—Rick Moen

For every traction there is an equal and opposite retraction.

—Doc Searls

What is wanted is not the will to believe, but the will to find out, which is the exact opposite.

—Bertrand Russell

Nobody can jump to confusions faster than the Linux community.

—Arne Flones

The right way to do things is not to try to persuade people you're right but to challenge them to think it through for themselves.

—Noam Chomsky

all your apt-get are
belong to us. dist-upgrade
now for great honour

—Debian Haiku by Marc Merlin

Think of It as an Ego Auction

So, how often do people search for your name on Google? To find out, we made an ad on Google's AdWords page, then asked Google to estimate how many times a month we would have to pay to run it when users searched for each of the following names:

Larry Augustin: 0

Chris DiBona: 0

Phil Hughes: 0

Rob Malda: 0

Don Marti: 4,000

Rick Moen: 0

Bruce Perens: 0

Eric Raymond: 4,000

Doc Searls: 0

Richard Stallman: 0

Linus Torvalds: 1,300

Richard Vernon: 0

Bob Young: 0

And, of course, we tried it for operating systems too:

Linux: 4,284,200

Windows: 5,653,800

UNIX: 872,900

There is no charge to get estimates. Try it yourself at http://adwords.google.com/.

Things Women Hear at Linux Events

I have a few questions about Linux in general. Is there somebody here that can answer a couple questions for me?

Are you here by yourself?

You use Linux? Well, good for you!

After you're done helping her, can you answer a question for me? (Spoken over the head of a woman volunteer at an installfest, to the person she's teaching to install Linux.)

We only have men's extra-large shirts...but here you go, you can wear it as a nightshirt.

We need some more coffee over here.

Are you in marketing?

—Don Marti
