Indian Language Solutions for GNU/Linux

Some of the world's most widely spoken languages are the hardest to support on a computer. Here's how support is coming together for Hindi, Malayalam and other languages of the Indian subcontinent.

South Asia, home to nearly one-sixth of humanity, is struggling to attain regional language solutions that would make computing accessible to everyone. Even though most of its people are poor and have little purchasing power, such solutions could open the floodgates to greater computing power and much-needed efficiency in a critical area of the globe. However, some call Indic and other South Asian scripts the final challenge for full i18n support.

Some Indian regional languages are larger than those spoken by whole countries elsewhere. Hindi, with 366 million speakers, is second only to Mandarin Chinese. Telugu has 69 million; Marathi, 68 million; and Tamil, 66 million. Sixteen of the top 70 global languages are Indian languages with more than 10 million speakers. Other languages spoken in India are also spoken elsewhere. Bengali has 207 million speakers in India and Bangladesh, and Urdu has 60 million in Pakistan and India.

Simputer Offers Text-to-Speech

The Simputer is a simple and relatively inexpensive Linux computer for people in Indian villages. The creation of the Simputer is being organized with a hardware license, the Simputer General Public License, modeled on the GPL. Although the license provides for free publication of specifications, it does require a one-time royalty payment before licensees sell Simputers.

The Simputer features a 200MHz StrongARM processor, 32MB of DRAM, 24MB of Flash storage, a monochrome display, speaker and microphone.

dhvani is a text-to-speech system for Indian languages developed by the Simputer Trust developers and others. Its developers promise a better phonetic engine, a Java port and a language-independent framework soon. (See sourceforge.net/projects/dhvani.) Meanwhile, IMLI is a browser created by the Simputer Trust for the IML markup language. It is designed for easy creation of Indian language content and is integrated with the text-to-speech engine.

I18n Frameworks Already in Place

In Kerala, a southern state with an impressive 90% literacy rate whose language Malayalam is spoken by 35 million people, senior local government official Ajay Kumar (kumarajay1111@yahoo.com) is leading an initiative to make GNU/Linux Malayalam-friendly: “We propose to develop a renderer for our language. Specifically, we are looking for a renderer for Pango (the generic engine used with the GTK toolkit).”

He adds that in nine months' time, “we want to create an atmosphere where language computing in Malayalam improves.” He also says, “We are confident that once we deliver the basic framework, others will start localizing more applications in Malayalam.”

At the toolkit level, GTK and Qt are the most used. GTK already has a good framework through the Pango Project and has basic support for Indian languages. Qt also now has Unicode support for all languages, but rendering is not yet ready.

International efforts also are helping India. Yudit, the free Unicode text editor, now offers support for three South Indian languages: Malayalam, Kannada and Telugu. Delhi-based GNU/Linux veteran Raj Mathur commented, “The current version of Yudit has complete support for Malayalam and other Indic languages. It can also use OpenType layout tables of Malayalam fonts. I think Yudit is the first application that can use OpenType tables for Malayalam.”

K Ratheesh was a student at the Indian Institute of Technology-Madras (in the South Indian city of Chennai) when he worked on enabling the GNU/Linux console for local languages a couple of years ago. He said:

As the [then] current PSF format didn't support variable width fonts, I have made a patch in the console driver so that it will load a user-defined multiglyph mapping table so that multiple glyphs can be displayed for a single character code. All editing operations also will be taken care of.

In Indian languages, there are various consonant/vowel modifiers that result in complex character clusters. “So I have extended the patch to load user-defined, context-sensitive parse rules for glyphs and character codes as well. Again, all editing operations will behave according to the parse rule specifications”, Ratheesh commented.

Ratheesh also said, “Even though the patch has been developed keeping Indian languages in mind, I feel it will be applicable to many other languages (such as Chinese) that require wider fonts on console or user-defined parsing at I/O level.”
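The idea behind the multiglyph mapping table and the context-sensitive parse rules can be illustrated with a small Python sketch. This is not Ratheesh's patch, which lives in the kernel console driver; the table contents, glyph numbers and function names below are hypothetical and chosen only to show the technique: one character code may expand to several glyph cells, a sequence of codes may be replaced by a cluster of glyphs, and editing operations act on whole clusters.

```python
# Illustrative sketch only: all codes, glyph indices and names are hypothetical.

# Multiglyph mapping table: one character code -> one or more glyph cells.
GLYPH_TABLE = {
    0x41: [10],      # narrow character: a single glyph cell
    0x42: [20, 21],  # wide character: two adjacent glyph cells
    0x43: [30],      # a modifier drawn on its own when it stands alone
}

# Context-sensitive parse rules: a sequence of codes -> a glyph cluster.
PARSE_RULES = {
    (0x41, 0x43): [40, 41],  # consonant + modifier rendered as one cluster
}

FALLBACK_GLYPH = [0]  # shown for codes missing from the table


def render(codes):
    """Split character codes into (codes, glyphs) clusters, longest rule first."""
    clusters = []
    longest = max((len(key) for key in PARSE_RULES), default=1)
    i = 0
    while i < len(codes):
        for n in range(longest, 1, -1):
            key = tuple(codes[i:i + n])
            if key in PARSE_RULES:
                clusters.append((key, PARSE_RULES[key]))
                i += n
                break
        else:
            # No rule matched here: fall back to the per-character table.
            code = codes[i]
            clusters.append(((code,), GLYPH_TABLE.get(code, FALLBACK_GLYPH)))
            i += 1
    return clusters


def backspace(clusters):
    """Delete the last cluster as a unit, so editing follows the parse rules."""
    return clusters[:-1]


def width(clusters):
    """Number of console cells the clusters occupy."""
    return sum(len(glyphs) for _, glyphs in clusters)


line = render([0x41, 0x43, 0x42])
print(line, width(line))
```

In this sketch a backspace removes a whole cluster rather than a single glyph cell, which is one way to read the statement that editing operations behave according to the parse rule specifications.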

The package, containing the patch, some documentation, utilities and sample files, then weighed in at around 100KB.

______________________

Comments

Are there any Malayalam OCRs around?

aneeskA

Hi all,

I want to know if any OCRs exist for Malayalam. I see that the last activity on this post was in 2005. Since almost 4 years have passed, I am confident that something must have come along by now.

regards
-- anees k A

Malayalam language solutions

Santhosh Thottingal

The Swathanthra Malayalam Computing project (https://savannah.nongnu.org/projects/smc/) is working on various Malayalam language projects. SMC is developing a Malayalam text-to-speech system based on the dhvani TTS and an OCR reader named Akshara. SMC is also trying to solve the font-related problems of Malayalam.

The major achievements are

Anonymous

Further to the crossposting from the tamilinix group and some excellent work reported at http://www.tamillinux.org, it seems that the achievements for the Tamil language are totally ignored in this report. This casts aspersions on the journalist's integrity. How else can one read:

International efforts also are helping India. Yudit, the free Unicode text editor, now offers support for three South Indian languages: Malayalam, Kannada and Telugu

Yet the previous posting says Tamil was supported in Yudit much earlier. Why leave out only Tamil among the four South Indian languages?

Re: The major achievements are

Anonymous

Though my reply is very late: this report is absurd even with respect to the advancements made at the time of the report itself.

This report has completely excluded advancements in Tamil Linux development, which is considered even today the foremost among all Indian languages.

And about Unicode, his arguments are completely wrong. When the whole world is moving towards Unicode, this person is simply misleading.

Re: The major achievements are

Anonymous

See, Tamil is not simply an Indian language; it is an official language in other countries also. It is spoken worldwide. Tamil has a great future in computing (maybe the best for computing). Another great advantage of Tamil is that its speakers have the spirit of Tamil; they use Tamil everywhere, unlike other South Indian peoples. So Tamil can't be neglected by anyone. It is a more mature language than any other South Indian language. I really like to hear Tamil talks, music etc. (I am from South India, not a Tamil speaker; I can read and speak a little bit.)

Re: Indian Language Solutions for GNU/Linux

Anonymous

It is unfortunate that Noronha has overlooked the Acharya project that has been going on for more than a decade at IIT, Madras.

  • It deals with a large number of languages, Indian and otherwise.
  • It provides extensive APIs for various programming languages and is available for Linux, Windows, and, in some cases, other platforms.
  • It features a text to speech module for the visually impaired.
  • Best of all, the software is free and is in use by a number of individuals and organizations.

URL: http://acharya.iitm.ac.in

Thanks.

Ajit Natarajan

I am sceptical about this

Anonymous

I am from Kerala, and Malayalam is my mother tongue. I am also very proficient in the language. I have the utmost respect for anyone associated with all this work. But let me tell you something very frankly. I don't think people WANT Malayalam versions of software very much. Even though their mother tongue is Malayalam, they are not particular about using Malayalam. Most of them (especially the young guys who use computers) know English, and they like English MORE THAN Malayalam (believe it or not!). Most of the kids going to school go to ENGLISH-medium schools, and many of them can't even read or write Malayalam properly. I doubt whether they would ever need or use any piece of software in Malayalam. Even at the government level, most documents are handled in English, and people are quite comfortable with it (or at least happy with it and don't want any change). So I am sceptical about the success of these efforts. I wish I weren't, and I am not in a position to extend this opinion to other languages. But at least with Malayalam, this is the case.

Re: I am sceptical about this

Anonymous

Not me.

I think I understand your point of view, and Malayalam is my mother tongue too. The works reported here have great potential aside from making Malayalam versions of software applications.

Let us imagine how things would be if we could not print anything in Malayalam... Now let us imagine how we might display a message in Malayalam, say 10 or 100 years from now...

We will be using different types of media in the years to come, and the potential for electronic applications is great. One of those multimedia applications might let you read Malayalam text on an electronic display device. There will be applications that will read Malayalam to an interested listener. Imagine how that might change the life of a blind person. Maybe it would be for someone who is too tired to read, but willing to sit back and listen.

And what if there was an efficient way to produce captions for a (Malayalam) movie, both in English and Malayalam? Advances are being made in the area of Natural Languages and we can expect lots more.

And imagine how text messages may be exchanged in Malayalam over a span of great distance...

Aside from these applications, consider this work as a different, perhaps an efficient, representation scheme for Malayalam fonts. And I would say, this is just one small, but important, step in the right direction.

Go for it!

-----
Aside: Now where is that English - Malayalam dictionary? I want to see Malayalam in Malayalam. Can I also hear the correct pronunciation of the words? Thank you.

Malayalam's limitations

Sunil kumar T K

Very true.

Malayalam as a language has to overcome lots of stumbling blocks on the way to its destiny.

Malayalam pronunciation is definitely very difficult to imitate.

An efficient representation scheme for Malayalam fonts should definitely be the first step ahead.

Re: I am sceptical about this

Anonymous

And yes,

Everyone reads English newspapers daily, watches English movies and talks English at home, and you are chatting with parents back home in English over Yahoo Messenger.

People learn English here so that they are better suited for finding jobs outside Kerala, as you are, and not because they don't like the language. Malayalam has adopted a lot of words from Sanskrit and English because it is a living language and, unlike Tamil, a pretty young one. You can trace 95% of the words in Malayalam to English, Sanskrit or Tamil.

Even though some people like you do not need Malayalam, there are many others who would like to send email in Malayalam, to see their government orders on the web in Malayalam or to chat in Malayalam.

Re: I am sceptical about this

Anonymous

These efforts may not be for the people you described but for the common man, the majority of whom do not know English. I'm sure you know that you need not be a direct user of Malayalam applications to appreciate the benefit; it can come in the form of a printed government form in Malayalam etc., from which even computer-illiterate people who know Malayalam can benefit.

So when making comments like these, please do think a bit more.

Re: Indian Language Solutions for GNU/Linux

ramv

"Microsoft's Windows XP has Indian language support based on the current Unicode version (3.x) and hence suffers from all the problems of Unicode-based solutions: inability to represent all the characters of some Indian languages and awkwardness in text processing."

What does this mean? Which characters cannot be represented in Unicode? What kind of awkwardness in text processing?

I have been working with Unicode since 1999 and have yet to see an Indic character that cannot be represented.

"Satish Babu (sb@inapp.com), a free software enthusiast and vice president of InApp, an Indo-US software company dealing with free and open-source solutions, points to other problems, such as collation (sorting) order confusion (oftentimes, there is no unique 'natural' collation order, and one has to be adopted through standardization)."

Yeah. Please read the Unicode Collation Algorithm and the standard. Or see the open-source implementation of the Unicode Standard at http://oss.software.ibm.com/icu/

"Besides, others point out, fonts are another mess altogether. Most of the current implementations rely on glyph locations to display and store information. For instance, to represent the letter 'a', what is stored is the position of 'a' in some particular font used by that package. This is different from normal English where the ASCII standard specifies that to represent 'a' the number 65 must be used. No such standard exists for Indian languages, and thus one document written in one application cannot be opened in another. This is also the reason why authors of Indian web pages must specify particular fonts."

This isn't clear or correct. If you are talking about standards analogous to ASCII for Indic languages, there are ISCII and Unicode. For Tamil there are TSCII, TAM, TAB, etc.

What you are talking about is font hacking: take the ASCII code page and use its code points to represent something else in the font. The web page authors have to specify fonts because they use a hacked ASCII encoding to represent Indic languages. Use Unicode, for heaven's sake.

If there are no free fonts for Indic languages, then we need volunteers to read the Unicode Standard and create OpenType fonts, not hacked ASCII fonts.
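The difference between font hacking and a real encoding can be shown with a short Python sketch. The two "hacked font" tables below are hypothetical and exist only for illustration; the Unicode code points and character names come from Python's standard unicodedata module.

```python
import unicodedata

# The word "Malayalam" in Malayalam script, written as Unicode code points.
word = "\u0D2E\u0D32\u0D2F\u0D3E\u0D33\u0D02"


def describe(text):
    """Return (code point, official Unicode character name) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(word):
    print(code_point, name)

# Hypothetical font-hacked encodings: the document stores plain ASCII letters,
# and each font draws something different at those positions.
FONT_A = {"a": "\u0D2E", "b": "\u0D32"}  # draws Malayalam letters MA, LA
FONT_B = {"a": "\u0B85", "b": "\u0B86"}  # draws Tamil letters A, AA


def decode_hacked(stored, font_table):
    """Show what the reader sees when 'stored' is displayed with a given font."""
    return "".join(font_table[ch] for ch in stored)


stored = "ab"  # what a font-hacked document actually contains
print(decode_hacked(stored, FONT_A) == decode_hacked(stored, FONT_B))
```

With Unicode, each character carries its own identity regardless of the font. With a hacked font, the stored text "ab" has no meaning until a specific font is chosen, which is why such pages must name their fonts.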

Re: Indian Language Solutions for GNU/Linux

Anonymous

> Yeah. Please read the Unicode Collation Algorithm and the standard.
> Or see the open-source implementation of Unicode Standard
> http://oss.software.ibm.com/icu/

If you use the natural character order in the encoding (as in ASCII for Latin), there's no need for a complex collation algorithm.

Re: Indian Language Solutions for GNU/Linux

Anonymous

Visit http://groups.yahoo.com/group/tamilinix/ for the latest developments in Linux in Tamil. Incidentally, Tamil is the first Indic language to be supported 100% out of the box in the latest Mandrake 9.0. See the screenshots of Tamil in Linux here:

http://groups.yahoo.com/group/tamilinix/files/misc/screenshots/Mandrake/

Here is a follow-up on this from the tamilinix mailing list.

From: "Vasee Vaseeharan"
Date: Mon Oct 21, 2002 3:39 am
Subject: Re: Linux Journal covers Indian language Linux

Yes, I am very concerned that Tamil Linux seems to have been ignored completely in that article. Had Noronha bothered to investigate Indic Linux properly, especially in some of the areas that he writes about in the article, he would have found that:

1. Tamil is the first (and only) Indic language to be supported and is still being actively supported in the KDE desktop environment. A glance at the translation statistics page at http://i18n.kde.org/stats/gui/ will bear this out.

2. Tamil was also the first Indic language to be fully supported in Yudit (in Jan. 2002). Support for other Indic languages was added after this. I was personally involved in this work with Gaspar Sinai, the author of Yudit. Refer to these postings on Tamilinix:

http://groups.yahoo.com/group/tamilinix/message/891
http://groups.yahoo.com/group/tamilinix/message/1026

3. Tamil is the first Indic language to have Pango support, due to the work of Sivraj and Vikaram Subramanian (in 2001). None of the other Indian languages had any support till Eric Mader ported the IBM ICU code to Pango earlier this year. Pango CVS comments and gtk-i18n mailing list archives have proof of this fact.

4. A major Linux distribution, Mandrake, officially supports Tamil from version 9.0. (Again, for the umpteenth time, the first Indian language to be supported.)

If Noronha was indeed aware of these facts, as Mani alluded to, I wonder what the Tamil Linux community has done or failed to do to deserve such glaring omissions.

-Vasee

-- posted by Prabu

Re: Indian Language Solutions for GNU/Linux

Anonymous

I do agree with you that the author should have done some more research.

But your post sounds more like a linguistically fanatic column than a critique.

BTW, how many millions of Tamils use Tamil-Linux? Technology can do more than what people can accept; my only suggestion is to look at the acceptance rate rather than the inception date.

Tamil-Linux is a failure, and so are many Indic Linux efforts. The reason: an Indian who operates a computer seriously has 10+ years of English education.

Re: Indian Language Solutions for GNU/Linux

Terry

There are probably over 380 million people who speak English as their native language, if you count the USA, the UK, Australia and English-speaking Canada.

Just a comment in response to the article stating there are only 366 million Hindi speakers. I naively assumed that the majority of India would speak the same language.

Re: english speakers and hindi speakers

Anonymous

Actually, the total number of Hindi speakers is approximately 400 million, since Hindi is also spoken in a few East African nations, in Mauritius and by immigrant Indians in the Western world.

More appropriate numbers can be found on the net.

Now you can understand why

Anonymous

If you've read all the messages, you'll immediately recognize "the crab mentality" of people, which is pulling the whole community down.

I agree..What Indian's miss i

Pavan

I agree. What Indians miss is a "can-do attitude".
I hope and pray there can someday be an attitude change in the Indian psyche.
India does not need any money; what it needs is an attitude change.
