The complexity and diversity of human languages make automated translation one of the hardest problems in computer science. Yet the job is becoming more important as writing and speech are increasingly digitized and as the traditional separations between societies dissolve.
Few parts of the globe have as much need to translate from one language to another as does India. According to India's 2001 census, the country has 122 languages, 22 of which are designated as official languages by the government. The top six (Hindi, Bengali, Telugu, Marathi, Tamil, and Urdu) are spoken by 850 million people worldwide.
Now a decades-long effort by researchers is about to bear fruit. A multipart machine translation architecture, Sampark, is nearing completion as the combined effort of 11 institutions led by the Language Technologies Research Center at the International Institute of Information Technology in Hyderabad (IIIT-H).
Sampark combines traditional rules- and dictionary-based algorithms with statistical machine learning, and will be rolled out to the public at http://sampark.iiit.ac.in/. By this month, systems for 12 out of 18 language pairs (nine languages) will be online and available for experimentation, with six more to follow soon after.
Many Indian languages are derived from Sanskrit, which is based on rules set down by Panini, the 4th century B.C. grammarian. Even those Indian languages that are not derived from Sanskrit are structurally similar to others in India. This common underpinning makes the translation from one Indian language to another easier than from, say, German to Chinese. Nevertheless, there are 462 pair-wise translations (counting each direction for a pair) possible among the 22 official Indian languages, so clearly the researchers had to find a generalized approach that could be easily adapted from one language to another.
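The figure of 462 follows from counting every ordered (source, target) pair among 22 languages. A minimal sketch of that arithmetic, in which the variable names are illustrative only:

```python
from itertools import permutations

OFFICIAL_LANGUAGE_COUNT = 22  # figure quoted in the article

# Each ordered (source, target) pair is a separate translation direction,
# so Hindi-to-Telugu and Telugu-to-Hindi are counted separately.
languages = range(OFFICIAL_LANGUAGE_COUNT)
directed_pairs = sum(1 for _ in permutations(languages, 2))

print(directed_pairs)  # 462, i.e. 22 * 21
```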
The chosen method, a transfer-based approach, consists of three major parts: analyze, transfer, and generate. First, the source sentence is analyzed; then the results are transferred in a standard format to a set of modules that turn them into the target language. Each step consists of multiple translation "modules."
An advantage of the three-step approach, says Rajeev Sangal, director of the Language Technologies Research Center, is that a particular language analyzer, one for Telugu, for example, can be developed once, independent of other languages, and then paired with generators in various other languages, such as Hindi.
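The reuse Sangal describes can be sketched as follows. This is a toy illustration and not Sampark's code: the analyzers, generators, and transfer step are hypothetical stand-ins that only pass tokens through, and the point is simply that one analyzer per language and one generator per language can be combined to cover every direction.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word, placeholder tag); a hypothetical representation
Analyzer = Callable[[str], List[Token]]
Generator = Callable[[List[Token]], str]


def toy_analyzer(lang: str) -> Analyzer:
    """Build a stand-in analyzer that only splits on whitespace."""
    def analyze(sentence: str) -> List[Token]:
        return [(word, f"{lang}:TOKEN") for word in sentence.split()]
    return analyze


def toy_generator(lang: str) -> Generator:
    """Build a stand-in generator that labels its output with the target language."""
    def generate(tokens: List[Token]) -> str:
        return f"[{lang}] " + " ".join(word for word, _ in tokens)
    return generate


LANGS = ["telugu", "hindi", "tamil"]
ANALYZERS: Dict[str, Analyzer] = {lang: toy_analyzer(lang) for lang in LANGS}
GENERATORS: Dict[str, Generator] = {lang: toy_generator(lang) for lang in LANGS}


def transfer(tokens: List[Token], source: str, target: str) -> List[Token]:
    # Stand-in only: a real transfer step would map words and structure.
    return list(tokens)


def translate(sentence: str, source: str, target: str) -> str:
    analysis = ANALYZERS[source](sentence)            # analyze
    transferred = transfer(analysis, source, target)  # transfer
    return GENERATORS[target](transferred)            # generate


directions = [(s, t) for s in LANGS for t in LANGS if s != t]
print(len(ANALYZERS) + len(GENERATORS), "components cover", len(directions), "directions")
print(translate("w1 w2", "telugu", "hindi"))
```

In this sketch the Telugu analyzer is written once and then paired with either the Hindi or the Tamil generator, which is the property the three-step design is meant to provide.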
The 13 major translation modules together form a hybrid system that combines rules-based approaches, where grammar and usage conventions are codified, with statistical methods in which the software in essence discovers its own rules through "training" on text tagged in various ways by human language experts.
Translation systems for major languages today (from companies like Google and Microsoft, for example) often use statistical approaches based on parallel corpora, huge databases of corresponding sentences in two languages. These systems use probability and statistics to learn by example which translation of a word or phrase is most likely correct. And they move directly from source language to target language with no intermediate transfer step.
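As a rough illustration of learning "by example," the sketch below counts how often each target phrase appears opposite a source phrase in a tiny set of aligned pairs and picks the most frequent one. It is a deliberately simplified toy, not the method used by any particular company, and the aligned pairs are placeholder strings rather than real data.

```python
from collections import Counter, defaultdict

# Hypothetical aligned phrase pairs standing in for a parallel corpus.
aligned_pairs = [
    ("src_word", "tgt_a"),
    ("src_word", "tgt_a"),
    ("src_word", "tgt_b"),
    ("other_word", "tgt_c"),
]

# Count how often each target phrase is seen for each source phrase.
counts = defaultdict(Counter)
for source_phrase, target_phrase in aligned_pairs:
    counts[source_phrase][target_phrase] += 1


def most_likely_translation(source_phrase):
    """Return (target, relative frequency), or None if the phrase was never seen."""
    candidates = counts.get(source_phrase)
    if not candidates:
        return None
    target, seen = candidates.most_common(1)[0]
    return target, seen / sum(candidates.values())


print(most_likely_translation("src_word"))
print(most_likely_translation("unseen_word"))
```

The lookup for an unseen phrase returns nothing at all, which hints at why such systems depend so heavily on the size of the corpus.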
"The statistical direct translation approach is, in a sense, the lazy man's approach, because all it requires is that you go and hunt for parallel corpora and you turn the crank and you get what you get," says Srinivas Bangalore, a speech and language processing specialist at AT&T Research in Florham Park, NJ. "But the transfer-based approach is much more linguistically motivated, because you are trying to analyze the sentence and trying to arrive at something that is close to a representation of its meaning."
Parallel corpora are specialized databases consisting of sentences very carefully translated and then mapped one-for-one to their translations. Moreover, to do a good job of training translation systems, the parallel corpora must be very large, in the billions of sentences. "People are coming to grips with the fact that parallel data are not easy to come by," Bangalore says. "This is a very specialized kind of data."
Indeed, parallel corpora for many Indian language pairs do not exist and cannot easily be built, in part because not much Indian language text has been digitized. Nevertheless, developers at the Language Technologies Research Center were able to apply statistical machine learning in a limited way by annotating small monolingual corpora and analyzing the tagged text with statistical techniques, Sangal says.
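One simple way to learn from a small annotated monolingual corpus is a most-frequent-tag baseline: for each word, remember the tag that annotators assigned most often. The sketch below assumes a made-up English corpus and generic tag names; it is not the tag set or the technique the Sampark team used, only an example of statistics applied to tagged text.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated corpus: each sentence is a list of (word, tag) pairs.
ANNOTATED_CORPUS = [
    [("the", "DET"), ("river", "NOUN"), ("flows", "VERB")],
    [("the", "DET"), ("guide", "NOUN"), ("walks", "VERB")],
    [("tourists", "NOUN"), ("guide", "VERB"), ("the", "DET"), ("guide", "NOUN")],
]


def train_tagger(corpus):
    """Map each word to the tag it carries most often in the corpus."""
    tag_counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag_sentence(model, words, fallback="NOUN"):
    """Tag known words from the model; unknown words get the fallback tag."""
    return [(word, model.get(word, fallback)) for word in words]


model = train_tagger(ANNOTATED_CORPUS)
print(tag_sentence(model, ["the", "guide", "flows"]))
```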
So although machine learning techniques were employed in some of the modules, developers painstakingly developed multilanguage dictionaries and codified rules in the Computational Paninian Grammar framework. They also held workshops of experts in all these languages to develop a standard tag set, and then used those tags to annotate the monolingual corpora.
"Most machine translation is not inspiration, it's perspiration," Bangalore says. "The hard part is building all the resources required, like dictionaries, morphological analyzers, parsers, and generators. It's a lot of grunt work."
Sangal says the effort that Sampark developers put into language analysis could have a broad impact beyond translating Indian languages. He says that even the best purely statistical systems can be made more accurate by first doing the types of detailed language analysis employed in Sampark. "What one can do in the future is to first do monolingual analysis of one or both sides in paralleled corpora, and then use that to improve the quality of machine learning from the parallel corpora," he says. "So what we have done would also be useful if larger parallel corpora became available tomorrow."
Another advantage of the transfer approach, says AT&T's Bangalore, is its generalizability. "If you give me a parallel corpus dealing with financial news, and I train it up with millions of sentences of that sort, and two days later you say, 'Translate a sports article,' it's not going to perform as well."
But that kind of application domain change has been explicitly anticipated by Sampark's developers. The first version, rolling out now language-by-language, is general purpose and optimized for tourism-related uses, but it will be made available to large users who wish to customize it for other domains, says Dipti Sharma, an associate professor at IIIT-H. That would involve building a new domain dictionary, incorporating rules that handle domain-specific grammatical structures, and perhaps retraining some modules such as the part-of-speech tagger and the named entity recognizer.
The effort required to make those changes is minimized by building on the existing multilingual dictionary, Sharma says. It is sense- or meaning-based, so that for one domain or language, "bank," for example, would most likely represent a financial institution, but for another it might refer to the edge of a river. The dictionary currently allows translation among nine languages.
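A sense-based lookup of the kind Sharma describes might be sketched like this. The domain names, sense identifiers, and default are assumptions made for illustration; the article does not describe the dictionary's actual structure.

```python
# Hypothetical sense inventory: word -> {domain -> sense identifier}.
SENSES_BY_DOMAIN = {
    "bank": {
        "finance": "bank.financial_institution",
        "geography": "bank.river_edge",
    },
}

# Hypothetical fallback sense used when the domain has no specific entry.
DEFAULT_SENSE = {"bank": "bank.financial_institution"}


def resolve_sense(word, domain):
    """Pick the sense preferred for this domain, else the word's default sense."""
    domain_senses = SENSES_BY_DOMAIN.get(word, {})
    return domain_senses.get(domain, DEFAULT_SENSE.get(word))


print(resolve_sense("bank", "finance"))    # bank.financial_institution
print(resolve_sense("bank", "geography"))  # bank.river_edge
print(resolve_sense("bank", "sports"))     # bank.financial_institution
```

Under this layout, adapting to a new domain means adding entries for that domain rather than rebuilding the dictionary.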
Sangal says the language-translation system has two especially noteworthy attributes. First, the linguistic analysis based on Panini is "extremely good," he says. "It was initially chosen for Indian languages, but we find it is also suitable for other languages." Hard work is needed up front, he says, to set it up: developing standards for part-of-speech tags and dependency tree labels, and figuring out ways to handle unique language constructs.
The second attribute of special note is the system's software architecture. It is an open architecture in which all modules produce output in Shakti Standard Format (SSF). The architecture allows modules written in different programming languages to be plugged in. Readability of SSF helps in development and debugging because the input and output of any module can be easily seen. Also, a dashboard tool supports the architecture in a variety of ways. Custom written, it is "extremely robust," Sangal says. "If a module fails to perform a proper analysis, the next module will still work, albeit in a degraded mode. So the system never gives up; it always tries to produce something."
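The degraded-mode behavior Sangal describes can be illustrated with a small pipeline in which every module reads and writes the same record and a failing module is skipped rather than stopping the run. The record here is a plain Python dictionary standing in for a shared format; it is not SSF, and the module names are hypothetical.

```python
def tokenize(record):
    record["tokens"] = record["text"].split()
    return record


def failing_tagger(record):
    # Simulates a module that cannot complete its analysis.
    raise ValueError("no model loaded")


def generate(record):
    tokens = record.get("tokens", [record["text"]])
    tags = record.get("tags")
    if tags is None:
        # Degraded mode: no tags available, so emit the tokens alone.
        record["output"] = " ".join(tokens)
    else:
        record["output"] = " ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags))
    return record


def run_pipeline(modules, record):
    """Run modules in order; on failure, keep the last good record and continue."""
    for name, module in modules:
        try:
            record = module(dict(record))  # pass a copy so a failure leaves record intact
        except Exception as exc:
            record.setdefault("warnings", []).append(f"{name} failed: {exc}")
    return record


result = run_pipeline(
    [("tokenizer", tokenize), ("tagger", failing_tagger), ("generator", generate)],
    {"text": "a b c"},
)
print(result["output"])    # a b c
print(result["warnings"])  # ['tagger failed: no model loaded']
```

Because each module's input and output are ordinary, readable records, the state after any step can be printed and inspected, which mirrors the debugging benefit attributed to a readable common format.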
©2010 ACM 0001-0782/10/0100 $10.00