Driven by advanced techniques in machine learning, commercial systems for automated language translation now nearly match the performance of human linguists, and far more efficiently. Google Translate supports 105 languages, from Afrikaans to Zulu, and in addition to printed text it can translate speech, handwriting, and the text found on websites and in images.
The methods for doing those things are clever, but the key enabler lies in the huge annotated databases of writings in the various language pairs. A translation from French to English succeeds because the algorithms were trained on millions of actual translation examples. The expectation is that every word or phrase that comes into the system, with its associated rules and patterns of language structure, will have been seen and translated before.
Now researchers have developed a method that, in some cases, can automatically translate extinct languages, those for which these big parallel data sets do not exist. Jiaming Luo and Regina Barzilay at the Massachusetts Institute of Technology (MIT) and Yuan Cao at Google were able to automate the “decipherment” of Linear B—a Greek language predecessor dating to 1450 B.C.—into modern Greek. Previous translations of Linear B to Greek were only possible manually, at great effort, by language and subject-matter experts. The same automated methods were also able to translate Ugaritic, an extinct Semitic language, into Hebrew.
How It Works
In a recent paper, “Neural Decipherment via Minimum-Cost Flow,” the three authors describe a two-step process. The first step, operating at the character level, uses a conventional neural network to predict the correct character in each word in the decipherment, based on prior knowledge of the patterns that tend to match across the two languages. The second step uses a linear program to minimize deviations of the derived vocabulary from the previously manually translated vocabulary. The two steps iterate back and forth and together attempt to translate words that are “cognates,” that is, words with the same derivation.
“These two parts are complementary to each other,” says Luo, a Ph.D. student at MIT. “The linear program provides a global perspective that looks at the entire derived vocabulary, and it can utilize the information that is not readily available for a conventional local neural net. On the other hand, neural nets are very good at extracting local, character-level patterns that are harder to formulate using a linear program.”
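The “global perspective” Luo describes can be illustrated with a toy sketch. It is not the authors' implementation: plain edit distance stands in for the neural network's character-level score, a brute-force search over one-to-one matchings stands in for the flow solver, and the function names and word lists are invented for illustration.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, used here as a stand-in local cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # keep or substitute
        prev = curr
    return prev[-1]


def best_assignment(lost_words, known_words):
    """Try every one-to-one matching and keep the cheapest one overall.

    Brute force is only feasible for tiny vocabularies; it is an
    assumption of this sketch, not how a real solver would work.
    """
    best_cost, best_pairs = None, None
    for perm in permutations(known_words, len(lost_words)):
        cost = sum(edit_distance(w, k) for w, k in zip(lost_words, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(lost_words, perm))
    return best_cost, best_pairs


# Made-up strings, not real Ugaritic or Hebrew.
lost = ["ab", "abc"]
known = ["abc", "abd"]
cost, pairs = best_assignment(lost, known)
print(cost, pairs)  # 1 [('ab', 'abd'), ('abc', 'abc')]
```

Looked at on its own, “ab” is equally close to “abc” and “abd.” Only when the whole vocabulary is considered at once does it become clear that “abc” should be reserved for its exact match, which is the kind of information a purely local model cannot see.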
The lack of parallel data for training, and the scarcity of ancient texts, make decipherment the “ultimate low-resource challenge for both humans and machines,” the researchers say in their paper. A typical manual decipherment “spans decades and requires encyclopedic domain knowledge, prohibitive manual effort, and sheer luck.”
Taylor Berg-Kirkpatrick, an assistant professor in the department of computer science and engineering at the University of California, San Diego, was not involved in this work, but has done similar research in unsupervised translation by machine learning. As he explains, the three scientists writing the paper “explicitly pose the problem of cognate matching as a combinatorial optimization problem, and conceptually divide the problem into alphabet- and cognate-matching. What their paper does that’s quite new is make use of high-capacity neural nets within this framework. It works quite well.”
The three scientists employ “a smart way of biasing the model to be simple, while letting it still be flexible,” Berg-Kirkpatrick says. “It doesn’t reorder characters; it just looks at the input left-to-right and changes the characters to the new alphabet, possibly deleting or inserting a character here and there. It’s a sort of fuzzy alignment, not a concrete alignment.”
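The constraint Berg-Kirkpatrick describes (read left to right, never reorder, allow an occasional insertion or deletion) is the same one behind classic edit-distance alignment. The sketch below uses that textbook dynamic program as a stand-in; the paper's model is neural and learns its character mappings, so this shows only the shape of the constraint. The function name and the sample strings are hypothetical.

```python
def align(source: str, target: str):
    """Return a left-to-right list of (operation, source_char, target_char)."""
    n, m = len(source), len(target)
    # cost[i][j] = cheapest way to turn source[:i] into target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end to recover one cheapest sequence of operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            kind = "keep" if source[i - 1] == target[j - 1] else "substitute"
            ops.append((kind, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# Toy strings only; these are not Linear B or Greek words.
print(align("abc", "axc"))
# [('keep', 'a', 'a'), ('substitute', 'b', 'x'), ('keep', 'c', 'c')]
```

Because every operation consumes characters in order, the output can never swap two characters, which keeps the search space small. The alignment here is hard (each character gets exactly one operation), whereas the “fuzzy” alignment in the quote suggests a model that spreads probability over several possibilities.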
Automated translation without extensive training on parallel datasets generally requires developers to know in advance how two languages are related, with similar alphabets, structures, and patterns. These patterns are predictable and tend to repeat, and they can be matched across languages that are from the same family. For example, English verbs often assume the past tense with “-ed” added, while German verbs do that with the “ge-” prefix. “The stronger the prior knowledge—the inductive bias—that you put into the algorithm, the less data you will need,” says Cao, a research engineer at Google.
The researchers knew from earlier manual translations that Linear B and Greek have many cognates. While these cognates have the same origin, they have changed in slightly different ways over the years. The neural network-linear program system could use this knowledge to iterate through successively more accurate decipherments, says Regina Barzilay, a professor of computer science and artificial intelligence at MIT. The method could translate between Linear B and Greek, but not directly from Linear B to French because those languages are too dissimilar, she says.
The algorithms were able to decipher Linear B with 67% success, meaning two-thirds of the cognate pairs were translated correctly. The other third was translated incorrectly, as judged against the earlier, manually created dictionary; non-cognates were not considered. “We need to find another way to solve that piece; words that don’t have a common origin or that are not in the manually created database,” says MIT’s Luo.
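A score of this kind can be computed as in the minimal sketch below, assuming the system's output and the manually created dictionary are both available as simple word-to-word mappings. The function name and the placeholder entries are invented for illustration and are not the researchers' data.

```python
def cognate_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold cognate pairs that the system matched correctly.

    Words missing from the gold dictionary (non-cognates) are ignored,
    mirroring the evaluation described in the article.
    """
    correct = sum(1 for word, answer in gold.items()
                  if predicted.get(word) == answer)
    return correct / len(gold)


# Placeholder entries standing in for lost-language and known-language words.
gold = {"w1": "a", "w2": "b", "w3": "c"}
predicted = {"w1": "a", "w2": "b", "w3": "x", "w4": "z"}  # w4 is not in gold
print(round(cognate_accuracy(predicted, gold), 2))  # 0.67
```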
Luo says commercial systems like Google Translate work at the semantic level, converting entire sentences from one language to another in a way that tries to preserve their meaning. However, “decipherment” does only character-mapping and word-matching of cognates and does not get at the overall meaning of a block of text.
Application to Other Languages
In a sense, the translations of Linear B and Ugaritic by neural network were only proofs of concept—albeit important ones—as those languages were already translated. The next step—which computer scientists say will be much more difficult but not impossible—will be to try the ideas on so-far-undeciphered languages.
Barzilay says earlier manual attempts to decipher Linear B failed because researchers didn’t think Linear B was related to Greek. They struggled for years to translate it into other languages and did not succeed until 1953, when the British architect and linguist Michael Ventris tried Greek. “So, finding the right related language is really crucial,” Barzilay says.
The three colleagues are now working to extend their methods to other extinct languages; for example, Iberian. However, Iberian, which was used by the indigenous people of the Iberian Peninsula (present-day home of Spain and Portugal) more than two millennia ago, has never been deciphered by any means, and presents several difficulties. One is that some of the existing texts lie in large monolithic blocks of characters, making it difficult to identify discrete words.
Even more problematic is that there is no agreement as to what other known language(s) might share a common origin with Iberian. There is debate, indeed controversy, among archaeological linguists as to whether the ancient Iberian language is related to Basque, to Aquitanian (a precursor of the Basque language) or to some other extinct language. “We believe that by automating the available data, we can shed light objectively on the subject to bring some understanding to what happened, and to European history,” Luo says.
In the meantime, there is more work that could be done with Linear B and Ugaritic, Cao says. The methods described in the recent paper only model the “surface form” (individual letters, not whole words at once) of the input text so as to find cognates. “What we didn’t do is consider the semantics of words, the context of words,” he says. “Similar words tend to occur in similar contexts, so that’s a big piece we have to add to the algorithm. And what about non-cognates, or phrases? These are much harder, but not impossible. We are working on things like that.”
Does society really need to know more about languages that have not been spoken for centuries? “You might as well ask what’s the point of doing research on archaeology,” Cao says. “This is sort of archaeology for languages.” At a minimum, these advanced machine learning and artificial intelligence techniques will be a big help to scholars who previously lavished huge efforts on manual translations, he says.
“Their paper serves as a new demonstration that constraint-based methods and neural methods can work together to solve unsupervised problems,” Berg-Kirkpatrick says. “I hope this will serve to reinterest the NLP [natural language processing] community in historical decipherment problems.”
Neural decipherment via minimum-cost flow may have applications even beyond language translation. For example, the concepts might be applied to DNA sequence-alignment, in which biologists try to find small matching segments of DNA, called “motifs,” on the same or similar strands of DNA. That problem resides in a broad class of applications called “correspondence induction.”
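To make “matching segments” concrete, the sketch below finds the longest stretch shared exactly by two short DNA strings. It is a deliberately simplified illustration: real motif finding tolerates mismatches and uses statistical models, and nothing here comes from the decipherment paper. The function name and the sequences are made up.

```python
def longest_shared_segment(a: str, b: str) -> str:
    """Longest run of characters that appears contiguously in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the shared run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Invented sequences for illustration.
print(longest_shared_segment("ACGTTGCA", "TTACGTAG"))  # ACGT
```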
Asgari, E. and Schütze, H.
Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, April 2017 https://www.aclweb.org/anthology/D17-1011/
Berg-Kirkpatrick, T. and Klein, D.
Simple Effective Decipherment via Combinatorial Optimization, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, July 2011 https://www.aclweb.org/anthology/D11-1029/
Duh, K.
Bayesian Analysis in Natural Language Processing, Computational Linguistics, Vol. 44, Issue 1, March 2018, p.187–189 http://bit.ly/2Vwn2Rn
Luo, J., Cao, Y., and Barzilay, R.
Neural Decipherment via Minimum-Cost Flow: from Ugaritic to Linear B, eprint arXiv:1906.06718, June 2019, https://arxiv.org/pdf/1906.06718.pdf
Robinson, A.
Lost Languages: The Enigma of the World’s Undeciphered Scripts, Thames & Hudson, reprint edition, https://amzn.to/2B1LAZc
Snyder, B. and Barzilay, R.
Unsupervised Multilingual Learning for Morphological Segmentation, Proceedings of ACL-08: HLT, Assoc. for Computational Linguistics, June 2008, https://www.aclweb.org/anthology/P08-1084/