Software Helps Linguists Reconstruct, Decipher Ancient Languages

Linguists who once spent an entire career reconstructing a major language family now can accomplish that in just a few hours.

Linguists who reconstruct ancient languages—and who previously did the arduous work solely by hand—now have another tool in their arsenal to speed up their laborious efforts. Computer scientists have proven they can use software to recreate the early languages from which modern tongues have derived.

While reconstructing a major language family might once have taken a linguist an entire career, software can now complete a large experiment, one that may involve a sixth of the world’s languages, in just a few hours.

The achievement is not about speed, cautions Dan Klein, associate professor of computer science at the University of California, Berkeley. It’s about being able to do things in a large-scale, data-driven manner without losing all the important insights that historical linguists have gained in working on these sorts of problems for decades, he says.

Indeed, researchers compare these techniques to those used to study the evolution of gene sequences.

"This achievement should not be compared to, for example, Deep Blue, IBM’s chess-playing computer," Klein insists. "This is not a human-versus-machine story in which humans used to be better until finally a computer was able to take the crown. This is a story of computation giving human linguists new tools that supplement their weaknesses and let them work in new ways."

The efforts of Klein and his colleagues are described in their paper, "Automated Reconstruction of Ancient Languages Using Probabilistic Models of Sound Change," published in the Proceedings of the National Academy of Sciences in February.

According to Klein, the work’s main contribution was a new tool that researchers can use to automatically reconstruct the vocabularies of ancient languages using only their modern-language descendants.

The goal, he says, is not to just rewind the clock; rather, it is to better understand the processes that give rise to language change, and to model how the evolution of language proceeds. "And so we want to know things like what kind of sound changes are more likely and what kind of sound changes go together," he explains.

To test the system, the team applied it to 637 languages currently spoken in Asia and the Pacific, and recreated the early language from which they all descended. The result? More than 85% of the system’s reconstructions were within one character of the manual reconstruction provided by a linguist who specialized in Austronesian languages—and, of course, the differences are not necessarily errors.
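One common way to make a phrase like "within one character" precise is edit distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. The article does not say which measure the team used, so the sketch below assumes Levenshtein distance. The function names and the word pairs are hypothetical and serve only to illustrate the comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed metric)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def share_within(pairs, max_distance=1):
    """Fraction of (automatic, manual) pairs within max_distance edits."""
    close = sum(1 for auto, manual in pairs
                if edit_distance(auto, manual) <= max_distance)
    return close / len(pairs)


# Made-up word forms, not real reconstructions.
pairs = [("mata", "mata"), ("lima", "rima"), ("tolu", "telu"), ("api", "hapuy")]
print(share_within(pairs))
```

Three of the four made-up pairs differ by at most one character, so the score here is 0.75.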

How does the system work?

The way we produce words differs from the way our ancestors pronounced those same words. As time goes by, minute, ongoing alterations help turn an ancestral language like Latin into modern descendants like French, Italian, and Portuguese.

The sound changes are almost always regular, with similar words changing in similar ways, explains Klein, and so patterns are left. The trick is to identify those patterns of change and then to "reverse them," basically tracking the evolution of the language backward in time.
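The idea of regular sound change can be illustrated by counting which sounds line up across related words in two languages. The sketch below is a toy, not the probabilistic model from the paper: it assumes the word pairs are already known to be cognates of equal length, so letters can be compared position by position. The Māori and Hawaiian pairs are familiar textbook examples and are not taken from the article; the function name is hypothetical.

```python
from collections import Counter


def correspondences(cognates):
    """Count mismatched letter pairs across position-aligned cognates."""
    counts = Counter()
    for a, b in cognates:
        if len(a) != len(b):
            continue  # toy simplification: skip pairs that need real alignment
        for x, y in zip(a, b):
            if x != y:
                counts[(x, y)] += 1
    return counts


# (Maori, Hawaiian) pairs, used here purely as an illustration.
cognates = [
    ("tapu", "kapu"),
    ("tai", "kai"),
    ("mate", "make"),
    ("aroha", "aloha"),
    ("rima", "lima"),
]

counts = correspondences(cognates)
for (x, y), n in counts.most_common():
    print(f"{x} ~ {y}: {n}")
```

In this small sample, t lines up with k three times and r with l twice. A recurring correspondence of this kind is the sort of pattern that can then be traced backward.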

"Linguists have known this for a good 100 years or more, but it’s a hard and time-consuming process to do by hand," he says. "However, that is where computers shine."

Yet the use of computers and linguistic software is not limited to reconstructing ancient languages.

Ben Snyder, an assistant professor in the Department of Computer Sciences at the University of Wisconsin–Madison, uses his own software to decipher text, perhaps from a tablet, written in a long-dead language that may or may not be related to a living language. He then tries to reconstruct the dead language, making predictions about it for which he has no direct evidence.

In his most recent work, he developed a software program into which he is able to feed an unknown language not necessarily connected to any other language. The program is then able—in about 30 seconds—to "say something useful" about the language, says Snyder.

For instance, in a paper for this year’s Association for Computational Linguistics (ACL) Conference in Bulgaria entitled "Unsupervised Consonant-Vowel Prediction Over Hundreds Of Languages," Snyder describes how his program can tell, with 99% accuracy, which letters of a long-dead language are consonants and which are vowels. The paper also describes how the program can determine, with 89% accuracy, some of the qualities of the consonants; for example, which are probably nasal sounds and which are not.

"What I wanted to develop was a program based on machine-learning techniques that would examine hundreds of languages and, by doing so, build a universal model of linguistic plausibility," he explains. "What I’ve done so far is just the starting point; I hypothesize that, in time, we’ll be able to determine much more with the software."

Snyder says that, unlike the reconstruction program created by Dan Klein’s team, which supplements the manual work of linguists by making it simpler and more efficient, his decipherment software "goes way beyond what humans are able to do through manual analysis."

While Snyder does not see software replacing more-traditional manual linguistic methods completely, he suspects the field will undergo a shift toward greater use of computational methods, given the amount of data that is being accumulated.

"This is a really nice example of analyzing large amounts of data using novel algorithmic techniques," he says, "and a great example of computer science being the new handmaiden of the sciences, the way math once was."

Why would any linguistic analysis continue to be done by hand, when computational techniques seem to be so much faster and easier?

Kevin Knight believes that humans are simply better at finding new data and seeing the patterns in that data. Knight is a senior research scientist at USC/Information Sciences Institute.

"On the one hand, computers are much more thorough and much more patient than people are at searching for patterns," he says, "but they only look for what you tell them to look for. If the text uses some other method of encoding that you didn’t tell the computer about, it’s not going to find an answer. Humans, on the other hand, are much better at this kind of flexible pattern-matching and adapting."

For example, he says, there are multitudes of ways to write the letter ‘A’ in English, including capital, lower-case, cursive, and so on.

"I could show you 50 different ways and you would look at them and say, ‘yeah, that’s right, they are all A’s’—different from each other but recognizable as A’s," he says. "But while humans can do that naturally, it’s difficult to program computers to do it—although they are getting much better at it."

He predicts the decision whether to do linguistic work by hand or with software will depend on the specific issue under consideration, "although in much of our work, a joint human-computer team tends to be the best way to go."

For instance, three years ago, Knight teamed up with Ben Snyder and Regina Barzilay, an associate professor in MIT’s Computer Science and Artificial Intelligence Lab, to present the paper "A Statistical Model for Lost Language Decipherment" at ACL 2010. The paper demonstrated how some of the logic and intuition of human linguists can be successfully modeled, allowing computational tools to be used in the decipherment process.

"Through a painstaking trial-and-error process, scholars were able to decipher Ugaritic, a 3,000- to 5,000-year-old language from ancient Syria known almost only in the form of writings from ruins," says Knight. "It took them five years to do that by hand."

In their 2010 paper, Knight and his co-researchers described how it took them six months to develop a computational model to do the same task—and about an hour to run it and achieve results.

The team then evaluated their software by comparing those results to what the linguists had achieved by hand. They got 29 out of 30 letters correct, missing only one "rare letter." They also recovered 60% of the Ugaritic words that have Hebrew cognates, that is, words with a similar meaning that descend from a common ancestor and so have a similar pronunciation.

Snyder believes the project illustrates how the field of linguistics will undergo a shift toward greater use of computational methods, but those methods "will be guided by our knowledge of linguistics and what are the relevant features of language to look at. So, in a sense, the design of the algorithm still needs to be guided by human linguistic knowledge."

Knight concurs that the future of linguistics will definitely depend on software, and on how much language data can be assembled online and made available to computers.

"The big enabler will be getting all this data online, organizing the databases, and allowing computers to analyze it all," he says. "There’s a clear parallel here to DNA sequencing and biological data analysis. Computers have totally taken over in that area of biological classification and, I predict, they’ll totally take over in the area of linguistic reconstruction, for sure."

As for Dan Klein and his NSF-funded reconstruction efforts, he is preparing for the next steps, which include further scaling up the current models so that he and his team can reconstruct even further back than the 7,000 years they have reached so far. That is a matter of gathering more data, meaning larger collections of languages that are even more distantly related, and, at the same time, tweaking the software, since the further back you want to go, the better the models have to be.

For example, he would like to feed in all the languages of the world and then draw inferences about what their roots looked like.

"Obviously, because so much data is involved, it will require computation, not just manual work," says Klein. "But a historical linguist would observe that what we are attempting is not to automate what people have been doing by hand, because people are very good at the kind of research they do by hand. They would say that tools like ours give us a way to answer new kinds of questions that are impractical to answer by hand."

Further Reading

Bouchard-Côté, A., Hall, D., Griffiths, T., and Klein, D.
"Automated Reconstruction of Ancient Languages Using Probabilistic Models of Sound Change," March 12, 2013, Proceedings of the National Academy of Sciences of the United States of America.

Kim, Y. and Snyder, B.
"Unsupervised Consonant-Vowel Prediction Over Hundreds Of Languages," to be published at the summer 2013 Association for Computational Linguistics Conference.

Snyder, B., Barzilay, R., and Knight, K.
"A Statistical Model for Lost Language Decipherment," July 13, 2010, the 2010 Association for Computational Linguistics Conference.

Bouchard-Côté, A., Griffiths, T., and Klein, D.
"Improved Reconstruction of Protolanguage Word Forms," May 31, 2009, the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Hall, D. and Klein, D.
"Large-Scale Cognate Recovery," July 27, 2011, the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP ’11).

"Kevin Knight: Language Translation and Code-Building" (video), April 18, 2013.
