We Need to Talk About Linguistic Diversity in AI

Considering the languages of AI: of the 7,117 living languages currently known, Apple's Siri supports 21, Amazon Alexa eight, and Google Home 13.

Our learned ability to use words to construct sentences that convey information, ideas, and emotions in an organized way makes us unique among animals. However, language has significance beyond communication. It is an expression of cultural identity, a demonstration of the existence of communities of peoples.

According to Ethnologue: Languages of the World, there are currently 7,117 known living languages. The survival of a language depends on many factors, not least the prevalence of its use in everyday life. Yet increasingly commonplace services, such as virtual personal assistants, predictive text, and speech recognition and machine translation tools, support only a fraction of global languages.

Apple's Siri, for example, supports 21 languages, Amazon Alexa eight, and Google Home 13. Google Translate supports 108 languages; five new ones (Kinyarwanda, Odia, Tatar, Turkmen, and Uyghur) were added in February 2020.

For language survival, artificial intelligence (AI) is both portentous and promising, says New York City-based Daniel Bögre Udell, cofounder of Wikitongues, a non-profit platform that supports language preservation through projects such as archiving and language revival toolkits.

Machine translation and predictive text are powerful tools, Udell says, but if your language is not supported, "It's yet another layer of social pressure to abandon your mother tongue for something else."

Speech recognition still struggles to understand different accents within a single language, he says. "We hear stories about people speaking Caribbean English not being able to use their iPhone or talk to Alexa."

Udell launched Wikitongues in 2013. What began as a YouTube channel is now a global network of over 1,000 contributors from around 100 countries. The platform's video archiving project has safeguarded nearly 1,000 oral histories from 500 languages.

"Languages are vehicles of cultural expression, so when a language dies out, a community has collapsed effectively," Udell says, pointing out that over 7,000 spoken languages and hundreds of sign languages are also excluded from speech and text-based technologies.

Predictive text is based on writing systems, and speech recognition on audio processing, neither of which works for sign languages.

Ultimately, for Udell, linguistic diversity in AI is an issue of social justice and the future of human knowledge. "No one should have to choose between globalization and their culture," he says.

The impact of inclusion

Unicode is a universal character-encoding standard. Google engineers participate in the committees that review proposals to add new alphabets to Unicode and write the open source code, International Components for Unicode (ICU), that is available to developers. "ICU is a key part of virtually every computer and mobile device on the planet," says Craig Cornelius, a senior software engineer at Google.

When an alphabet is added to Unicode, it profoundly impacts a community's access to technology and the prevalence of a language's everyday use. In 2019, for example, the alphabet of Wancho, a language spoken by around 59,000 people in northeastern India, was added to Unicode, and Google released a Noto Sans font for Wancho soon after.

"With font and keyboard input tools such as KeyMan, the community can now read and write their language on computers, communicate on social media, and develop Web content," he says. "Wancho is now a part of the Internet."

In 2013, Google launched google.com.mm in Burmese, a widely spoken but under-supported language in Myanmar. Burmese was added to Google Translate in 2014 and Gmail in 2015. The growth of mobile phone usage in Myanmar has been phenomenal, says Cornelius, adding, "The Burmese user interface helped many new users participate in the Internet. With Unicode's support for Burmese, search, social media, video, blogging, and many other services are now available."
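
As a rough illustration of what character encoding involves, the Python sketch below takes one Burmese letter and the first code point of the Wancho block and shows the code point and UTF-8 bytes for each. The sample characters and the helper name `describe` are choices made for this example and are not drawn from ICU or any Google code. Whether a name is available for the Wancho character depends on the Unicode version bundled with the Python interpreter, so the sketch falls back to a placeholder.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the code point, Unicode name, and UTF-8 bytes for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Older Python builds may predate the character, so supply a fallback.
        "name": unicodedata.name(ch, "(not in this Python's Unicode database)"),
        "utf8": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
    }


# Illustrative samples chosen for this sketch (assumptions, not from the article).
BURMESE_KA = "\u1000"        # MYANMAR LETTER KA
WANCHO_FIRST = "\U0001E2C0"  # first code point of the Wancho block

for sample in (BURMESE_KA, WANCHO_FIRST):
    print(describe(sample))
```

The Burmese letter needs three bytes in UTF-8 and the Wancho character four. Once a script has assigned code points, any software that handles Unicode text can store and transmit it, even before fonts and keyboards for the script are installed.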

The data-driven paradigm

Unicode supports linguistic diversity. However, AI-based services that use machine learning and automatic speech recognition (ASR) technologies, such as speech-to-text and translation systems, also need data. Google Translate, for example, is based on a Neural Machine Translation system (NMT), and such systems require vast amounts of training data to learn.

According to the Google Translate team, two of the biggest factors affecting whether a language is added to the platform are the availability of data and engagement from the Translate Community. Access to data in English, Spanish, Arabic, and German poses few challenges, but it may take a long time until sufficient data is available to add a sparsely used language.

This data-driven paradigm is the root of the linguistic diversity challenge, says Khalid Choukri of the European Language Resources Association (ELRA), a not-for-profit organization created to promote language resources and evaluation for the Human Language Technology sector, in a European context. "Everything is based on machine learning from data, and for lesser-resourced languages, we don't have enough data to train our tools."
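
To make the dependence on data concrete, here is a deliberately tiny Python sketch of a predictive-text model that suggests the next word purely from counts in its training sentences. It is far simpler than a neural machine translation system and is not how Google Translate or any product named here works; the function names, the toy corpus, and the unseen word are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical toy corpus; real systems train on vastly more text.
toy_corpus = [
    "good morning to you",
    "good morning everyone",
    "good night to you",
]
model = train_bigrams(toy_corpus)

print(predict_next(model, "good"))       # a word the model has seen
print(predict_next(model, "mingalaba"))  # a word with no training data
```

A word that appears in the corpus gets a suggestion, while a word from a language with no training text gets none. That is the gap Choukri describes, in miniature.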

It is no small problem, Choukri says. There are multiple languages in use in some African countries, China, and India that have enormous numbers of speakers. "I'm talking about languages with a few million speakers that don't have access to these technologies."

According to Choukri, industry, academia, and policymakers all have a role to play in shaping solutions. Research breakthroughs that enable language datasets to be processed using fewer resources are urgently required, he says.

Udell also points to the data. He believes a solution may be found in grassroots efforts via a more accessible contribution pipeline. "There's no reason that you couldn't crowdsource the data collection. That would not be particularly expensive for these companies but will go a long way to making their technology more widely available."

Crowdsourced data can be helpful in refining language products and in building user interfaces, says Cornelius. "However, one of the best motivators for increased support for a language is when the use of the language increases online [via user-generated content on blogs, YouTube, and websites] to the point where it becomes worthwhile to develop more tools and services in that language."

Increased availability and improved processing of training data, collaborations between academia and industry, crowdsourcing, and community engagement will be key to improving linguistic diversity in AI. While Cornelius thinks it is unlikely that any organization will ever support all languages in full, "It is possible that every person will find at least one of her/his languages available on technology," he says.

Karen Emslie is a location-independent freelance journalist and essayist.