
Can ChatGPT Learn Chinese or Swahili?

Considering how large language models might act differently if trained in different languages.


The English-speaking world was rocked by the advent of ChatGPT in November 2022. Here, suddenly, was a chatbot that could do a credible imitation of a human being, producing text that seemed like it was written by a real person. People worried that ChatGPT, Google’s Bard, and the like would cause widespread cheating as students turned over their writing assignments to a machine, or that they would lead to the mass production of misinformation and propaganda, outstripping the abilities of Russian troll farms.

Those concerns arise in languages other than English as well. So far, however, chatbots based on Large Language Models (LLMs) appear to perform best in English, while sometimes struggling to mimic humans in other tongues. That may change as the models improve and are provided with more data, but for the time being, other languages, particularly those spoken in Asian countries, are more of a challenge.

Take the Japanese language. Wataru Zaitsu, a criminal psychologist at Mejiro University in Tokyo and a former police forensics specialist, studies authorship identification, which is used, for example, to link text messages to criminal suspects. Zaitsu wanted to know whether it was possible to identify ChatGPT-generated text in Japanese. He and Mingzhe Jin, a computer scientist at Japan’s Kyoto University of Advanced Science, selected 72 academic papers by 32 authors in Japanese psychology journals and extracted sentences from those papers, creating texts of approximately 1,000 words each. They then instructed ChatGPT-3.5 and ChatGPT-4 to each produce 72 texts of similar length based on the titles of those papers. They fed all of the texts to a classifier based on a random forest, a machine learning algorithm that combines many decision trees, each of which sorts through a hierarchy of possible outcomes to reach an answer.

The classifier was able to identify the machine-written texts with 100% accuracy, even when a human might not. “At first glance, it was very difficult for me to distinguish between them without machine learning, but by using random forests, it was very easy for us to distinguish them,” Zaitsu says.

The classifier relied on stylistic features that differed between the human and the AI output, such as the placement of commas and the rate of so-called function words (articles and prepositions, for instance), which mostly convey grammatical meaning. Classifiers often do better with longer pieces of text, and LLMs are constantly evolving, so that 100% accuracy figure may not hold and may not apply to other types of text.
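
To make the idea concrete, the sketch below builds a toy version of such a stylometric classifier using scikit-learn's random forest. It is not Zaitsu and Jin's actual pipeline; the feature set, the placeholder function-word list, and the tiny two-text corpus are illustrative assumptions only.

```python
# Toy stylometric classifier: a few hand-crafted style features fed to a
# random forest. Illustrative only -- not the features or data of the study.
from sklearn.ensemble import RandomForestClassifier

FUNCTION_WORDS = {"the", "a", "of", "in", "to", "and"}  # placeholder list

def stylometric_features(text: str) -> list[float]:
    """Turn a passage into coarse style features (rates per token)."""
    tokens = text.split()  # for Japanese, a morphological analyzer would be needed
    n = max(len(tokens), 1)
    comma_rate = text.count(",") / n
    function_word_rate = sum(t.lower() in FUNCTION_WORDS for t in tokens) / n
    mean_word_length = sum(len(t) for t in tokens) / n
    return [comma_rate, function_word_rate, mean_word_length]

# texts: ~1,000-word passages; labels: 0 = human-written, 1 = LLM-generated
texts = ["...human-written passage...", "...ChatGPT-generated passage..."]
labels = [0, 1]

X = [stylometric_features(t) for t in texts]
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X, labels)
print(clf.predict([stylometric_features("...new passage to test...")]))
```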

Zaitsu also tested his classifier against fake public comments, the kind bad actors might use to sow disinformation or stir up conflict. With zero-shot learning, in which the model is given no labeled examples of what it is looking for but receives information about the attributes its target should have, the classifier was 100% accurate, he says. With one-shot learning, based on a single labeled example that the model must learn to generalize from, accuracy reached 95%.
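
The difference between the two setups shows up in how the prompt is built. The sketch below contrasts a zero-shot prompt, which only describes the target's attributes, with a one-shot prompt that adds a single labeled example; the model name, the task wording, and the use of the OpenAI Python client are assumptions for illustration, not details of Zaitsu's experiments.

```python
# Zero-shot vs. one-shot prompting, illustrated with the OpenAI Python client.
# The model name and task wording are assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK = ("Decide whether the following public comment was written by a human "
        "or generated by an AI. Answer HUMAN or AI.")

# Zero-shot: no labeled example, only a description of what to look for.
ZERO_SHOT = TASK + "\n\nComment: {comment}\nAnswer:"

# One-shot: the same task plus one labeled example to generalize from.
ONE_SHOT = (TASK + "\n\n"
            "Comment: (a comment known to be AI-generated)\nAnswer: AI\n\n"
            "Comment: {comment}\nAnswer:")

def classify(template: str, comment: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative choice
        messages=[{"role": "user", "content": template.format(comment=comment)}],
    )
    return response.choices[0].message.content

print(classify(ZERO_SHOT, "This policy will ruin our town!"))
```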

A difference of styles

Part of the challenge for LLMs is that Japanese and other Asian languages have features with no equivalent in English. Zaitsu points out that, for instance, Japanese does not use spaces to separate words. It also mixes different writing systems: kanji, logographic characters adopted from Chinese that can represent whole words, alongside katakana and hiragana, whose characters represent syllables. This can make the process of creating tokens, the basic elements of a language that can be rearranged to form new text, tricky. “Languages with character sets distinct from the Latin alphabet introduce additional complexities related to tokenization, an essential pre-processing step for any LLM,” says Soroush Vosoughi, a computer scientist at Dartmouth College.
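
One way to see the gap in practice is to run roughly the same sentence in English and Japanese through a GPT-style tokenizer. The sketch below uses OpenAI's open-source tiktoken library; the encoding name and example sentences are illustrative choices, not drawn from the article.

```python
# Compare how a GPT-style BPE tokenizer splits English vs. Japanese text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4-era models

english = "Natural language processing is difficult."
japanese = "自然言語処理は難しい。"  # roughly the same sentence in Japanese

for text in (english, japanese):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{len(ids):2d} tokens: {pieces}")

# The Japanese sentence usually yields more tokens per character: there are no
# spaces to guide segmentation, and kanji/kana sequences are rarer in the data
# used to learn the tokenizer's merges.
```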

Many languages operate differently depending on the relationship, social status, or sex of the speakers. Jieun Kiaer, a professor of Korean linguistics at the U.K.’s University of Oxford and a native Korean speaker, tried out ChatGPT and Google Bard to see how they performed in Asian tongues. “It worked quite well,” she says. “It was not totally a disaster with Korean and Japanese.” However, the chatbots’ attempts to speak as if to a family member contained laughably obvious mistakes, she says.

She told the chatbot she was conversing with it as if it were her younger brother. Koreans typically address a younger sibling by his or her name, while a boy would call his older sister nuna. ChatGPT confidently decided instead that she was talking to an older brother, whom she would call oppa, while it addressed her by name. The endings of Korean words also vary depending on whether the speech is formal or informal. Despite being instructed to speak in one manner or the other, the chatbot repeatedly mixed up the two, she says.

She does not worry about students using ChatGPT to write essays that they then pass off as their own work. Its output contains hints, obvious to any native speaker, that the text was generated by a computer. “It’s to do with this subtle difference of naturalness, which you may not be able to pick up in English, but in Asian languages you can feel something not human-like,” she says. In central Asian languages, where training data is limited, the rate of obvious mistakes climbs, Kiaer says.

In fact, many limitations come down to the amount of data available to train LLMs. Thien Nguyen, a professor of computer science at the University of Oregon, looked at the CommonCrawl corpus, a selection of text scraped from the Internet and used in training LLMs, to get an estimate of how much of each language it contained. English was far and away the predominant tongue, making up nearly 46% of the CommonCrawl data. The next-largest language was Russian, at 6% of the data. Several other western European languages, plus Japanese, Chinese, and Vietnamese, rounded out the list of languages categorized as “high-resource,” those that made up at least 1% of the data. Two dozen other languages represented less than 1% each, with Assamese, spoken in northern India, at the bottom of the list.
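
A back-of-the-envelope version of that estimate looks something like the sketch below, which tallies each language's byte share in a corpus sample that has already been tagged by a language identifier. The record format is hypothetical, and the 1% "high-resource" cutoff follows the description above; this is not the methodology of Nguyen's paper.

```python
# Tally per-language byte shares in a language-tagged corpus sample.
from collections import Counter

records = [                        # hypothetical pre-tagged records
    {"lang": "en", "text": "..."},
    {"lang": "ru", "text": "..."},
    {"lang": "as", "text": "..."},  # Assamese, a low-resource language
    # ...millions more in a real crawl
]

bytes_per_lang = Counter()
for rec in records:
    bytes_per_lang[rec["lang"]] += len(rec["text"].encode("utf-8"))

total = sum(bytes_per_lang.values())
for lang, n in bytes_per_lang.most_common():
    share = 100 * n / total
    label = "high-resource" if share >= 1.0 else "low-resource"
    print(f"{lang}: {share:5.2f}% ({label})")
```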

Nguyen compared ChatGPT’s performance in English and other languages on a variety of tasks, including identifying parts of speech, recognizing named entities, answering questions, and common-sense reasoning. For identifying parts of speech, it performed as well in other languages as it did in English, or better. On the other tasks, however, it did notably worse in non-English languages.

More than raw data

Giving LLMs the ability to handle other languages is not merely a matter of somehow finding more training data, Nguyen says. “No matter what you do, you wouldn’t be able to match the data you have in English,” he says. Some repositories of data, such as journal articles, simply do not exist in some languages. “I don’t think you can find a very good source of scientific papers in some low-resource language, because even the people who work in those languages are going to write their findings in English.”

Fortunately, some of what LLMs learn about one language can be applied to other tongues, he says. The deep neural networks underlying the models devote separate layers to individual aspects of language. One layer may learn semantics—the meaning of words—while another discovers the syntax of how words fit together, and yet another learns how a collection of words constitutes a discourse. Models can transfer some of these features between languages, though Nguyen says the transfer probably works better between languages with more similarities in characters, words, and grammar, such as English and German, than between, say, English and Mandarin.
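
One way to glimpse that shared structure is to embed translations of the same sentence with a multilingual encoder: sentences from related and unrelated languages all land in one vector space. The sketch below uses the open-source sentence-transformers library with an off-the-shelf multilingual model; the model choice and sentences are illustrative assumptions, and the similarity scores only demonstrate the shared space, not Nguyen's analysis.

```python
# Translations of one sentence embedded into a shared multilingual space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice

english  = "The weather is nice today."
german   = "Das Wetter ist heute schön."  # related to English in vocabulary and grammar
mandarin = "今天天气很好。"                 # distant in script and grammar

emb = model.encode([english, german, mandarin])
print("en-de similarity:", float(util.cos_sim(emb[0], emb[1])))
print("en-zh similarity:", float(util.cos_sim(emb[0], emb[2])))
```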

To deal with smaller volumes of data, Naver, a South Korean Internet search company, focused on providing morphemes at the tokenization step. A morpheme is the smallest meaningful unit of a language, such as a word like “woman” or “happy” that cannot be broken down further without changing its meaning. “It’s not reflecting the complex language characteristics of Korean, but it doesn’t totally ignore the language characteristics,” says Jung-Woo Ha, head of AI innovation at Naver, speaking through a translator. Then, because only a small percentage of the CommonCrawl corpus is in Korean, the company added its 20-plus years of search engine data, including blog posts and its own dictionary and encyclopedia.
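
What providing morphemes at the tokenization step looks like can be illustrated with a public Korean morphological analyzer such as KoNLPy's Okt; this is an open-source stand-in for illustration, not Naver's internal tokenizer, and the example sentence is hypothetical.

```python
# Morpheme-level segmentation of a Korean sentence with KoNLPy's Okt analyzer.
# An open-source stand-in for illustration; not Naver's internal tokenizer.
from konlpy.tag import Okt

okt = Okt()
sentence = "자연어 처리는 어렵지만 재미있습니다."  # "NLP is hard but fun."

print(okt.morphs(sentence))  # list of morphemes, e.g. ['자연어', '처리', '는', ...]
print(okt.pos(sentence))     # the same morphemes paired with part-of-speech tags
```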

Combining search data and morphemes allowed Naver to create a trillion Korean tokens. To train the model in English, Naver simply used the much greater volume of English text. The resulting LLM, HyperCLOVA X, was released in August 2023 and does well in both languages, Ha says. Naver is also working on getting the model to produce text in Arabic, but it does not do as good a job in that language. While there are a lot of documents available in Arabic, Ha says, they have not been processed for use in pre-training.

Another effort, by the Mohamed bin Zayed University of Artificial Intelligence in the United Arab Emirates, collaborating with AI companies Inception in Abu Dhabi and Cerebras Systems in the U.S., trained an LLM using only English and Arabic. The result, called Jais, outperforms other models on Arabic and is comparable to other models in English, the group says.

The rise of LLMs has raised a number of concerns in English-speaking countries, such as the models’ tendency to hallucinate, insisting some statements are factual when in fact they are not. There also are worries about privacy, bias, and the accuracy of training data. Nguyen says many of those concerns are not yet on the radar in other countries, which are just focusing on creating working LLMs in the first place. “Many of these issues are not even considered yet in other languages, because you don’t have the model yet,” he says.

Nguyen is also concerned that, as they learn from studying English, the models also will incorporate a bias toward Western values and styles, potentially crowding out characteristics that make another country and its language unique. He believes that is an issue to which researchers should pay more attention. “In the long term, we don’t know how that will affect the cultures,” he says.

Further Reading

  • Zaitsu, W. and Jin, M., Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis, PLOS ONE, 2023, 10.1371/journal.pone.0288453
  • Lai, V.D., Ngo, N.T., Veyseh, A.P.B., Man, H., Dernoncourt, F., Bui, T., and Nguyen, T.H., ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning, arXiv:2304.05613 [cs.CL], 2023, 10.48550/arXiv.2304.05613
  • Kim, B., Kim, H., Lee, S.-W., et al., What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers, arXiv:2109.04650 [cs.CL], 2021, 10.48550/arXiv.2109.04650
  • Kiaer, J., Alongside AI: A Linguist’s Response to the Recent Release of ChatGPT, 2023, https://issuu.com/kiaerjieun/docs/alongside_ai1-2/s/18686108
  • Wang, W., Jiao, W., Huang, J., Dai, R., Huang, J., Tu, Z., and Lyu, M.R., Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models, 2023, https://arxiv.org/abs/2310.12481
  • All about Naver’s very own AI model ‘HyperCLOVA X’, 2023, https://www.youtube.com/watch?v=23PbH9zy2jc
