Bigger, Not Necessarily Better

When it comes to artificial intelligence, bigger is typically better. Large language models (LLMs) that power chatbots such as OpenAI’s ChatGPT and Google’s Bard generally are able to generate more sophisticated answers when prompted as they increase in size in terms of training compute, the amount of data they are trained on and the number of parameters, such as weights and variables, they contain. Large-scale models can also often solve difficult tasks in fields such as math and coding that are beyond what they have been trained to do.

However, researchers are also finding that the opposite can happen: LLMs can perform certain tasks less well as they become bigger, a phenomenon that has been called inverse scaling. “Unlike other problems where scale will address them, scale doesn’t solve the problem,” says Ameya Prabhu, a machine learning Ph.D. candidate at the University of Oxford’s Torr Vision Group in the U.K. “In fact, it makes the problem far worse.”

Researchers are now trying to better understand inverse scaling by identifying different examples and trying to home in on what might cause it. It is largely thought to be linked to how LLMs are trained and the fact that they are designed to predict the next word in a sequence of words. As language models are increasingly used to help with a variety of real-world tasks, improving our knowledge of what they are learning and where they break down can help with the development of mitigation strategies to improve their performance and make sure they are not causing harm.

“It is definitely important to understand, catalog, and discuss when and where these models can fail and how they can fail, not only in the abstract but also on specific tasks and with specific data,” says Nicholas Mattei, an assistant professor of computer science at Tulane University in New Orleans and the vice chair of the ACM Special Interest Group on Artificial Intelligence (ACM SIGAI).

One of the first tasks that was described as inverse scaling involves LLMs generating false statements based on learned misconceptions from training data, such as conspiracy theories. A team of researchers created a benchmark called TruthfulQA made up of over 800 questions from different categories that humans might not answer correctly due to false beliefs. They were able to show that larger language models generally produced fewer correct answers than smaller ones. and LLMs performed far less well than humans overall.

“It shows how language models are just repeating things they’ve been exposed to in the past and that might not be what we want them to do,” says James Michaelov, a graduate student in the Department of Cognitive Science at the University of California, San Diego. “We don’t want them to repeat misinformation.”

Tasks that demonstrate inverse scaling are not always easy to come across. In 2022, Prabhu and his colleagues wanted to further investigate the phenomenon but were finding it difficult to find examples of it. They launched a competition called the Inverse Scaling Prize, with up to $100,000 as a Grand Prize. Examples submitted would be evaluated on a range of criteria such as the importance of a task in terms of how much harm it could cause, and whether the effect was widely seen on different models. “We wanted to open it up to the community and get more input,” says Prabhu.

None of the submissions they received qualified for the Grand Prize or even a second prize, since they were not able to strongly demonstrate real-world implications for failing at a task. However, 11 entries were chosen as third-place prize-winners, and each was awarded $5,000.

From these tasks, Prabhu and his colleagues were able to identify four causes of inverse scaling. Some examples of the phenomenon seemed to be the result of a distractor task, where an incorrect answer was given because the LLM was responding to a similar task and not the actual one. LLMs also seemed to generate incorrect answers when they were picking up on spurious correlations when prompted with several examples. Unwanted imitation of training data, similar to what was found with the TruthfulQA benchmark, was also a cause identified, while a fourth cause seen was the inability to override patterns seen in the training data, even when prompted to do so.

“Language models can show significant behavioral differences when they are explicitly asked to do something (differently) from what you would expect them to do if they were just parroting the training data,” says Prabhu. “But they still can’t let go of some of the stronger biases.”

The results of the inverse scaling contest prompted other researchers to follow-up, leading to the discovery of U-shaped scaling, where worse performance seen on certain tasks seems to be reversed once a model reaches a certain larger size threshold. So far, many problems have been remedied by increasing model size, so it was suggested that scaling up a model even more could be a fix.

However, Prabhu cautions that enlargement does not always alleviate inverse scaling. “There are only some tasks where we saw U-shaped scaling,” he says.

Another team investigated whether language models demonstrate inverse scaling with quantifiers – words like ‘most’ and ‘few’ that drastically alter the meaning of a sentence, but which are used relatively infrequently. Previous work suggested large language models don’t take the meaning of these words into account in specific contexts. Michaelov and his colleague Benjamin K. Bergen wanted to further evaluate how LLMs deal with quantifiers and whether this changes with model size. An inverse scaling relationship had previously been reported for many tasks involving negation, for example.

The study tested a range of LLMs, such as GPT-2, GPT-3 and GPT-J, on more than 900 sentences involving different quantifiers and meanings; for example, comparing ‘few squirrels gather nuts’ with ‘few squirrels gather nails.’ Michaelov and Bergen looked at how sensitive LLMs were to the quantifiers and how plausible they considered different critical final words in similar sentence pairs.

The team found that all LLMs tested generally had trouble predicting credible final words when the word ‘few’ was used and their performance got worse as their size increased, demonstrating inverse scaling. Michaelov doesn’t think the models become less able to account for the quantifier itself, but rather that they get better at learning about relationships between critical words as they get bigger, for example that ‘squirrels’ and ‘nuts’ are strongly related. “It’s more that the kinds of things that they learn don’t necessarily align with the kinds of tasks that people expect or hope for them to do,” he says.

Inverse scaling suggests LLMs may not be as generalizable as they sometimes seem and that their output should not be immediately trusted. Bergen, a professor in the Department of Cognitive Science at the University of California in San Diego, says although later generations of LLMs are being trained on more data, it should not be assumed that their performance will improve or that they will have the same capabilities. “They are often being treated as foundation models and all sorts of applications get built off of them, from industrial applications to government applications,” Bergen says. “They are not stable in the same way as handcrafted systems that we’re used to in these contexts, so that poses its range of challenges.”

Michaelov is now following up by trying to gain a deeper understanding of the factors that drive an increase or decrease in performance in LLMs. He thinks that simply focusing on scale can be shortsighted, and a more specific understanding is needed of the role of different components, such as model parameters or training data. “There are more empirical questions (that need to be addressed) about what exactly gets better depending on how much larger the models are,” says Michaelov. “That’s the area I’ve become interested in.”

Sandrine Ceurstemont is a freelance science writer based in London, U.K.