Artificial Intelligence and Machine Learning

What Would the Chatbot Say?

Seeking to explain the unanticipated abilities of large language models.

Illustration: a unicorn on a blue and pink chevron background. Credit: Getty Images

Sebastien Bubeck is a senior principal research manager in the Machine Learning Foundations Group at Microsoft Research. Bubeck often generates stories about unicorns for his young daughter using a chatbot powered by GPT-4, the latest large language model (LLM) by OpenAI, which can produce complex text responses when prompted by a user. It made him wonder what the system thought a unicorn looked like, so he asked it to draw one in TikZ, a LaTeX-based language that creates vector graphics from primitives such as points, lines, and shapes. GPT-4 returned some code, which Bubeck then compiled.

The result was a primitive depiction of a unicorn.

“I (nearly) fell from my chair,” said Bubeck during a panel discussion at the Heidelberg Laureate Forum, a networking conference for math and computer science researchers that took place in Heidelberg, Germany, in September. “It looks shitty but that’s the whole point: it didn’t copy it from the Internet.”

The unicorn is an example of what many artificial intelligence (AI) researchers refer to as emergent behavior: unexpected abilities whose sources and mechanisms are hard to discern and which go beyond what LLMs have been explicitly trained to do. Despite having learned solely from text, for example, GPT-4 is able to cross modalities and form some sort of “mental image” of what a unicorn would look like. In other cases, it has been able to solve difficult tasks in fields such as math, coding, and medicine that cannot be handled by memorization alone and instead require combining skills and concepts from several domains.

Researchers are now trying to gain insight into how emergent abilities in LLMs come about. This behavior has been observed since the advent of transformer models, which predict the next word in a sentence or the answer to a question by learning patterns in language, such as grammar and syntax, from the text on which they are trained. They learn the strength of connections between words, for instance, such that the word ‘ear’ is more likely to be followed by the word ‘phone’ or ‘plug’ than by the word ‘happy’. Previous architectures, such as recurrent neural networks (RNNs), process the words in a sentence one at a time and are less effective at retaining context from earlier in a long passage.
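The idea of learning which words tend to follow which can be illustrated with a far simpler stand-in than a transformer. The sketch below is a plain bigram counter, not how GPT-4 or any transformer works internally; the function names and the four-sentence corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Convert the raw follow-counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpus, invented for this example only.
TOY_CORPUS = [
    "she lost her ear plug",
    "he bought an ear phone",
    "the ear phone broke",
    "a happy dog",
]

counts = train_bigram(TOY_CORPUS)
probs = next_word_probs(counts, "ear")
most_likely = max(probs, key=probs.get)
```

In this toy corpus, ‘phone’ and ‘plug’ receive nonzero probability after ‘ear’ while ‘happy’ receives none. A transformer learns a much richer version of such associations, conditioned on the whole preceding context rather than on a single word.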

Transformer models also are typically trained on vast quantities of data, which is often thought to be the reason for their improved performance and surprising behavior. “These large language models have been trained on billions and billions of documents, transcripts, pretty much anything they could find in the public (domain) on the Web,” says James Hendler, Tetherless World Professor of Computer, Web and Cognitive Sciences at Rensselaer Polytechnic Institute (RPI) in Troy, NY, and a member of the advisory board of the ACM Special Interest Group on Artificial Intelligence (ACM SIGAI). “That covered so much more than anyone really realized (and led to) what people are referring to as emergent properties.”

Other factors related to model size could help explain unexpected abilities. In recent work, Colin Raffel, now an associate professor at Canada’s University of Toronto and an associate research director at the Vector Institute, and his colleagues examined how emergent behavior relates to the amount of training computation and the number of parameters in a model (the numerical weights a model learns during training), both of which relate to model complexity. They compared how different models such as GPT-3 and PaLM, which varied in terms of these two factors, performed on various tasks such as solving a word-based math problem or reciting a famous quote but changing one of the words. They considered the ability to complete a task to be emergent if models performed at random-chance level below a certain complexity scale but well above chance beyond that scale. “There were a lot of tasks where emergence happened,” says Raffel. “(The paper) points out how widespread this phenomenon is.”
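The criterion described above can be written down as a simple check. This is a minimal sketch under stated assumptions, not the procedure used by Raffel and his colleagues: the function name, the `tolerance` and `margin` cutoffs, and all the accuracy figures are hypothetical.

```python
def is_emergent(results, chance, threshold_scale, tolerance=0.05, margin=0.20):
    """Check whether accuracy jumps from near-chance to well above chance.

    results: list of (scale, accuracy) pairs, where scale is any measure of
             model size or training compute (an assumption of this sketch).
    chance: accuracy expected from random guessing on the task.
    threshold_scale: the scale at which the jump is hypothesized to occur.
    tolerance, margin: hypothetical cutoffs for "near chance" and
             "well above chance".
    """
    below = [acc for scale, acc in results if scale < threshold_scale]
    above = [acc for scale, acc in results if scale >= threshold_scale]
    if not below or not above:
        return False  # need measurements on both sides of the threshold
    near_chance_below = all(abs(acc - chance) <= tolerance for acc in below)
    well_above_after = all(acc >= chance + margin for acc in above)
    return near_chance_below and well_above_after


# Invented numbers for a four-choice task, where chance accuracy is 0.25.
sharp_jump = [(1e20, 0.24), (1e21, 0.26), (1e22, 0.27), (1e23, 0.55), (1e24, 0.71)]
steady_gain = [(1e20, 0.30), (1e21, 0.40), (1e22, 0.50), (1e23, 0.60), (1e24, 0.70)]

sharp_is_emergent = is_emergent(sharp_jump, chance=0.25, threshold_scale=1e23)
steady_is_emergent = is_emergent(steady_gain, chance=0.25, threshold_scale=1e23)
```

Under these cutoffs the first series counts as emergent and the second, which improves steadily with scale, does not. The outcome depends on the chosen thresholds, which is one reason such classifications invite debate.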

Raffel and his colleagues acknowledge there are confounding factors. A model often is considered to have emergent abilities if it gives a correct answer to a prompt that smaller-scale models were not able to answer correctly. For example, if GPT-4 is asked to multiply two numbers with a large number of digits, its ability is judged by whether the product it generates is exactly right. However, a previous model may have been on the right track even if its final answer was incorrect, meaning there may actually be incremental improvements as scale increases that an all-or-nothing measure hides. “I don’t think that means that emergent abilities don’t exist,” says Raffel. “It basically just means that (in some cases) models are suddenly able to perform a task under the natural definition of performing the task.”
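The multiplication example shows how the choice of metric can make progress look sudden. The sketch below contrasts an all-or-nothing score with a partial-credit score; both metric functions and the three model outputs are hypothetical, chosen only to illustrate the point.

```python
def exact_match(predicted, a, b):
    """Return 1 only if the predicted string is exactly the product a * b."""
    return int(predicted == str(a * b))


def digit_accuracy(predicted, a, b):
    """Return the fraction of digit positions that match, aligned from the right."""
    target = str(a * b)
    width = max(len(target), len(predicted))
    # Pad with two different characters so padded positions never match.
    padded_pred = predicted.rjust(width, " ")
    padded_target = target.rjust(width, "_")
    matches = sum(p == t for p, t in zip(padded_pred, padded_target))
    return matches / width


A, B = 1234, 5678  # the true product is 7006652

# Invented outputs from three hypothetical models of increasing scale.
outputs = {"small": "7000000", "medium": "7006152", "large": "7006652"}

scores = {
    name: (exact_match(pred, A, B), digit_accuracy(pred, A, B))
    for name, pred in outputs.items()
}
```

Here exact match goes 0, 0, 1 across the three outputs, which looks like an abrupt jump, while digit accuracy rises from 3/7 to 6/7 to 1, which looks gradual.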

Prompting strategies, such as how a question is asked or how a task is described, also can play a role in the answers a model generates. If a smaller-scale model can’t perform a task and a larger one suddenly can, it may not be a sign of emergent behavior but rather an indication that the larger model better understands what it was asked to do. “We can’t really decouple the capabilities of a model with the model’s ability to understand a prompt,” says Raffel. “Maybe we will (be able to) in the future.”

However, the fact that current LLMs such as GPT-4 seem to be able to produce unexpected, novel creations could suggest that emergence is a result of some type of intelligence. In recent work, Bubeck and his colleagues investigated an early version of GPT-4 to see how it performed on various tasks, from drawing a unicorn to composing a short tune. They found that in all the tasks, and without requiring special prompting, its performance was similar to that of humans, and often much improved compared to prior models such as ChatGPT. “GPT-4 is a demonstration that some form of intelligence can be achieved when you train a huge neural network on a huge quantity of data,” says Bubeck.

Although LLMs may be able to respond to some prompts as well as humans, many researchers would argue that they only meet certain definitions of intelligence. “These systems are not intelligent in the way we refer to it as humans,” says Hendler. “They’re still limited in certain capabilities but more importantly, they have no intentionality and no objectives.”

There are still many aspects of LLM intelligence that need to be better understood. Bubeck would like to figure out exactly how huge these models need to be to demonstrate smart behavior, and similarly, what minimal requirements are needed for some semblance of intelligence to emerge. “To me, this is an era-defining question and I am focused on it more than ever,” says Bubeck. “I believe that we need to actually run experiments and try to build such minimal ingredients.”

Bubeck and his colleagues are among the many teams tackling this by building smaller-scale LLMs that perform similarly to larger ones yet require much less training time and cost. They have recently been developing a series of such models called phi that are being trained on high-quality synthetic data, such as information one might find in textbooks. “We recently open-sourced phi-1.5, which is a 1-billion-parameter model that displays many of the emergent capabilities we see in much bigger models,” says Bubeck. This model also outperformed some larger models on common sense and logical reasoning tasks.

As increasingly complex versions of LLMs are created, emergent abilities could also become more sophisticated and widespread. However, not everyone agrees.

Bill Gates, former CEO and chairman of Microsoft, recently said in an interview that he thought the abilities of such systems had peaked and that GPT-5 would not surpass GPT-4. Instead, he believes they will improve in terms of reliability and interpretability.

Hendler is of a similar opinion. “I don’t think we will see significant surprises,” he says. “I actually think that LLMs will mostly develop as specialized systems for particular problems and areas rather than generalizing further and further.”

Sandrine Ceurstemont is a freelance science writer based in London, U.K.

Communications of the ACM (CACM) is now a fully Open Access publication.