The lightning-like growth of large language models (LLMs) has taken the world by storm. Generative artificial intelligence (AI) is radically reshaping business, education, government, academia and other parts of society. Yet, for all the remarkable capabilities these systems deliver—and they are clearly impressive—a major question emerges: how can data scientists measure model performance and fully understand how they gain abilities and skills?
It is far from an abstract question. Constructing high-functioning AI models hinges on critical metrics and benchmarks. These criteria, in turn, require an understanding of what constitutes correctness. As data scientists drill down into models, they soon recognize that the choice of metrics, and the key performance indicators they plug into a model, influences outcomes. This affects everything from real-world reliability to the amount of energy and resources required to construct an LLM.
That’s where a concept called emergence enters the picture.a In LLMs, certain skills and capabilities appear, or dramatically improve, only in larger-scale models. This process does not follow a predictable trend line. Knowing where the threshold for emergence lies is key to building better models and allocating time, energy, and resources efficiently.
There’s a catch, however. How data scientists interpret model accuracy may determine whether emergence occurs, or how and when it occurs.
“How people measure and interpret results has a significant impact on AI tooling and training,” said Sanmi Koyejo, an assistant professor of computer science at Stanford University. Recently, he and a pair of Ph.D. students embarked on a mission to better understand the somewhat-cryptic but critical factors that define emergence and effective scaling. They wanted to know whether spikes in performance are real, or whether the measurement system creates the appearance of emergence.
The research appeared in a 2023 paper titled Are Emergent Abilities of Large Language Models a Mirage?b “It’s essential to build models that behave in predictable ways and understand why, when and where we’re hitting critical mass,” added Rylan Schaeffer, who collaborated with Koyejo on the research.
Metrics Matter
It is a widely accepted concept in the artificial intelligence space: more data leads to better models. There’s plenty of evidence to support this contention. A 2022 study dubbed BIG-bench revealed a surprising finding: both GPT-3 and LaMDA, two leading LLMs, struggled with basic arithmetic at smaller parameter counts.c Yet when GPT-3 hit 13 billion parameters, it could suddenly solve addition problems accurately. LaMDA demonstrated a similar breakthrough at 68 billion parameters.
This “emergent” capability occurs in several key areas: arithmetic problems, word dexterity, language translations, logical and analogical reasoning, and so-called zero-shot and few-shot learning. The latter refers to a model’s ability to handle tasks with few or no task-specific examples; smaller LLMs typically must be fine-tuned for such tasks, while larger models handle them on their own. For example, GPT-3 demonstrated an ability to solve a wide array of problems with little or no task-specific training.d
This suggests there is a critical mass of parameters required for LLMs to grasp fundamental mathematical concepts. Yet this sudden jump, emergence, remains mysterious and somewhat unpredictable. At times, advances within models take place in steady, anticipated ways; at other moments, abilities and skills suddenly leap forward for no explicable reason other than that the model has reached a certain number of parameters.
Understanding why emergence occurs, if it occurs at all, is part of a broader desire to shine a light inside the black box of LLMs. Despite remarkable performance advances over the last couple of years, little is known about how systems “learn,” connect words and concepts, and arrive at an answer. “Assessing the intelligence and actual abilities within systems is difficult,” said Melanie Mitchell, a professor at the Santa Fe Institute. For example, she said, “LLMs can now pass a bar exam, but they would fail at practicing law. High performance on benchmarks and real-world results are entirely different things.”
Nevertheless, understanding whether emergence is real or an artifact that results from specific measurement methods is a crucial piece of the overall puzzle. Data scientists typically rely on a straightforward method to gauge accuracy: is the information correct or not? In many instances, they assign one point for a correct answer and zero points for an incorrect answer. “On the surface, it often seems like a simple determination, but once you dive into a model, you discover things can become incredibly complicated. How you measure things determines what results you obtain,” Schaeffer said.
Consider: if you ask a group of basketball players to shoot 100 three-point shots and track each player’s results, it’s possible to rank every player by an exact percentage. However, if you alter the measurement method, say by grouping players into two categories based on whether they made 90% of their shots, perhaps one player reaches the benchmark while the other 99 fail. “Yet it’s possible that all the rest shot just under 90%. That would indicate a 1% success rate when the average score was around 88%,” Schaeffer explained.
Change the measurement criteria and you change the results. For example, if you add 100 more players and three of them clear the 90% threshold by hitting just one more shot each, the number of players meeting the cutoff jumps from one to four, an apparent 300% improvement (much like emergence in an LLM), even though each of those players’ shooting percentages ticked up by only a single point.
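To make the analogy concrete, here is a minimal Python sketch of the same effect. The player counts, skill levels, and 90% cutoff are invented for illustration; they are not drawn from the researchers’ work.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def summarize(skill: float, n_players: int = 200, n_shots: int = 100):
    """Return (mean shooting percentage, share of players clearing a 90% cutoff)."""
    makes = rng.binomial(n_shots, skill, size=n_players)
    pct = makes / n_shots
    return pct.mean(), (pct >= 0.90).mean()

# A small, smooth improvement in underlying skill...
for skill in (0.80, 0.84, 0.88, 0.92):
    mean_pct, pass_rate = summarize(skill)
    print(f"skill={skill:.2f}  mean accuracy={mean_pct:.3f}  share clearing 90%={pass_rate:.3f}")
# ...nudges the mean accuracy up gradually, while the pass/fail metric
# (share of players clearing 90%) starts near zero and then jumps,
# much like an apparent "emergent" ability under an all-or-nothing score.
```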
“A sharp increase may simply be an artifact of the measurement system that’s being used,” Mitchell said. “It may appear there’s a sharp spike when the real outcome is smoother and more predictable.”
When the Stanford team drilled into 29 different metrics commonly used to evaluate model performance, they found that 25 of them demonstrated no emergent properties. With the use of more refined metrics, a continuous, linear growth pattern emerged as the model grew larger. Even the other four metrics had explanations for emergence, Schaeffer said. “They’re all sharp, deforming, non-continuous metrics. So, if an error occurs because one digit is wrong, it causes the same outcome as if a billion digits are wrong.”
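The arithmetic case shows why a sharp metric behaves this way. If a model’s chance of getting any single digit right improves smoothly with scale (the accuracy values and answer length below are hypothetical), an exact-match score that demands every digit be correct can still look like a sudden leap:

```python
import numpy as np

# Hypothetical per-digit accuracies that improve smoothly as a model scales up.
per_digit_acc = np.linspace(0.80, 0.99, 7)
answer_length = 30  # number of digits that must all be right for an exact match

for p in per_digit_acc:
    exact_match = p ** answer_length  # one wrong digit counts the same as thirty wrong digits
    print(f"per-digit accuracy={p:.3f}  exact-match accuracy={exact_match:.3f}")
# The continuous per-digit metric climbs steadily; the discontinuous
# exact-match metric sits near zero and then rises sharply, which can
# read as "emergence" even though the underlying ability improves smoothly.
```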
Partial Credit
All of this is relevant because software engineers and data scientists lack unlimited resources to train and build LLMs. Depending on the specific model, design, and purpose, it often is necessary to condense or round off data to conserve time and resources, given the high cost of GPU compute.
There also are considerations for how the model works in the physical world. How an LLM behaves and what it does can impact economic decisions, public policies, safety, and how autonomous vehicles and other machines act and react to real-world situations and events.
A simplistic “pass or fail” approach to LLMs doesn’t cut it, the researchers argue. “Not allowing for partial credit and not building this information into the benchmarking framework may lead to misleading and problematic results that can undermine AI,” Koyejo said. “Emergence shouldn’t drive the way we make decisions or design things.”
Added Brando Miranda, a Stanford University Ph.D. student who also served on the research team and helped author the paper, “There’s a need to develop methods that promote greater consistency and better predictability.”
In other words, data scientists might have to rethink the fundamental definition of success and introduce more precise metrics and measurement systems. Criteria and metrics should depend less on whether a model did the exact right thing and more on how close it comes to the ground truth or desired result, the researchers argue. An all-or-nothing approach may function well for an arithmetic equation such as 1 + 1 = 2, or when an LLM produces a direct word translation, such as “gato” to “cat” from Spanish to English.
The real world of AI is far more complicated, however. For example, what happens if an AI model gets 99% of an algebraic equation right but misses a single variable or coefficient? What about an LLM that generates an excellent summary of a document, but with a single factual error? Expanding the measurement criteria to account for how well the model predicted both right and wrong things changes the equation completely, Miranda noted. If an LLM is producing language translations, it isn’t only about getting the specific words right; it’s about the overall quality of the translation and how accurately it conveys the intended message.
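A rough sketch of what partial credit might look like for a translation appears below. The sentences and the similarity measure are arbitrary stand-ins; a real evaluation might use edit distance, BLEU, or human judgment instead.

```python
from difflib import SequenceMatcher

reference = "the cat sleeps on the sofa"   # desired translation
candidate = "the cat sleeps on the couch"  # model output, one word off

# All-or-nothing scoring: a single wrong word costs full credit.
exact = 1.0 if candidate == reference else 0.0

# Partial credit: similarity between the token sequences (0.0 to 1.0).
partial = SequenceMatcher(None, reference.split(), candidate.split()).ratio()

print(f"exact match:    {exact:.2f}")    # 0.00
print(f"partial credit: {partial:.2f}")  # ~0.83, reflecting that most of the output is right
```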
Scoring systems and benchmarking methods deserve additional study, Mitchell said. However, getting to a higher plane may prove difficult. For one thing, human subjectivity can creep into a scoring model—particularly those with components or factors subject to interpretation. For another, “machine learning systems can sometimes incorporate ‘shortcuts’—spurious statistical associations—to obtain high scores on benchmarks without possessing the understanding that the benchmark was supposed to measure,” she explained.
Indeed, a study conducted by researchers in Taiwan found that an AI system that performed nearly as well as humans was tapping statistical clues in the dataset to achieve its high scores.e It performed this feat by keying on certain words, such as “not,” and their position in a sentence. Once the researchers eliminated those cues, performance plummeted to roughly chance level. In the end, the high scores were merely an illusion, an artifact of the benchmark and the scoring method.
Truth and Consequences
Some in the scientific community contend that if a system gets to the right answer, it doesn’t matter how or why. Others, such as Tianshi Li, an assistant professor in the Khoury College of Computer Sciences at Northeastern University, believe a lack of explainability and transparency in LLMs and other AI systems undermines public trust, particularly in critical areas such as data security, privacy, and public safety. “Transparency is in dire need at many layers,” she said.
Yet despite questions about unpredictability that could arise from emergent systems, the scientific community isn’t completely sold on the idea that emergence is merely a function of measurements, metrics, and scoring systems. Some data scientists argue that even with more robust tools and techniques, sudden jumps in knowledge likely will continue to occur when LLMs reach a critical size. They argue the research conducted by the Stanford team does not fully account for emergence.
The Stanford researchers concede that further exploration of the topic is needed, along with a deeper understanding of other dimensions of model behavior, such as generalization, robustness, and interpretability. In 2024, the trio co-authored another paper that further explored how predictable downstream performance is as models scale.f They found that predictability typically degrades, even under a more nuanced multiple-choice scoring system, when scoring is based only on the correct answer and does not incorporate the incorrect choices.
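One way to read “incorporating the incorrect choices” is to score the full probability distribution a model places over the answer choices, rather than only whether its top choice is right. The sketch below uses a Brier-style score for this purpose; the probabilities are invented, and this is an illustration of the general idea, not the paper’s exact method.

```python
import numpy as np

# Hypothetical probabilities two models assign to four answer choices;
# choice 0 is the correct one in both cases.
confident = np.array([0.70, 0.10, 0.10, 0.10])
hedging   = np.array([0.28, 0.26, 0.24, 0.22])
target    = np.array([1.0, 0.0, 0.0, 0.0])

for name, probs in [("confident", confident), ("hedging", hedging)]:
    accuracy = float(np.argmax(probs) == 0)        # looks only at the top choice
    brier = float(np.mean((probs - target) ** 2))  # penalizes mass on wrong choices
    print(f"{name}: accuracy={accuracy:.0f}, Brier score={brier:.3f} (lower is better)")
# Plain accuracy calls both answers equally "correct"; the Brier score
# separates a confident answer from a near-coin-flip by also using the
# probabilities assigned to the incorrect choices.
```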
A deeper understanding of LLM behavior and its real-world impacts could change the way data scientists gauge results and build models. If apparent leaps in performance are partly artifacts of the measurement techniques used, then it is vital to consider factors such as model size, task complexity, and the choice of metric when building an LLM. With a better grasp of “sharpness,” it is possible to build better models.
On the other hand, if emergence is a real thing, there’s a need to understand how, when, and why it occurs. This could help avoid unpredictable behavior and possibly catastrophic outcomes.
“If we want to build the best possible models, we have to understand how they work and why they do the things they do,” Mitchell said. “We have to make them both robust and safe.”
Further Reading
- Schaeffer, R., Miranda, B., and Koyejo, S.
Are Emergent Abilities of Large Language Models a Mirage? Advances in Neural Information Processing Systems 36 (NeurIPS 2023), https://proceedings.neurips.cc/paper_files/paper/2023/file/adc98a266f45005c403b8311ca7e8bd7-Paper-Conference.pdf
- Schaeffer, R., Schoelkopf, H., Miranda, B., Mukobi, G., Madan, V., Ibrahim, A., Bradley, H., Biderman, S., and Koyejo, S.
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? June 6, 2024, https://arxiv.org/abs/2406.04391
- Numerous Authors.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research, May 2022, https://arxiv.org/abs/2206.04615
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., et al.
Emergent Abilities of Large Language Models, Transactions on Machine Learning Research, October 26, 2022, https://arxiv.org/abs/2206.07682
- Niven, T. and Kao, H.-Y.
Probing Neural Network Comprehension of Natural Language Arguments. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy, July 28–August 2, 2019, https://aclanthology.org/P19-1459.pdf